PingtheSemanticWeb.com: a pinging service for the Semantic Web

 

    One of the problems I found with the semantic web is how difficult it can be to find new and fresh data. Recently I was confronted with a problem: how to notify a web service that Talk Digger had new and updated semantic web data ready to be crawled (SIOC and FOAF ontologies, for people familiar with semantic web technologies).

Then I asked myself why nobody, to my knowledge, had developed a weblogs.com- or pingerati.net-style pinging service for semantic web documents. The approach has already proved that it works, considering that weblogs.com archives and exports millions of pings every day.

 

What is PingtheSemanticWeb.com?

PingtheSemanticWeb.com is a web service archiving the location of recently created or updated FOAF, DOAP and SIOC RDF documents on the Web. When one of those documents is updated, its author can notify the service that the document has been updated by pinging it with the URL of the document.

PingtheSemanticWeb.com is used by crawlers and other types of software agents to know when and where the latest updated FOAF, DOAP and SIOC documents can be found. They request a list of recently updated documents as a starting point for crawling the semantic web.
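For illustration, here is a minimal PHP sketch of how a crawler might use such a list as a starting point. The export URL and the plain-text, one-URL-per-line format are assumptions made for this example only; the real export files and their formats are described on pingthesemanticweb.com.

<?php
// Placeholder export URL: check pingthesemanticweb.com for the real
// export files and their formats. Requires allow_url_fopen to be enabled.
$list = file_get_contents("http://pingthesemanticweb.com/export/latest.txt");

foreach (explode("\n", trim($list)) as $url) {
    // Fetch each recently pinged document and hand it to an RDF parser
    // to extract the FOAF, DOAP or SIOC data it contains.
    $rdf = file_get_contents(trim($url));
    // ... parse $rdf ...
}
?>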

More information about supported ontologies can be found here:

 

Using the Bookmarklet

I strongly suggest that anyone use pingthesemanticweb.com’s bookmarklet. You only have to install the bookmarklet in your browser and click on it from any Web page. If a FOAF, SIOC or DOAP document is found, it will immediately be indexed by the pinging service.

It is the easiest way for anyone to help PingtheSemanticWeb.com find new documents to index.

 

How to install the Bookmarklet

Read the instructions on how to install the Bookmarklet (Browser Button) into your browser.

 

How does it work?

You can use the URL of either an HTML or an RDF document when pinging the PingtheSemanticWeb.com web service. If the service finds that the URL points to an HTML document, it checks whether it can find a link to a FOAF, DOAP or SIOC RDF document. If it finds one, it follows the link and checks the RDF document to see whether SIOC, DOAP and/or FOAF elements are defined in it. If they are, it archives the ping and makes it available to crawlers through the export files. Otherwise, it discards the ping.
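To make the pinging step concrete, here is a minimal PHP sketch. The REST-style endpoint taking a url parameter is an assumption made for this example, and the document URL is a placeholder; check the service’s documentation for the real ping URL before using it.

<?php
// Assumed REST-style ping endpoint and a placeholder document URL;
// verify both against the service's documentation.
$document = "http://www.example.com/sioc.rdf";
$ping     = "http://pingthesemanticweb.com/rest/?url=" . urlencode($document);

// A simple HTTP GET is enough to notify the service
// (requires allow_url_fopen to be enabled).
$response = file_get_contents($ping);
echo $response;
?>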

 

 

Custom needs, suggestions and bug reports

This service is new, so if you have any suggestions to improve it, if you find any bugs while pinging URLs or importing ping lists, or if you have any custom needs for your semantic web crawler or software agents, please contact me by email [fred ( at ) fgiasson.com]; that way I’ll be able to help you out as quickly as possible.


Supervised Search Indexing with Yahoo! Search Builder

Yahoo! Search Builder: The idea is great: the power of Yahoo!’s search engine and its colossal database, with all the advantages (no spam) of supervised indexing. In fact, niche networks (groups of people) will probably use this new service to build search engines for their niche domains and will meticulously add new crawlable sources over time. That way, no spam website will be indexed, the results will be much more accurate and useful, and users will spend less time searching.

Other search engines [Rollyo and Eurekster] already do this. The main difference is that they developed “social” features around the search results and Yahoo! didn’t. Some people think that is sad, but personally I think Yahoo! just doesn’t care. Social features are cool, but only for some purposes, not for everything. For me, though, the big difference is Yahoo!’s database compared to Rollyo’s and Eurekster’s.


Visualizing Web conversations using Talk Digger

In this article, I will talk about the recent developments in the alpha version of Talk Digger and how it can be used to visualize the interactions between the conversations it tracks.

 

Recent developments

Yesterday I started crawling most of the URLs submitted to Talk Digger in the past 6 months and indexing all the results in its new database.

Right now Talk Digger is tracking about 2,500 URLs (so it has about 2,500 conversations), and it has indexed about 80,000 sources (other web pages linking to these 2,500 conversations).

These numbers are not big, but the preliminary results are quite impressive (in my humble opinion). In fact, each time new URLs were tracked, new conversations were created and new sources were indexed, I discovered new ways to use it, new things to discover, new ways to visualize relations between the data, and so on: patterns were starting to emerge.

 

Visualizing interactions between Web conversations

In only 30 minutes of conversation browsing, I noticed 7 interesting use cases (patterns) in the system. I will present each of them below and describe what is happening.

I added two visualization tools in the right sidebar of each conversation page.

 

 

The first tool

The first tool helps users answer these two questions:

  • What are the other conversations that are talking about the current one?
  • What are the conversations the current one is talking about?

 

 

The current conversation is the one in light-blue, in the middle of the panel: “Talk Digger: find, follow and join discussions evolving on the Internet”.

From there, I know that the “Talk Digger: find, follow and join discussions evolving on the Internet” conversation is talking with (is in relation with) the other conversation “Frédérick Giasson – Computer scientist, software developer and consultant”.

It makes sense considering that I am the creator of Talk Digger and that the conversation “Frédérick Giasson – Computer scientist, software developer and consultant” is created by the URL of my personal web page.

I can also see that the conversations “3spots”, “Library clips”, “Digg Tools” and “decor8” are also in relation with the current one.

That way, I can easily visualize the relationship between the conversations tracked by Talk Digger.

 

The second tool

The second tool helps users see which other conversations tracked by Talk Digger come from the same source (URL).

 

 

From this panel, I know that Talk Digger is tracking two other conversations closely related to the current one: “Talk Digger Tools: Bookmarklet” and “Talk Digger Tour: Use the bookmarklet”.

In reality, these two other conversations are two different pages from the same domain name: talkdigger.com.

Okay, now it is time to look at the use cases to understand how these two tools can be used.

 

Use case #1: A normal blog or personal webpage.

This is the case of a conversation evolving around a single blog (or personal web page) and its interactions with other conversations:

 

 

In this example, the current conversation is the one of my personal web page.

What is interesting here is that we can see how it relates to itself. We can see that from my main page, I link to two other pages that have their conversations tracked by Talk Digger.

Also, I see that “jotsheet – blog o’ tom Sherman” also has a relation with me. In fact, Tom Sherman is a long-time user of Talk Digger and has talked about it in many of his blog posts.

 

 

I can also see other pages, from the same domain name, that have conversations tracked by Talk Digger.

The difference between these results and the ones above is that these pages do not necessarily link to one another (as opposed to the relations above).

 

Use case #2: Discovering the relation between a web page and its blog

 

In this example, I found the relation between a normal website (Library Law) and its blog (LibraryLaw Blog). What is interesting is that if you go to the Library Law web site, its blog is not clearly displayed. Here, however, the relation between the two is clearly apparent.

 

Use case #3: Topic specific blogs and web sites.

Another interesting pattern is the one created by topic-specific blogs and web sites.

 

 

In this example, I used the Micro Persuasion blog written by Steve Rubel. This blog focuses on Web 2.0 news. As you can see, the “Micro Persuasion” conversation is in relation with (talks about) the conversations of other Web 2.0 services like “del.icio.us”, “Rollyo” and “Netvibes”.

So the relations here are topic-centered.

 

Use case #4: Egocentric blogger.

This use case is fascinating because it shows how someone’s own content can relate to itself.

 

 

In this example, Robert Sanzalone, the writer behind the Pacific IT blog, started tracking conversations for many of his blog posts. That way, we can easily visualize how one post relates to the others.

 

Use case #5: Who cares about my photos?

Some people also care about what others say about their photos.

 

 

If we check the conversation evolving around nattu’s Flickr photo album, we will see two things:

  1. That the conversation created by this photo album is in relation with another conversation tracked by Talk Digger.
  2. That many other people care about the conversations evolving around other people’s photo albums.

 

Use case #6: In the news

Other people like to know what conversation is evolving around specific pieces of news.

 

 

This is really interesting. We have a piece of news from ZDNet called “The new meaning of programming”. We instantly know that it relates to another conversation called “SocialNets & The Power of The URL”.

We also know that later, other pieces of news talked about it: “Mark Cuban is Wrong”, etc.

It is really interesting to find out how news items relate to one another.

 

Use case #7: Online communities’ users.

Other people like to know what conversation is evolving around their online persona on community web sites like MySpace and LiveJournal.

 

 

In this example, we want to see the conversation about the user “2220s” on MySpace. As we can see, 22-20s’s LiveJournal is talking about him.

We can also see a list of conversations evolving around many other MySpaces users’ page.

 

Conclusion

As we saw, depending on the source (URL), many different relationship patterns can emerge from Talk Digger’s conversations.

These preliminary results are quite exciting considering that I only started crawling URLs yesterday. I think the infrastructure I developed over the past months is promising; the next steps are to continue crawling URLs and to get users using it.

 

Subscribe to the Alpha version of Talk Digger

If you would like to test these features, you can always subscribe for a user account. The next round of account creation is planned for mid-August.


Norvig(Google) and Berners-Lee on the Semantic Web at AAAI 06

Many people are talking about that piece of news (Google exec challenges Berners-Lee): some say that Tim is right; others say that Peter is right. Neither is simply right or wrong; everything depends on your position in the environment created by the Semantic Web.

Everybody knows Tim Berners-Lee, but everybody should also know that Peter Norvig is not a second-class citizen. He wrote, with Stuart Russell, probably the best and most comprehensive book in the field of Artificial Intelligence, he is the director of research at Google, and so on.

The best blog post I have read on the subject, and the one that sums up my point of view really well, is the one written by Danny Ayers: Incompetents Revolt!

As reported by the CNet article:

 

Peter said:

“What I get a lot is: ‘Why are you against the Semantic Web?’ I am not against the Semantic Web. But from Google’s point of view, there are a few things you need to overcome, incompetence being the first,” Norvig said. Norvig clarified that it was not Berners-Lee or his group that he was referring to as incompetent, but the general user. […] We deal with millions of Web masters who can’t configure a server, can’t write HTML. It’s hard for them to go to the next step.”

 

Most of what I read about that declaration was about the “incompetence of users toward Semantic Web technologies”. However, I think the most important point here is that Peter takes the time to say: as the director of research at Google, a multi-billion-dollar company, I have some reservations about the Semantic Web.

Google has some reservations about it, but why? For technical reasons? For business vision? Something else? I don’t know, and they probably don’t know either. Everybody fears the unknown; why wouldn’t Google? They do, probably because they can’t grasp what is at stake for their company, just like everybody else in the world.

 

Peter said:

“The second problem is competition. Some commercial providers say, ‘I’m the leader. Why should I standardize?’ The third problem is one of deception. We deal every day with people who try to rank higher in the results and then try to sell someone Viagra when that’s not what they are looking for. With less human oversight with the Semantic Web, we are worried about it being easier to be deceptive,” Norvig said.

 

Danny wrote:

“Competition and standardization – yes, certainly issues for the Web. But the companies that thrive in this environment tend to be the ones that embrace open standards. The fact is that the rest of the world is likely to be bigger than any leader. Respect the long tail.”

 

I add:

Hell yes he is right! If I put myself in the shoes of any shopkeeper, restaurant owner, etc., do I want people to have semantic access to my information [in these cases: merchandise prices, delivery procedures, etc.]? Hell yes I do! However, if I put myself in the shoes of a Google exec, do I? I am not so sure… give me some time, please, so I can rework my business plan accordingly.

 

Later Tim said in answer to Peter:

“Berners-Lee agreed with Norvig that deception on the Internet is a problem, but he argued that part of the Semantic Web is about identifying the originator of information, and identifying why the information can be trusted, not just the content of the information itself.”

 

Yesterday I wrote on the SIOC Google Group that I don’t think the semweb will be crawled the way Google crawls current websites. I think the first step will be to use semweb technologies to let web services interact with trusted sources of information. From there, networks of trusted sources will emerge, and so on.

I think people tend to forget the whole “trust” layer of the semweb when they talk about tricking semweb agents or search engines (in fact, trust relationships will be explicit or inferred). Think about memetrackers like techmeme.com: the system started with a list of trusted bloggers and news sites, and expanded its list by adding trusted sources from them, and so on.

But naturally, once more, Danny’s writing summarizes the whole point much better:

But anyhow, is Norvig really suggesting that Google are currently solving deception issues via human oversight? Whatever, proof and trust are key considerations for Semantic Web systems. The foundations are built into the theory, logic provides one route to addressing the issues. Even if that formalism is ignored and statistical approaches taken, the use of common languages makes data more amenable to analysis. Probabilities and uncertainties can be expressed as logical statements on top of the base languages. However you approach it, the answer is mo’ better data, not less.

 

Finally, it is neither good nor bad; it only depends on your position in the environment such a Web would create. I think we are inevitably going in that direction; the only thing is that some people will need more time than others.


Hack for the encoding of a URL into another URL problem with Apache and mod_rewrite

 

While configuring my new dedicated server to support the new generation of Talk Digger, I ran into a really strange bug that emerged from the interaction between urlencode(), Apache and mod_rewrite.

It took me about a working day to figure out what the bug was, where it could come from, search for information to find out whether I was the only one on earth having it, fix it, and so on.

I found out that I was not the only one having that bug, but I never found any reliable source of information on how to fix it. Because I am using open source software, I think it is my duty to post the fix somewhere on the Web, and this “somewhere” is my blog. Normally I do not post such technical articles, but considering that it is an interesting bug, that many people run into it, and that there is no central source of information explaining how to fix it from A to Z, I decided to take a couple of minutes to write this article.

 

What is the context?

I have to encode a URL into another URL.

For example, I would like to encode this URL:

www.test.com/test.com?test=1&test2=2

into this other URL:

www.foo.com/directory/www.test.com/test.com?test=1&test2=2

To do that, I have to encode the first URL; the result would be:

www.foo.com/directory/www.test.com%2Ftest.com?test=1&test2=2

 

What is the bug?

The problem is that when you try to apply RewriteRule(s) to such URLs using Apache (1.3) and the mod_rewrite module, mod_rewrite will not be able to match any of its rules against the URL.

For example, if I have a rule like:

RewriteRule ^directory/(.*)/?$ directory/redirect.php?url=$1 [L]

mod_rewrite will not be able to match the rule against the URL even though it should match. The problem, as mentioned above, is how URLs are encoded and decoded between Apache and mod_rewrite.

 

The explanation

The problem seems to be that the URL passed to mod_rewrite is prematurely decoded. With a single encoding (urlencode() in PHP) of the URL, the RewriteRule(s) will not be matched if the “%2F” sequence is in the URL; and even when there is no “%2F” in the URL, the substitution will not necessarily be completed correctly.

After identifying the problem, I found the corresponding bug entry: ASF Bugzilla Bug 34602.

It is the best source I found, but it was not complete enough to resolve the problem I had.

 

The simplest hack, but the ugliest!

The simplest fix is to double-encode the URL you want to include in your other URL (for example, in PHP I would encode my URL with urlencode(urlencode("www.test.com/test.com?test=1&test2=2"));). That way, everything will work fine with mod_rewrite and it will match the rule.

The problem with that easy fix is that it adds a lot of ugly characters to your URL. Personally I find that unacceptable, especially when we know that mod_rewrite is there to create beautiful URLs!
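Here is a small PHP sketch of this double-encoding workaround. It only illustrates the idea described above; how many levels of decoding remain when the value reaches your script depends on your Apache and mod_rewrite configuration, so verify it (for example with var_dump($_GET)) before relying on it.

<?php
// Double-encode the URL before appending it to the "pretty" URL:
// "/" becomes "%2F" after the first pass and "%252F" after the second.
$target  = "www.test.com/test.com?test=1&test2=2";
$segment = urlencode(urlencode($target));
$link    = "www.foo.com/directory/" . $segment;

// On the receiving side (redirect.php in the rule above), Apache and PHP
// each decode the value once on the way in; depending on your setup,
// $_GET['url'] may already be fully decoded or may need one extra urldecode().
?>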

 

The second hack

The second fix is to re-encode the URL directly in the mod_rewrite module. We will re-encode the whole URL except for the “%2F” sequence (because that one is a glitch (bug?) not related to mod_rewrite but probably to Apache itself). What you have to do is create your own urlencode() method that encodes every character except “/”. That way everything will work as usual, except that the “/” character will not be encoded.

 

Security related to that hack

I don’t think this fix adds a security hole, if we think about code injection into URLs or other possible holes, but I’ll have to analyze that point further to make sure.

 

Future work

In the future it would be great to find where in Apache the “/” (%2F) character is prematurely decoded, or where we could re-encode it just before it is passed to mod_rewrite.

 

THE HACK

Okay, here is how to install that hack on your web server.

I have only tested it with Apache 1.3.36 and mod_rewrite. I have no idea whether the same problem occurs with Apache 2.

 

Step #1

The first step is to create your own url_encode() function that encodes a URL without encoding the “/” character. A simple PHP function that does the job could be (it is really not efficient, but it will do for now):

function url_encode($url)
{
     // Encode everything with urlencode(), then put the "/" characters
     // back, since they are the only ones we want to leave untouched.
     return str_replace("%2F", "/", urlencode($url));
}
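For illustration, here is how the combined URL from the earlier example would be built with this function (a sketch using the same placeholder domain names as above):

<?php
// Only the "/" stays literal; "?", "&" and "=" are still percent-encoded,
// which keeps the whole original URL in the path part of the new URL
// instead of spilling into its query string.
$link = "www.foo.com/directory/" . url_encode("www.test.com/test.com?test=1&test2=2");
// $link is now:
// www.foo.com/directory/www.test.com/test.com%3Ftest%3D1%26test2%3D2
?>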

 

Step #2

The second step is to change the code in mod_rewrite.c so that it re-encodes the URL.

You have to replace the mod_rewrite.c file in Apache’s source code, at [/apache_1.3.36/src/modules/standard/], with this one:

The hacked mod_rewrite.c file

 

Step #3

Then you have to recompile and reinstall your Apache web server.

 

Finished

Everything should now work fine. In your server-side scripts (PHP, for example), encode your URLs with the new url_encode() function; mod_rewrite will then match the rules as expected.

 

The last word

I hope this little tutorial will help you if you run into the same problem I had. Please point out any errors, upgrades or code enhancements in the comment section of this post; it will be really appreciated!
