Reaching at least 600 000 people with 19 contacts

Sergio Fernández and Iván Frade recently started a really interesting experiment called Futil. It is a small computer program that takes Sergio’s FOAF profile as its seed and discovers new people by following its relations (the friend of a friend of a friend, etc.). The experiment is to find out how many people you can reach starting only from the people you know. So far, Sergio’s Futil program has found about 600 000 people. I guess that it will discover around 2 500 000 people before it finishes.

The experiment is interesting in many ways. It gives some insight into how people are connected and, even more important on today’s Web, how communities of users interact with the Web.

The graph below shows how Futil is discovering these profiles. The Y-axis represents the size of the pool of people it still has to fetch from the Web, and the X-axis is the number of profiles it has fetched so far.
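To make the idea concrete, here is a minimal sketch of such a crawl in Python. It is not Futil’s actual code: the function name `crawl`, the in-memory `knows` dictionary and all the identifiers in it are assumptions made up for illustration, and a real crawler would fetch and parse FOAF files from the Web instead. After each fetched profile it records the two quantities plotted on the graph: the number of profiles fetched so far and the size of the pool still waiting.

```python
from collections import deque


def crawl(seed, knows):
    """Breadth-first crawl starting from one seed person.

    `knows` is a hypothetical stand-in for the Web: it maps a person
    to the people listed in that person's profile.
    Returns the set of discovered people and a list of
    (profiles fetched so far, size of the pending pool) pairs.
    """
    seen = {seed}
    pending = deque([seed])
    history = []
    fetched = 0
    while pending:
        person = pending.popleft()
        fetched += 1
        for friend in knows.get(person, []):
            if friend not in seen:
                seen.add(friend)
                pending.append(friend)
        history.append((fetched, len(pending)))
    return seen, history


# Toy data, invented for this example.
knows = {
    "sergio": ["ana", "ben"],
    "ana": ["tribe:1"],
    "ben": [],
    "tribe:1": ["tribe:2", "tribe:3"],
    "tribe:2": ["tribe:1", "tribe:3"],
    "tribe:3": ["tribe:1"],
    "lj:1": ["lj:2"],  # nobody reachable from the seed links here
}

people, history = crawl("sergio", knows)
for fetched, pool in history:
    print(fetched, pool)
```

In this toy data nobody reachable from the seed links to `lj:1`, so the crawl never finds it, and once the crawl enters the `tribe:` group it only discovers more `tribe:` members.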

The first 50 000 people Futil found came from different places on the Web: a personal web page, the web page of an organization, etc. Then, eventually, Futil found a couple of links to people belonging to an online community called Tribe. People of that community only link to other people of the same community. What is interesting is that as soon as Futil started to crawl a couple of people from that community, it eventually found all 200 000 people belonging to it. Now the same thing is happening with another, much bigger community called LiveJournal, with about 2 million users.

Why did Futil only crawl people from the same community? The answer is easy: because these communities are closed. They don’t interact with the rest of the Web, so a user can only link to other users of the same community.

How can a community be opened so that its users can interact with users of other online communities?

A first step would be to let people describe their relationships with people outside of their community.

One example of such an online community is Talk Digger. This system lets its users import (and synchronize) their FOAF profile from another location on the Web. It also lets its users define their relationships with people outside of the community. For example, a user can say that he knows persons X and Y on Talk Digger, but he can also specify that he knows a person Z from outside of the community, or from another online community.
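As a rough illustration of what such a relationship looks like in FOAF, the sketch below builds a tiny profile in which one contact lives on the same site and another on a different one. It is only an assumption of how this could be done, not Talk Digger’s implementation: the helper `foaf_profile`, the names and the `example` URLs are all invented. The vocabulary terms (`foaf:Person`, `foaf:name`, `foaf:knows`, `rdfs:seeAlso`) are the standard ones.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Use the usual prefixes when the document is serialized.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("foaf", FOAF)):
    ET.register_namespace(prefix, uri)


def foaf_profile(name, contacts):
    """Build a small FOAF document as an RDF/XML string.

    `contacts` is a list of (name, profile_url) pairs; the profile URL
    may point to any site, inside or outside the community.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    me = ET.SubElement(root, f"{{{FOAF}}}Person")
    ET.SubElement(me, f"{{{FOAF}}}name").text = name
    for contact_name, profile_url in contacts:
        knows = ET.SubElement(me, f"{{{FOAF}}}knows")
        friend = ET.SubElement(knows, f"{{{FOAF}}}Person")
        ET.SubElement(friend, f"{{{FOAF}}}name").text = contact_name
        # rdfs:seeAlso tells a crawler where the friend's own profile is.
        ET.SubElement(
            friend, f"{{{RDFS}}}seeAlso", {f"{{{RDF}}}resource": profile_url}
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical users and URLs, for illustration only.
xml_text = foaf_profile(
    "User A",
    [
        ("Person X", "http://community-a.example/people/x/foaf.rdf"),
        ("Person Z", "http://community-b.example/people/z/foaf.rdf"),
    ],
)
print(xml_text)
```

The second contact is the interesting one: its `rdfs:seeAlso` link points to another site, which is exactly the kind of link a crawler needs to leave a community.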

In fact, if other online communities added such a feature to their systems, inter-community communication and relationships would become possible.

You can read an old blog post that explains how Talk Digger is handling FOAF profiles.

Why should online community systems open themselves?

Why will a user use one online community and not another? It depends; I would say that it principally depends on the topic(s) of the community, the people he knows in that community, and the user interface of that community (after all, one interface doesn’t work for everybody).

So, why shouldn’t online communities let their users interact with users of other online communities?

I think it is an error caused by the fear of losing users, and it explains why Futil behaved that way: current online communities don’t let their users interact with people from outside the community.

Futil is pinging Pingthesemanticweb.com as well

Each time Futil discovers a new FOAF profile, it pings Pingthesemanticweb.com. So far it has pinged about 300 000 new FOAF profiles. It is a good example of how this semantic web pinging service can be used.

Now everybody has access to these new FOAF files. The best thing would be for such online communities (like Tribe.net and Livejournal.com) to ping the service each time there is a new user, or each time a user updates his profile. But in the meantime, independent crawlers such as Futil do the job very well.
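A ping of this kind is usually just an HTTP request that carries the address of the new or updated document. The sketch below shows the general shape under stated assumptions: the endpoint `http://ping-service.example/rest/` and the `url` query parameter are placeholders, not the documented interface of Pingthesemanticweb.com, and the function names are invented for the example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint (an assumption); replace it with the real
# address of the pinging service you want to notify.
PING_ENDPOINT = "http://ping-service.example/rest/"


def build_ping_url(profile_url, endpoint=PING_ENDPOINT):
    """Return the request URL announcing `profile_url` to the service."""
    return endpoint + "?" + urlencode({"url": profile_url})


def ping(profile_url, endpoint=PING_ENDPOINT, timeout=10):
    """Send one ping; True if the service answered with HTTP 200."""
    with urlopen(build_ping_url(profile_url, endpoint), timeout=timeout) as response:
        return response.status == 200


def ping_all(lines, send=ping):
    """Ping every non-empty line of an iterable of profile URLs."""
    results = {}
    for line in lines:
        url = line.strip()
        if url:
            results[url] = send(url)
    return results
```

A crawler would call `ping` once per newly discovered profile, while a community site could call it whenever an account is created or updated. `ping_all` accepts any iterable of lines, for example an open text file with one profile URL per line.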

Conclusion

What I wish for now is that future online communities start letting their users interact with users from other communities. A good start in that direction would be to let them describe their relationships not only with people of the same community, but also with people from outside of it. Then meta-communities should start to emerge.

13 thoughts on “Reaching at least 600 000 people with 19 contacts”

Vaclav Synacek

January 22, 2007 — 8:36 am

Nice post about an interesting project.

However, there is one more reason why a user will use one online community and not another: privacy. An example would be LinkedIn.com. This network lets users decide how much of their information is considered public and what is only to be seen by direct friends. The connections are also visible only to close friends, so you cannot see who your boss is playing golf with until you are one of them.

I personally think there will always be people with reasons for not making all personal data and contacts public, and so there will be social networks that will never open up. It is worth trying, but I don’t believe this can get internet-wide.

On the other hand, analysing one’s contacts spread over several networks might be possible even within closed networks. SIMILE’s project Piggy Bank (http://simile.mit.edu/wiki/Piggy_Bank) has scrapers for the Orkut and LinkedIn social networks, which both scrape the data to FOAF. Combining these data in one’s personal Piggy Bank is close to making a private network of friends across the networks while respecting contacts’ privacy. Hopefully someone will soon contribute similar scrapers for other social networks.

Fred

January 22, 2007 — 4:38 pm

Hi Mr. Synacek,

Yeah, totally agree here. Some networks (geni.com is a new one about family trees) should keep some or all information about their users private: this is non-negotiable. But sometimes, some people want some of their information to be public.

No, but if we take tribe.net and livejournal.com, it can be done 🙂

I like your example of Piggy Bank. Such a thing can surely be done. A specialized web page scraper could be developed to get all the information available for all Orkut users even if they don’t make the information available in RDF format. And this could be done for virtually any web site. But I wouldn’t like to be the one who has to develop them 😉

By the way, would it be possible for you (I think you are the developer of that scraper, no?) to extend it beyond Piggy Bank’s environment? That is, to create a crawler that would generate the FOAF profile of each Orkut user?

Salutations,

Fred

Dan Libby

January 22, 2007 — 5:14 pm

Hi, you might be interested in a project of mine, videntity.org. We allow anyone with an openid identity *from all over the web* to create a profile and then create relationships (XFN) with other online identities. All of it then gets exported as foaf.

A description of the social networking aspect of the site is here:
http://wiki.www.videntity.org/wiki/Social_Networking_Unlimited

btw, if this blog used openid, I could just comment using my pre-existing openid identity and preferences.

Vaclav Synacek

January 23, 2007 — 6:42 am

Hi everybody!

To Fred: I’m the original author of the LinkedIn scraper. The Orkut scraper was done by Ben Hyde. I’m not an Orkut user and thus I have never seen the Orkut scraper in action.

The answer to your question of whether these scrapers can be adapted to work outside of Piggy Bank is a bit more difficult. Generally it should be possible: they are standard JavaScript files with dependencies on the Firefox XPath processor and some Piggy Bank specific RDF processing calls (not really hard to replace), they are open source, and anybody is free to do it. However, both scripts are meant to be personal tools that work inside your browser; they only work after you log in to the specific social network. They can only scrape what you as a user are allowed to see. So in the case of LinkedIn: every user can see as far as his friends’ friends’ profiles, not further, so Piggy Bank can also scrape only this far. The users of LinkedIn agreed to share their profiles with direct friends and their friends, but nobody else. This policy is hard coded into the web interface of LinkedIn, and Piggy Bank, being a browser plugin, cannot break out of these policy rules. No magic here. This is what I meant by “while respecting contacts’ privacy”.

Conclusion: while porting some scrapers outside of the Piggy Bank environment might be a possible and interesting thing to do, I don’t see a point in porting these two particular scrapers, as they rely on logging in to the social networks and thus will remain personal tools anyway.

To Dan Libby: videntity is an interesting beginning of a project. I’m not sure I got the whole idea, but I think of it as an ‘OpenID provider with XFN/FOAF file hosting and web hosting the myspace way’. This might be a more open alternative to myspace and the like, but I don’t see much to offer to people who have proper web hosting where they can get OpenID and make their own FOAF or XFN files. So I can’t wait for the ‘future plans’ being implemented.

Danny

January 23, 2007 — 8:03 pm

I’m not sure if this relates but another reason I can think of where an online community wouldn’t want to let users outside of the community interact is scientific research. What I mean is, a few years ago, I heard on the radio that the scientific community wasn’t happy with the Internet now that it became hugely popular. It made their research difficult because scientists had to spend a lot of time filtering out the information from non-scientific people (example, ads, conspiracy theories, forums, flame wars, 13 year old chatspeak, etc.). The program said that they were planning to launch an exclusive type of Internet for the sole purpose of scientific research. The original use of the Internet! I don’t know if it actually happened.

Fred

January 23, 2007 — 9:07 pm

Dan Libby: I just took a look at videntity.org: it seems great! What interests me most here is the fact that each profile is exported using FOAF. So, would it be possible to get a list of FOAF profiles from videntity? That way I would include them in Pingthesemanticweb.com. Also, would it be possible for you to ping it each time a new user creates an account, or each time a user updates his profile? That way, other people could do cool things with the data of your users.

I am not sure that I will support OpenID with this version of the blog, since I would have to put too much time into it and there is no OpenID plugin available for this version of b2Evolution 🙂

Vaclav Synacek: Yeah well, if it is written in JS, then it couldn’t be that useful for such a project 😉

By the way, I was wondering, without having the time to investigate further: is the data in the Semantic Bank available to the public? If so, it would be great if the data could be indexed by PTSW and if the Semantic Bank could ping it each time a new or updated file is indexed into the bank.

Danny: yeah, you are right. But there are specialized databases of information (normally in universities) that filter all that information for them. It is sure that if you try to find all your information on Google, you will have to spend a lot of time filtering out all the crap 🙂

Take care all,

Salutations,

Fred

Vaclav Synacek

January 24, 2007 — 6:01 am

Hi Fred,
as to your question about PTSW accessing Piggy Bank/Semantic Bank:
When the data is scraped by any scraper, it is saved to Piggy Bank’s database, which runs inside one’s browser and is accessible over HTTP on some high port. This data is on users’ computers, so it would be quite hard for PTSW spiders to access. However, users may also publish some of the data from their Piggy Banks to public Semantic Banks they have accounts in. These Semantic Banks are installations of the Longwell project (http://simile.mit.edu/wiki/Longwell). Their not so long list is at http://simile.mit.edu/wiki/List_of_banks . The general bank, free for everybody to use (http://simile.mit.edu/bank/), contains nearly 500 FOAF People and hundreds of other data.

Getting the RDF from the banks is trivial: just follow the alternate link. I don’t know about pinging PTSW on data change. Ask the SIMILE developers about that.

It might be very interesting if you would set up a semantic bank yourself. And promote publishing Piggy Bank data to your bank. This way you might get a lot of RDF data scraped over the Internet by Piggy Bank users for your project.

Even better would be if all the data indexed by PTSW would be accessible through Longwell faceted browser interface. This would be a semantic web killer app. But this is more of a dream than a near future project, I guess.

Fred

January 25, 2007 — 1:19 pm

Hi Vaclav,

Yeah, I was talking about the public Semantic Bank. I will take a deeper look at it later.

Yeah well, putting Longwell over PTSW could be a good idea, but I have other plans that will roll out later in February (so keep checking this blog 😉 ). In fact, Longwell is nice, but my mother, my friends, etc. don’t like it: too complex, needs too much knowledge to use, etc. So this is the reason why I have other plans.

Thanks,

Take care,

Fred

Dan Libby

January 25, 2007 — 7:01 pm

Fred, I’ve begun pinging Pingthesemanticweb.com whenever a profile is added/updated on Videntity.org.

A full dump into foaf format would be a bit more coding work, as the files are programmatically generated at this time, not real on-disk files. I did however add tag in each profile page to aid with discoverability of the foaf files, so they could pretty easily be scraped/spidered starting with this directory page.

Dan Libby

January 25, 2007 — 7:03 pm

add tag should be “add a <link> tag”.

Fred

January 25, 2007 — 10:16 pm

Hi Dan!

Wow this is great! It has been fast 🙂

This is not a problem for the current list of profiles. In fact, the simplest way for me is to get a list of web pages and then crawl them (PTSW already tries to convert elements of HTML documents into RDF documents).

So if it would be possible for you to generate this list for me as a list of URLs separated by carriage returns, I could start to crawl them tomorrow or over the weekend. That way I would only have to feed it with them. (In fact it would be a small agent that reads the list and then pings PTSW with each URL; anybody could do that.)

Thanks!

Salutations,

Fred

Nishad H. kaippally

February 23, 2007 — 3:29 am

I was trying out talkdigger. I presume you have worked on it. It does not find any Malayalam blogs. There are at least a thousand blogs written in this language.

Please let me know why a large majority of Asian languages (which includes 9 Indian languages) are not part of your search results.

Cheers

Fred

February 23, 2007 — 8:52 am

Hi Nishad,

Well, Talk Digger is not a traditional search engine, even if you can search for keywords. In fact, the first goal of this search engine is to find who links to a specific webpage (your blog?).

What I would suggest, if you don’t find any Malayalam blogs, is to put the URL of one of them into talkdigger and then check who links to it. Then you can start browsing from that seed blog.

For example, here are the people linking to your blog:

http://www.talkdigger.com/conversations/mallu-ungle.blogspot.com

I hope you will find what you are searching for.

Take care,

Fred


Frederick Giasson