December 2005 – Frederick Giasson

New Talk Digger feature: Regional view of results

December 31, 2005 | May 21, 2006 | Frederick Giasson

I just released a new feature for Talk Digger. The idea came from a PR worker, David Jones, who wanted to see and sort results by region (country). This mattered to him because his client cared more about comments from people in its target markets than about comments from elsewhere. I found the feature essential for Talk Digger, not just for marketing and PR workers but for everybody. What I like about this idea is that it puts a touch of humanness into the dug-up conversations. It gives users a new metric for analyzing who is talking in a conversation. So I took the last two days to develop and release the feature.

What is this new feature all about? It is called Regional setting. This setting lets you enable the regional view of each result. If the feature is enabled (it is disabled by default), each result shows the flag of the country where the server hosting the resulting web page is located. This option is useful when you are trying to find people living in a specific country who talk about a URL. It is especially helpful for marketing and PR workers who have to do regional searches for their clients’ products.

How should the flags be interpreted? The flag beside a title shows the country where the web page is hosted. If a Japanese blogger hosts his blog in America, you will see the flag of the United States, unless he uses a country domain name (.jp) rather than a generic one (.com, .net, etc.). However, people generally use their country’s domain name, or at least host their web pages with a local web host. Given this, I would estimate that about 70% of the displayed flags represent the country where the creator of the result lives.
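The rule described above can be sketched in a few lines of Python. This is only one reading of the rule, not the actual Talk Digger code: the `CCTLD_TO_COUNTRY` table is a tiny hypothetical sample, and `hosting_country` stands for whatever country an IP lookup of the server would return.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny mapping from country-code TLDs to ISO codes.
CCTLD_TO_COUNTRY = {"jp": "JP", "ca": "CA", "fr": "FR", "uk": "GB"}

def flag_country(url, hosting_country):
    """Pick the country whose flag is shown beside a result.

    A country domain name (.jp, .ca, ...) wins; with a generic one
    (.com, .net, ...) we fall back to the country of the hosting server.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[tld]
    return hosting_country

# A Japanese blogger hosted in America: the domain decides which flag appears.
jp_flag = flag_country("http://example.jp/blog", hosting_country="US")
us_flag = flag_country("http://example.com/blog", hosting_country="US")
```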

If you enable the feature, you will be able to sort the incoming results by country. For example, here the first results will be the Canadian pages and all the others will be grouped by country.

I hope that you will find this new feature, another one born from my interaction with Talk Digger users, as useful as I do.
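As a rough illustration of that ordering (the sample records and the function name are made up for the example), a single stable sort is enough to put one country first and group the rest:

```python
# Hypothetical result records: (title, ISO country code shown as a flag).
results = [
    ("Post A", "US"),
    ("Post B", "CA"),
    ("Post C", "FR"),
    ("Post D", "CA"),
    ("Post E", "US"),
]

def sort_by_country(results, first_country):
    """Put results from first_country first, then group the others by country."""
    # sorted() is stable, so results keep their original order within a country.
    return sorted(results, key=lambda r: (r[1] != first_country, r[1]))

ordered = sort_by_country(results, "CA")
titles = [title for title, _ in ordered]
```

Here the Canadian pages come first, followed by the French and then the American ones.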

Happy New Year!

The business model of a Semantic Web service

December 27, 2005 | May 21, 2006 | Frederick Giasson

What could be the business model of a Semantic Web service? First of all, what is a Semantic Web service? We could define it as a web service that broadcasts its content using formats such as RDF or OWL. The content is formatted in such a way that software agents can easily read it and infer knowledge from it. The key principle of a Semantic Web service is its openness.

I was talking yesterday with one of the creators of the search engines used by Talk Digger. During our conversation he asked me:

Are you planning to share any revenues with the blog engines? Do you have a business model for Talk Digger?

His question was legitimate but the answer was short: no. No, I do not have any business model for Talk Digger. Right now it is an experiment; a small personal research project. I could share revenues, but I doubt they would interest anyone, considering that they barely pay the hosting fees.

However, I had already thought about it, and the question is a good one: what could be a business model for a service such as Talk Digger? Even more: what could be a business model for a Semantic Web service; a service that broadcasts its computations at large, without any fees (Talk Digger could become such a service… it is a wish, remember?).

I have no idea.

John Heilemann wrote in the New York Metro:

“Alan Murray wrote a column in the Wall Street Journal that called Google’s business model a new kind of feudalism: The peasants produce the content; Google makes the profits,”

That is right, and not just for Google. Search engines gather and index web content from everywhere. They even cache the content of entire web pages and publish it without any kind of permission (remember the “cache” feature on many search engines?). I recall that secret US military documents have been indexed by Google (just a rumor or a fact? I don’t remember). I also know that Google indexes the content of millions of books without any kind of permission.

So, the question remains: what could be the business model for Talk Digger? And if such a model generates revenues, should I share them with the search engine companies? Sure I should; they can ban the IP address of my crawler at any time. However, is that to their advantage? Considering the picture I painted of some current search engines, why couldn’t I scrape their web pages for some results? If we think about it, their results pages are web documents like any other web pages on the Internet. If they can scrape others’ web pages, why couldn’t I?

In the end, everything is about money. However, they tell us that they democratize the Internet by making it searchable. I tell them that I democratize it too, by aggregating their results in a “novel” way.

What is the Internet? The democratization of the world’s information or a money-making cow? I hope it is not the latter.

Preliminary analysis: some results of the topic-extracting module of Talk Digger

December 21, 2005 | May 21, 2006 | Frederick Giasson

As soon as I finished and released the latest version of Talk Digger, I started to work on a new prototype module that tries to extract the topics of the results returned by search engines. What are these topics? They are the topics that evolve in a conversation (a conversation being the set of articles, returned by different search engines, that link to a specific URL).

I am releasing these preliminary results because I find them somewhat interesting (so they may interest someone else too; in fact, I read this blog post by Anjo Anjewierden yesterday while I was developing these tests, so I thought that I could write a little something on the subject).

These results are based on the results returned by Technorati for a search on a recent BBC article. The set of raw texts returned is:

DOCUMENT 3

URL: http://www.jnoelbell.me.uk/2005/12/21/so-much-news-so-little-time/

be resolved shortly. I shudder to think about the people trying to get to the airports in a day or two. secondly, thank god people are coming to their senses . “intelligent design” is just a pretext for promoting religion, which has no place in the public schools. you don’t like it?
[…]

DOCUMENT 7

URL: http://godcountryyale.blogspot.com/2005/12/suck-it-fox-news.html

wild glory days when I got linked to from Not Even Wrong… Anyway, big news today is that the “intelligent design” case in Dover got struck down ( BBC , CNN), an event that was made all the merrier due to the fact that I first heard about it on Fox News while channel surfing. If you
[…]

Click here to see the whole set

The next step is to apply some lexical analysis techniques to ‘purify’ the raw text. The resulting set of purified texts is:

DOCUMENT 3

resolve shudder think people try airport sense design pretext religion place public school

[…]

DOCUMENT 7

wild glory day link wrong today design case dove down event fact first while channel surf

[…]

Click here to see the whole set

As you can see, only nouns remain. The reason is simple: I assume that nouns are the words that carry the most semantic meaning for describing topics. In later steps, verbs, adverbs and adjectives may be added to these sets because of their possible semantic relations with these nouns in other conceptual domains.

The 10 most frequent words of this set form the set of possible topics of the conversation. The set is:
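To give an idea of what such a ‘purification’ pass can look like, here is a very reduced sketch. It only tokenizes the text and drops stop words; the lemmatization (“resolved” to “resolve”) and the part-of-speech filtering described above are left out, and the `STOP_WORDS` list is a tiny sample invented for the example.

```python
import re

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "be", "to", "of", "in", "for", "it",
    "i", "you", "which", "has", "no", "just", "about", "or", "their", "so",
}

def purify(raw_text):
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

purified = purify("Religion has no place in the public schools.")
```

A real pipeline would follow this with a lemmatizer and a part-of-speech tagger before counting anything.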

there [Frequence: (2) Tag count: (0)]

teach [Frequence: (3) Tag count: (0)]

today [Frequence: (3) Tag count: (13)]

federal [Frequence: (3) Tag count: (0)]

class [Frequence: (3) Tag count: (190)]

school [Frequence: (3) Tag count: (108)]

sense [Frequence: (3) Tag count: (8)]

judge [Frequence: (3) Tag count: (3)]

pretext [Frequence: (3) Tag count: (0)]

design [Frequence: (11) Tag count: (13)]
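Counting of this kind is easy to sketch with Python’s `collections.Counter`. The function name is an assumption, and the input is only the two purified excerpts shown earlier (not the whole set), so the numbers differ from the list above.

```python
from collections import Counter

def possible_topics(purified_docs, n=10):
    """Return the n most frequent words across all purified documents."""
    counts = Counter(word for doc in purified_docs for word in doc)
    return counts.most_common(n)

# The two purified excerpts shown above (documents 3 and 7).
docs = [
    "resolve shudder think people try airport sense design pretext "
    "religion place public school".split(),
    "wild glory day link wrong today design case dove down event fact "
    "first while channel surf".split(),
]

top = possible_topics(docs, n=1)
```

Even with only these two excerpts, “design” already stands out as the most frequent word.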

An interesting metric I make explicit in these results is the “tag count”. The tag count is the number of times the word appears in the Brown Corpus. It tells us how “popular” the word is. So if I have to choose between two words with the same meaning, I will choose the one with the greater tag count because it is the one most used in English literature.

The next step is to try to find new topics that have semantic relations with the existing candidates, or to strengthen the current ones.

If you check a lexicon, you will see that each word can be defined by one or more sets of synonyms. In the current example I assume that a word is defined by all of its synonym sets (an assumption I make to keep things simple; in the real world, I would have to find which synonym set defines the word’s meaning in that context). My guess is that other words from the same article and from the other articles belonging to the conversation will smooth out the effect of this error on the results.

So, here is the set of possible topics augmented by the synonym sets of each word belonging to it.
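The tie-breaking rule can be written down directly. In this small sketch the tag counts are copied from the list above, the helper name is hypothetical, and a word with no known count is treated as 0.

```python
# Tag counts copied from the results listed above.
TAG_COUNT = {"class": 190, "school": 108, "design": 13, "today": 13,
             "sense": 8, "judge": 3}

def prefer(word_a, word_b, tag_count):
    """Between two words with the same meaning, keep the more 'popular' one."""
    # Words missing from the table count as 0 (an assumption of this sketch).
    return max((word_a, word_b), key=lambda w: tag_count.get(w, 0))

chosen = prefer("schoolhouse", "school", TAG_COUNT)
```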

social_class, socio-economic_class, course_of_instruction, course_of_study, course, category, family, division, year, grade, form, civilise, civilize, schooling, schoolhouse, feel, signified, shoal, schooltime, cultivate, train, educate, school_day, classify, sort, learn, Blackbeard, Edward_Thatch, Thatch, instruct, at_that_place, on_that_point, in_that_respect, thither, in_that_location, Edward_Teach, Teach, Fed, separate, sort_out, assort, federal_official, Federal_soldier, now, nowadays, Union, Union_soldier, sensory_faculty, sentiency, project, aim, intention, guise, pretense, evaluator, stalking-horse, pretence, intent, purpose, invention, figure, designing, innovation, excogitation, blueprint, conception, justice, contrive, jurist, common_sense, try, gumption, horse_sense, sentience, sensation, mother_wit, adjudicate, good_sense, estimate, approximate, guess, magistrate, label, pronounce, gauge, pattern, Federal, plan, pretext, teach, there, federal, today, judge, class, sense, school, design

The most interesting words of this set are:

pattern [Frequence: (2) Tag count: (9)]

Federal [Frequence: (3) Tag count: (0)]

plan [Frequence: (3) Tag count: (43)]

pretext [Frequence: (5) Tag count: (0)]

teach [Frequence: (5) Tag count: (0)]

there [Frequence: (6) Tag count: (0)]

federal [Frequence: (6) Tag count: (0)]

today [Frequence: (7) Tag count: (13)]

judge [Frequence: (10) Tag count: (3)]

class [Frequence: (12) Tag count: (190)]

sense [Frequence: (12) Tag count: (8)]

school [Frequence: (13) Tag count: (108)]

design [Frequence: (25) Tag count: (13)]

NOTE: If you check, you may think that the frequencies are wrong. The reason is that I added the frequencies from the previous set to those of the synonym sets.

There are three interesting facts: (1) the appearance of the concept “plan”; (2) the promotion of the concept “school”, driven by its semantic links with the synonym sets of the other words belonging to the set; and (3) the demotion of the concept “pretext”.

The current set of possible topics is now defined by the 10 most frequent nouns we extracted and the synonym sets of each of these words.

The final step performed to find the topics of a conversation is to augment the set of possible topics with the words that describe the same concepts as the ones in the set (the sister concepts). The resulting set is:

Texas_Independence_Day, February_22, March_2, Washington’s_Birthday, St_Patrick’s_Day, April_Fools’, March_17, Saint_Patrick’s_Day, February_14, St_Valentine’s_Day, February_2, Groundhog_Day, holiday, Tet, Lincoln’s_Birthday, February_12, Saint_Valentine’s_Day, Valentine’s_Day, Valentine_Day, April_Fools’_day, All_Fools’_day, Father’s_Day, June_14, Flag_Day, June_3, Citizenship_Day, September_17, October_24, United_Nations_Day, American_Indian_Day, Davis’_Birthday, Jefferson_Davis’_Birthday, Patriot’s_Day, April_14, Pan_American_Day, May_Day, First_of_May, Armed_Forces_Day, Mother’s_Day, May_1, January_19, Robert_E_Lee_Day, old_age, middle_age, adulthood, salad_days, geezerhood, deathbed, commencement_day, Arbor_Day, Admission_Day, bloom_of_youth, mid-nineties, golden_years, mid-sixties, sixties, seventies, mid-seventies, nineties, mid-eighties, eighties, degree_day, November_5, market_day, ides, election_day, polling_day, Walpurgis_Night, New_Year’s_Eve, Halloween, Robert_E_Lee’s_Birthday, December_31, payday, red-letter_day, leap_day,

[…]

measure, time, estimate, dull, strike, age, point, gauge, dissolve, denature, label, indicate, intention, order, acquaint, obscure, resolve, get, sensitize, moderate, sensitise, blunt, blur, division, contrive, take, draw, purpose, tame, report, course, try, construct, pattern, run, bring, touch, season, think, life, activate, break, grade, set, shift, feel, loosen, sense, year, night, project, convert, plan, judge, school, turn, figure, separate, train, develop, aim, transform, class, make, form, design

Click here to see the whole set

As you can see, there is a small exponential explosion. This is a problem, and it is the reason why I should make decisions, at each step, to keep only the best words that could describe the topics of a conversation.

The most interesting words in this new set are:

think [Frequence: (19) Tag count: (0)]
life [Frequence: (19) Tag count: (107)]
activate [Frequence: (20) Tag count: (2)]
break [Frequence: (20) Tag count: (0)]
grade [Frequence: (20) Tag count: (17)]
set [Frequence: (21) Tag count: (24)]
shift [Frequence: (22) Tag count: (1)]
feel [Frequence: (22) Tag count: (5)]
loosen [Frequence: (22) Tag count: (0)]
sense [Frequence: (22) Tag count: (8)]
year [Frequence: (23) Tag count: (5)]
night [Frequence: (24) Tag count: (736)]
project [Frequence: (24) Tag count: (1)]
convert [Frequence: (25) Tag count: (0)]
plan [Frequence: (25) Tag count: (43)]
judge [Frequence: (25) Tag count: (3)]
school [Frequence: (26) Tag count: (108)]
turn [Frequence: (26) Tag count: (4)]
figure [Frequence: (26) Tag count: (0)]
separate [Frequence: (27) Tag count: (3)]
train [Frequence: (29) Tag count: (5)]
develop [Frequence: (30) Tag count: (45)]
aim [Frequence: (31) Tag count: (4)]
transform [Frequence: (32) Tag count: (3)]
class [Frequence: (32) Tag count: (190)]
make [Frequence: (40) Tag count: (34)]
form [Frequence: (40) Tag count: (1)]
design [Frequence: (50) Tag count: (13)]

Some interesting new words appeared; other, less interesting ones appeared too. This is just an example of the impact of adding the sets of words describing the sister concepts of the previous set of possible topics. We could do the same thing by adding the set of more general concepts related to our current set of concepts (hypernymification) or by adding the set of more specific concepts related to our current set of concepts (hyponymification).

This first test with a real-world example is quite interesting and even promising. So, what will I do with it? Keep checking talkdigger.com over the next month.

Alexa opens its teragigs of indexes: can Talk Digger take advantage of it?

December 13, 2005 | May 22, 2006 | Frederick Giasson

Alexa (Amazon.com) just started a new web service that gives access to Alexa’s databases to anyone who needs it. It is really great news. I am excited to see big companies opening up and making their data publicly available to anyone who needs it.

I have been talking about how I see the future of the Web for some months now. I have talked about my vision of the future of the Internet with the Semantic Web, etc. I have talked about how the Web could change if everybody made their gathered, processed and indexed content publicly available.

Yesterday I released a totally new version of Talk Digger. I talked about how I would like to make the computed results available to anyone who needs them. It is a dream I have; it is a reality that Amazon is making. Talk Digger’s and Alexa’s results would not be the same, and neither would their users, but either way, it fits a vision of things that could change the way we use the Internet and the way the Internet grows.

The new version of Talk Digger uses a Google web service: PageRank. It is a really great way to try to gauge the credibility of the people who are talking about a URL; it is a great way to know who the people participating in a conversation are. It is certainly not the best or the only way to do that, but it is a good start. In fact, I am designing a system, a new feature of Talk Digger, that I think could be a good way to see, analyze and interpret these conversations. Either way, PageRank is a great feature that will be part of Talk Digger for a long time (as long as Google gives access to its API through a web service).

Here is the point: Talk Digger goes even further in displaying its results by using the service of another company.

Now, would it be possible to integrate the new Alexa web service to enhance Talk Digger’s results? It would be really great, considering all the data we have access to through the web service. I could even combine Google’s PageRank with Alexa’s popularity system to compute a single indicator that would use both services (neither is foolproof, but they could be complementary).

The problem with Alexa’s service is that I am restricted to one request per second per IP. The thing is that if you start a search for a URL and receive 70 results, Talk Digger requests the PageRank of these 70 URLs in less than a second. So, I cannot really integrate Alexa’s new web service into Talk Digger with this restriction.

Either way, Amazon has done a great thing by creating this new web service. I hope that other companies follow it in that direction.
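To make the cost of that restriction concrete, here is a small sketch of a client-side throttle and the resulting lower bound on waiting time. The `Throttle` class and `min_seconds` helper are hypothetical names; nothing here calls the real service.

```python
import time

class Throttle:
    """Block so that calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval

def min_seconds(n_requests, interval=1.0):
    """Lower bound on the time needed for n sequential throttled requests."""
    # The first request goes out immediately; each later one waits a full interval.
    return max(0, n_requests - 1) * interval

delay_for_70_results = min_seconds(70)
```

With 70 results, a user would wait more than a minute for a page that the PageRank lookups currently fill in under a second.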

Talk Digger Beta 2.0: a totally new system and interface

December 12, 2005 | May 21, 2006 | Frederick Giasson

I talked about it in my previous blog posts. I worked on it during the last two months. Now the new Talk Digger website is released.

I will call this version Beta 2.0. In fact, I should call it Beta 1.0, considering that the first version of Talk Digger was in reality an alpha. Everything is new: the underlying system, the interface, the design, the RSS feed, etc. Why did I re-program and re-design everything? Because I wanted to get rid of the mistakes I made in the previous version; I wanted to design it in such a way that it would be a good base to extend into a new type of service (that I will develop in the coming months).

So, what is new in this version?

1. I designed a more traditional search engine layout. This one is much simpler than the previous one. I wanted to make Talk Digger simple (but not simpler!). I tried to make it more intuitive for new users.

2. Many more results are displayed by Talk Digger (between 10 and 20, depending on the search engine).

3. Some new search engines: Google Blog and Yahoo!

4. I added a really great feature to the system (thanks to Tom Sherman for the idea): the PageRank of each result returned by Talk Digger is displayed beside its title. This is really great because it gives trustworthy information about each result: its popularity and credibility on the Internet.

5. New options have been added to the system. Now you can specify the maximum number of results you want to view per search engine. You can sort the results with the most recent entries first or with the highest PageRank first.

6. I created a hotkey system that helps users navigate the website.

7. The tracking RSS feed system is now formatted using RSS 1.0 instead of RSS 2.0. I briefly explained why in that previous post.

8. All duplicated results (the same article returned by two different search engines) are removed (only one is displayed). You also have the option to exclude results with the same domain name as the searched URL.

9. It works in IE/Firefox/Safari/Opera on both PC and Mac. The entire website is validated as XHTML 1.0 Strict and CSS.

10. A new slogan: “You talk, we dig!” (thanks to Bora Ung)
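Items 5 and 8 describe a small result-processing pipeline: drop duplicates, optionally exclude the searched site’s own pages, then sort. The sketch below shows one way this could look; the record fields and the function name are assumptions, and comparing hostnames is only an approximation of “same domain name” (it treats `www.example.com` and `example.com` as different).

```python
from datetime import date
from urllib.parse import urlparse

def clean_and_sort(results, searched_url, by="recent", exclude_same_domain=False):
    """Remove duplicate URLs, optionally drop same-host results, then sort."""
    searched_host = urlparse(searched_url).hostname
    seen, kept = set(), []
    for r in results:
        if r["url"] in seen:
            continue  # same article already returned by another search engine
        if exclude_same_domain and urlparse(r["url"]).hostname == searched_host:
            continue
        seen.add(r["url"])
        kept.append(r)
    key = "published" if by == "recent" else "pagerank"
    return sorted(kept, key=lambda r: r[key], reverse=True)

# Hypothetical sample data.
results = [
    {"url": "http://a.example/1", "engine": "Technorati",
     "published": date(2005, 12, 10), "pagerank": 3},
    {"url": "http://b.example/2", "engine": "Google Blog",
     "published": date(2005, 12, 11), "pagerank": 5},
    {"url": "http://a.example/1", "engine": "Yahoo!",
     "published": date(2005, 12, 10), "pagerank": 3},
    {"url": "http://news.example/x", "engine": "Yahoo!",
     "published": date(2005, 12, 12), "pagerank": 1},
]

by_rank = clean_and_sort(results, "http://news.example/story",
                         by="pagerank", exclude_same_domain=True)
by_date = clean_and_sort(results, "http://news.example/story", by="recent")
```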

I also created a “Tour” section that shows how Talk Digger works.

What is the near future of Talk Digger?

During the next month, I will work on improving Talk Digger with feedback from users, but I will also look into the possibility of broadcasting the TD results in RDF and/or OWL. That way, other services would be able to gather and understand the computed results returned by Talk Digger and do what they want with the information (a first step into the Semantic Web…). I am currently designing the RDF Schemas (and the OWL ontology) that will describe the Talk Digger results. However, I am not sure that I will open such a service right now (considering the network infrastructure it would need and my current lack of money).

In fact, it would be the first step to test Talk Digger as a Semantic Web service. The next phase of its development will go even further in that direction (it’s the goal).
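To give a feel for what broadcasting a result in RDF could mean, here is a minimal sketch that writes one result as N-Triples by hand. Everything about the vocabulary is hypothetical: the `http://example.org/talkdigger#` namespace and the `linksTo`, `title` and `pageRank` properties are placeholders, not the schema being designed.

```python
TD = "http://example.org/talkdigger#"  # hypothetical namespace
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'

def result_to_ntriples(searched_url, result_url, title, pagerank):
    """Describe one result: what it links to, its title and its PageRank."""
    s = f"<{result_url}>"
    return "\n".join([
        f"{s} <{TD}linksTo> <{searched_url}> .",
        f"{s} <{TD}title> {literal(title)} .",
        f'{s} <{TD}pageRank> "{pagerank}"^^<{XSD_INT}> .',
    ])

triples = result_to_ntriples(
    "http://news.example/story", "http://blog.example/post",
    'He said "hi"', 5,
)
```

Any RDF-aware agent could then load these statements and merge them with data from other services, which is the whole point of publishing the results this way.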

What is the future of Talk Digger?

Two lines of research: (1) semantic web and (2) semantic analysis/management of web documents.

So, this is what is happening right now with Talk Digger.

Do not hesitate to contact me if you have any questions, comments or suggestions about this new version of Talk Digger: they are always greatly appreciated.

I would like to thank Tom Sherman, Jeff Nolan and Matthew Hurst for their comments and suggestions about this new version of Talk Digger. I would also like to thank Suzanne Morel at Les Graphoides for this totally new graphical design and the time she spent working, re-working and re-re-working on the graphics as I changed my mind.

I hope you like this new Talk Digger version and find it as useful as I do.

Frederick Giasson
