Frederick Giasson – Page 73 – Machine Learning, Engineering & Data

Victoria Holt

January 4, 2006 Frederick Giasson

“Never regret. If it’s good, it’s wonderful. If it’s bad, it’s experience.”

— Victoria Holt

New Talk Digger feature: Regional view of results

December 31, 2005 / May 21, 2006 Frederick Giasson

I just released a new feature for Talk Digger. The idea was born from a wish of a PR worker, David Jones, who wanted to be able to see and sort results by region (country). It was important for him because his client cared more about comments from people in its target markets than about the others. I found that this feature was essential for Talk Digger, not just for marketing and PR workers but for everybody. What I like about this idea is that it puts a touch of humanness into the conversations that are dug up. It gives users a new metric to analyze who is talking in a conversation. So I took the last two days to develop and release this new feature.

What is this new feature all about? It is called the Regional setting. This setting lets you enable the regional view of each result. If the feature is enabled (it is disabled by default), a flag appears showing the country of the server that hosts the resulting web page. This option is useful when you are trying to find people living in a specific country who talk about a URL. It is especially helpful for marketing and PR workers who have to do regional searches for their clients’ products.

How should the flags be interpreted? The flag appearing beside a title shows the country where the web page is hosted. If a Japanese blogger hosts his blog in America, you will see the flag of the United States, unless he uses a country domain name (.jp) rather than a generic one (.com, .net, etc.). However, people generally use their country’s domain names, or at least host their web pages with a local web host. Considering this, I would estimate that about 70% of the displayed flags represent the country where the creator of the result lives.

If you enable this feature, you will be able to sort the incoming results by country. For example, here the first results will be the Canadian pages and all the others will be grouped by country.
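The flag rule and the sort described above can be sketched as follows. This is only an illustration, not Talk Digger's actual code: the `CC_TLDS` table, the `hosting_country` field (assumed to come from some lookup of the server's location) and the function names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical, tiny sample of country-code domains and their countries.
CC_TLDS = {"ca": "Canada", "jp": "Japan", "uk": "United Kingdom", "fr": "France"}

def guess_country(url, hosting_country):
    """Prefer a country-code domain; otherwise fall back to the hosting country."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CC_TLDS.get(tld, hosting_country)

def sort_by_country(results, first_country):
    """Put `first_country` on top, then group the other results by country."""
    def key(result):
        country = guess_country(result["url"], result["hosting_country"])
        # False sorts before True, so the chosen country comes first.
        return (country != first_country, country)
    return sorted(results, key=key)

results = [
    {"url": "http://example.com/post", "hosting_country": "United States"},
    {"url": "http://example.jp/entry", "hosting_country": "United States"},
    {"url": "http://example.ca/billet", "hosting_country": "Canada"},
]
ordered = sort_by_country(results, "Canada")
countries = [guess_country(r["url"], r["hosting_country"]) for r in ordered]
```

In this toy run the .jp page hosted in the United States still gets the Japanese flag, which is the exception described above for country domain names.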

I hope that you will find this new feature, another one that saw the light of day through my interaction with Talk Digger users, as useful as I do.

Happy New Year!

The business model of a Semantic Web service

December 27, 2005 / May 21, 2006 Frederick Giasson

What could be the business model of a Semantic Web service? First of all, what is a Semantic Web service? We could define it as a web service that broadcasts its content using formats like RDF or OWL. The content is formatted in such a way that software agents can easily read it and infer knowledge from it. The key principle of a Semantic Web service is its openness.

Yesterday I was talking with one of the creators of the search engines used by Talk Digger. During our conversation he asked me:

Are you planning to share any revenues with the blog engines? Do you have a business model for Talk Digger?

His question was legitimate but the answer was short: no. No, I do not have any business model for Talk Digger. Right now it is an experiment, a little personal research project. I could share revenues, but I doubt they would interest anyone considering that they barely pay the hosting fees.

However, I have already thought about it and the question is a good one: what could be a business model for a service such as Talk Digger? Even more: what could be a business model for a Semantic Web service, a service that broadcasts its computations at large, without any fees? (Talk Digger could become such a service… it is a wish, remember?)

I have no idea.

John Heilemann wrote in the New York Metro:

“Alan Murray wrote a column in the Wall Street Journal that called Google’s business model a new kind of feudalism: The peasants produce the content; Google makes the profits,”

That is right, and not just for Google. Search engines gather and index web content from everywhere. They even cache the content of entire web pages and publish it without any kind of permission (remember the “cache” feature on many search engines?). I remember that secret US military documents have already been indexed by Google (just a rumor or a fact? I don’t remember). I also know that Google indexes the content of millions of books without any kind of permission.

So, the question remains: what could be the business model for Talk Digger? And if revenues are generated with such a model, should I share them with the search engine companies? Sure I should; they can ban the IP address of my crawler at any time. However, is that to their advantage? Considering the picture I painted of some current search engines, why couldn’t I scrape their web pages for some results? If we think about it, their results pages are web documents like any other web page on the Internet. If they can scrape others’ web pages, why couldn’t I too?

In the end everything is about money. However, they tell us that they democratize the Internet by making it searchable. I tell them that I democratize it too, by aggregating their results in a “novel” way.

What is the Internet? The democratization of the world’s information or a money-making cow? I hope not the latter.

Preliminary analysis: some results of the topic-extracting module of Talk Digger

December 21, 2005 / May 21, 2006 Frederick Giasson

As soon as I finished and released the last version of Talk Digger, I started to work on a new prototype module that tries to extract the topics of the results returned by search engines. What are these topics? They are the topics that evolve in a conversation (and a conversation is a set of articles, returned by different search engines, that link to a specific URL).

I am releasing these preliminary results because I find them somewhat interesting (so they could possibly interest someone else too; in fact I read this blog post by Anjo Anjewierden yesterday while I was developing these tests, so I thought I could write a little something on the subject).

These results are based on the results returned by Technorati for a search on a recent BBC article. The set of raw texts returned is defined by:

DOCUMENT 3

URL: http://www.jnoelbell.me.uk/2005/12/21/so-much-news-so-little-time/

be resolved shortly. I shudder to think about the people trying to get to the airports in a day or two. secondly, thank god people are coming to their senses . “intelligent design” is just a pretext for promoting religion, which has no place in the public schools. you don’t like it?
[…]

DOCUMENT 7

URL: http://godcountryyale.blogspot.com/2005/12/suck-it-fox-news.html

wild glory days when I got linked to from Not Even Wrong… Anyway, big news today is that the “intelligent design” case in Dover got struck down ( BBC , CNN), an event that was made all the merrier due to the fact that I first heard about it on Fox News while channel surfing. If you
[…]

Click here to see the whole set

The next step is to perform some lexical analysis techniques to ‘purify’ the raw text. The resulting set of purified texts is:

DOCUMENT 3

resolve shudder think people try airport sense design pretext religion place public school

[…]

DOCUMENT 7

wild glory day link wrong today design case dove down event fact first while channel surf

[…]

Click here to see the whole set

As you can notice, only nouns remain. The reason is simple: I assume that the words with the greatest semantic meaning for describing topics are nouns. In the next steps, verbs, adverbs and adjectives will possibly be added to these sets because of their possible semantic relations with these nouns in other conceptual domains.
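A minimal sketch of what such a purification step might look like, assuming a stop-word list and a lemma table. Both `STOPWORDS` and `LEMMAS` are hypothetical and deliberately tiny here; the real module presumably relies on a full lexicon and part-of-speech information rather than hand-written tables.

```python
import re

# Hypothetical stop-word list: words considered too generic to describe a topic.
STOPWORDS = {"i", "to", "the", "about", "get", "in", "a", "or", "is", "for"}

# Hypothetical lemma table mapping inflected forms to their base form.
LEMMAS = {"trying": "try", "airports": "airport", "schools": "school"}

def purify(text):
    """Lowercase, tokenize, drop stop words and reduce words to their base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOPWORDS]

purified = purify("I shudder to think about the people trying to get to the airports")
# purified == ["shudder", "think", "people", "try", "airport"]
```

The sample sentence is taken from DOCUMENT 3 above, and the toy tables were chosen so that the output matches the corresponding part of the purified text.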

The 10 most frequent words of this set form the set of possible topics of the conversation. The set is defined by:

there [Frequence: (2) Tag count: (0)]

teach [Frequence: (3) Tag count: (0)]

today [Frequence: (3) Tag count: (13)]

federal [Frequence: (3) Tag count: (0)]

class [Frequence: (3) Tag count: (190)]

school [Frequence: (3) Tag count: (108)]

sense [Frequence: (3) Tag count: (8)]

judge [Frequence: (3) Tag count: (3)]

pretext [Frequence: (3) Tag count: (0)]

design [Frequence: (11) Tag count: (13)]

An interesting metric I make explicit in these results is the “tag count”. The tag count is the number of times the word appears in the Brown Corpus. It tells us how “popular” the word is. So if I have to choose between two words with the same meaning, I will choose the one with the greatest tag count because it is the one that is the most used in English literature.

The next step is to try to find new topics that have semantic relations with the existing candidates, or to strengthen the current ones.
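The frequency ranking and the tag-count tie-break can be sketched like this. The function names and the toy documents are assumptions made for illustration; only the two tag counts (190 for “class”, 17 for “grade”) are taken from the listings in this post.

```python
from collections import Counter

def candidate_topics(documents, n=10):
    """Return the n most frequent words across all purified documents."""
    counts = Counter(word for document in documents for word in document)
    return counts.most_common(n)

def pick_by_tag_count(words, tag_counts):
    """Among words with the same meaning, keep the one with the highest tag count."""
    return max(words, key=lambda word: tag_counts.get(word, 0))

# Toy purified documents (hypothetical, much smaller than the real set).
documents = [
    ["design", "school", "religion"],
    ["design", "case", "school"],
    ["design", "judge"],
]
top_two = candidate_topics(documents, n=2)

# Tag counts as reported in this post for these two words.
preferred = pick_by_tag_count(["grade", "class"], {"class": 190, "grade": 17})
```

Here `top_two` holds “design” (3 occurrences) and “school” (2), and `preferred` is “class” because its tag count is the higher of the two.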

If you check a lexicon, you will see that each word can be defined by one or more sets of synonyms. In the current example I assume that a word is defined by all of its synonym sets (it is an assumption I make to keep things simple; in the real world, I would have to find which of the synonym sets defines the word’s meaning in that context). I am guessing that other words from the same article and from the other articles (belonging to the conversation) will smooth out the effect of this error on the results.

So, here is the set of possible topics augmented by the synonym sets of each word belonging to it.

social_class, socio-economic_class, course_of_instruction, course_of_study, course, category, family, division, year, grade, form, civilise, civilize, schooling, schoolhouse, feel, signified, shoal, schooltime, cultivate, train, educate, school_day, classify, sort, learn, Blackbeard, Edward_Thatch, Thatch, instruct, at_that_place, on_that_point, in_that_respect, thither, in_that_location, Edward_Teach, Teach, Fed, separate, sort_out, assort, federal_official, Federal_soldier, now, nowadays, Union, Union_soldier, sensory_faculty, sentiency, project, aim, intention, guise, pretense, evaluator, stalking-horse, pretence, intent, purpose, invention, figure, designing, innovation, excogitation, blueprint, conception, justice, contrive, jurist, common_sense, try, gumption, horse_sense, sentience, sensation, mother_wit, adjudicate, good_sense, estimate, approximate, guess, magistrate, label, pronounce, gauge, pattern, Federal, plan, pretext, teach, there, federal, today, judge, class, sense, school, design

The most interesting words of this set are:

pattern [Frequence: (2) Tag count: (9)]

Federal [Frequence: (3) Tag count: (0)]

plan [Frequence: (3) Tag count: (43)]

pretext [Frequence: (5) Tag count: (0)]

teach [Frequence: (5) Tag count: (0)]

there [Frequence: (6) Tag count: (0)]

federal [Frequence: (6) Tag count: (0)]

today [Frequence: (7) Tag count: (13)]

judge [Frequence: (10) Tag count: (3)]

class [Frequence: (12) Tag count: (190)]

sense [Frequence: (12) Tag count: (8)]

school [Frequence: (13) Tag count: (108)]

design [Frequence: (25) Tag count: (13)]

NOTE: If you check, you may think that the frequencies are wrong. The reason is that I added the frequencies of the previous set to those of the synonym sets.

There are three interesting facts: (1) the appearance of the concept “plan”; (2) the upgrade of the concept “school”, forced by its semantic links with the synonym sets of the other words belonging to the set; and (3) the downgrade of the concept “pretext”.
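One possible reading of this accumulation, as a sketch: start from the previous frequencies, then add one for every appearance of a word in a synonym set of a candidate topic. The exact counting rule used by the module is not given in this post, so the rule below, the `LEXICON` table and the function name are all assumptions.

```python
from collections import Counter

# Hypothetical toy lexicon: each word maps to a list of synonym sets.
LEXICON = {
    "design": [{"design", "plan"}, {"design", "pattern", "plan"}],
    "school": [{"school", "schoolhouse"}],
}

def augment_with_synonyms(frequencies, lexicon):
    """Add the synonym-set counts on top of the previous frequencies."""
    augmented = Counter(frequencies)
    for word in frequencies:
        # Assumption from the text: every synonym set of the word is used,
        # without trying to pick the sense that fits the context.
        for synset in lexicon.get(word, []):
            for synonym in synset:
                augmented[synonym] += 1
    return augmented

augmented = augment_with_synonyms({"design": 11, "school": 3}, LEXICON)
# design: 11 + 2, plan: 2, pattern: 1, school: 3 + 1, schoolhouse: 1
```

Under this rule a word that sits in several synonym sets is reinforced, which is how a new concept such as “plan” can surface.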

The current set of possible topics is now defined by the 10 most frequent nouns we extracted and the synonym sets of each of these words.

The final step performed to find the topics of a conversation is to augment the set of possible topics with the words that describe the same concepts as the ones in the set (the sister concepts). The resulting set is defined by:

Texas_Independence_Day, February_22, March_2, Washington’s_Birthday, St_Patrick’s_Day, April_Fools’, March_17, Saint_Patrick’s_Day, February_14, St_Valentine’s_Day, February_2, Groundhog_Day, holiday, Tet, Lincoln’s_Birthday, February_12, Saint_Valentine’s_Day, Valentine’s_Day, Valentine_Day, April_Fools’_day, All_Fools’_day, Father’s_Day, June_14, Flag_Day, June_3, Citizenship_Day, September_17, October_24, United_Nations_Day, American_Indian_Day, Davis’_Birthday, Jefferson_Davis’_Birthday, Patriot’s_Day, April_14, Pan_American_Day, May_Day, First_of_May, Armed_Forces_Day, Mother’s_Day, May_1, January_19, Robert_E_Lee_Day, old_age, middle_age, adulthood, salad_days, geezerhood, deathbed, commencement_day, Arbor_Day, Admission_Day, bloom_of_youth, mid-nineties, golden_years, mid-sixties, sixties, seventies, mid-seventies, nineties, mid-eighties, eighties, degree_day, November_5, market_day, ides, election_day, polling_day, Walpurgis_Night, New_Year’s_Eve, Halloween, Robert_E_Lee’s_Birthday, December_31, payday, red-letter_day, leap_day,

[…]

measure, time, estimate, dull, strike, age, point, gauge, dissolve, denature, label, indicate, intention, order, acquaint, obscure, resolve, get, sensitize, moderate, sensitise, blunt, blur, division, contrive, take, draw, purpose, tame, report, course, try, construct, pattern, run, bring, touch, season, think, life, activate, break, grade, set, shift, feel, loosen, sense, year, night, project, convert, plan, judge, school, turn, figure, separate, train, develop, aim, transform, class, make, form, design

Click here to see the whole set

As you can notice, there is a little exponential explosion. This is a problem, and it is the reason why I should make decisions, at each step, to keep only the best words that could describe the topics of a conversation.
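A sketch of the sister-concept expansion, together with one way to contain the explosion by keeping only the most frequent words after the step. The `HYPERNYM` table is a hypothetical toy; sisters are taken here to be words that share the same more general concept, and the pruning rule is an assumption rather than the module's actual decision procedure.

```python
from collections import Counter

# Hypothetical toy taxonomy: each word maps to its more general concept.
HYPERNYM = {
    "today": "day",
    "holiday": "day",
    "payday": "day",
    "school": "educational_institution",
    "college": "educational_institution",
}

def sisters(word, hypernym):
    """Return the other words that share the same more general concept."""
    parent = hypernym.get(word)
    if parent is None:
        return set()
    return {other for other, p in hypernym.items() if p == parent and other != word}

def expand_and_prune(frequencies, hypernym, keep=10):
    """Add the sister concepts, then keep only the `keep` most frequent words."""
    expanded = Counter(frequencies)
    for word in frequencies:
        for sister in sisters(word, hypernym):
            expanded[sister] += 1
    return dict(expanded.most_common(keep))

pruned = expand_and_prune({"design": 25, "school": 13, "today": 7}, HYPERNYM, keep=3)
```

In this toy run “today” drags in “holiday” and “payday”, much like the long list of dates above, and pruning to three words brings the set back to “design”, “school” and “today”.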

The most interesting words in this new set are:

think [Frequence: (19) Tag count: (0)]
life [Frequence: (19) Tag count: (107)]
activate [Frequence: (20) Tag count: (2)]
break [Frequence: (20) Tag count: (0)]
grade [Frequence: (20) Tag count: (17)]
set [Frequence: (21) Tag count: (24)]
shift [Frequence: (22) Tag count: (1)]
feel [Frequence: (22) Tag count: (5)]
loosen [Frequence: (22) Tag count: (0)]
sense [Frequence: (22) Tag count: (8)]
year [Frequence: (23) Tag count: (5)]
night [Frequence: (24) Tag count: (736)]
project [Frequence: (24) Tag count: (1)]
convert [Frequence: (25) Tag count: (0)]
plan [Frequence: (25) Tag count: (43)]
judge [Frequence: (25) Tag count: (3)]
school [Frequence: (26) Tag count: (108)]
turn [Frequence: (26) Tag count: (4)]
figure [Frequence: (26) Tag count: (0)]
separate [Frequence: (27) Tag count: (3)]
train [Frequence: (29) Tag count: (5)]
develop [Frequence: (30) Tag count: (45)]
aim [Frequence: (31) Tag count: (4)]
transform [Frequence: (32) Tag count: (3)]
class [Frequence: (32) Tag count: (190)]
make [Frequence: (40) Tag count: (34)]
form [Frequence: (40) Tag count: (1)]
design [Frequence: (50) Tag count: (13)]

Some interesting new words appeared; other, less interesting ones appeared too. This is just an example of the impact of adding the sets of words describing the sister concepts of the previous set of possible topics. We could do the same thing by adding the set of more general concepts related to our current set of concepts (Hypernymification) or by adding the set of more specific concepts related to our current set of concepts (Hyponymification).

This first test I made with a real-world example is quite interesting and even promising. So, what will I do with that? Keep checking talkdigger.com over the next month.

Alexa opens its teragigs of indexes: can Talk Digger take advantage of it?

December 13, 2005 / May 22, 2006 Frederick Giasson

Alexa (Amazon.com) just started a new web service that will give access to Alexa’s databases to anyone who needs it. It is really great news. I am excited to see that big companies are opening up and making their data publicly available to anyone who needs it.

I have been talking about how I see the future of the Web for some months now. I have been talking about the vision I have of the future of the Internet with the Semantic Web, etc. I talked about how the Web could change if everybody made their gathered, processed and indexed content publicly available.

Yesterday I released a totally new version of Talk Digger. I talked about how I would like to make the computed results available to anyone who needs them. It is a dream I have; it is a reality that Amazon is making. Talk Digger’s and Alexa’s results would not be the same, and neither would their users, but either way, it fits a vision of things that could change the way we use the Internet and the way the Internet grows.

The new version of Talk Digger uses a Google web service: PageRank. It is a really great way to gauge the credibility of the people who are talking about a URL; it is a great way to know who the people participating in a conversation are. It is certainly not the best or the only way to do that, but it is a good start. In fact I am designing a system, a new feature of Talk Digger, that I think could be a good way to see, analyze and interpret these conversations. Either way, it is a great feature that will be part of Talk Digger for a long time (as long as Google gives access to its API through a web service).

Here is the point: Talk Digger goes even further in displaying its results by using another company’s service.

Now, would it be possible to integrate the new Alexa web service to enhance Talk Digger’s results? It would be really great considering all the data we have access to through the web service. I could even combine Google’s PageRank with Alexa’s popularity ranking to compute a single indicator that would use both services (neither is foolproof, but they could be complementary).

The problem with Alexa’s service is that I am restricted to one request per second per IP. The thing is that if you start a search for a URL and receive 70 results, Talk Digger requests the PageRank of these 70 URLs in less than a second. So, I cannot really integrate Alexa’s new web service into Talk Digger with this restriction.
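To make the cost of a one-request-per-second limit concrete, here is a minimal client-side throttle sketch. The `Throttle` class, `fetch_all` and the `fetch` callable are hypothetical names; nothing here calls a real web service, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

def fetch_all(urls, fetch, throttle):
    """Call `fetch` on every URL, never faster than the throttle allows."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

With a one-second interval, a search that returns 70 results would spend roughly 69 seconds waiting between requests, which illustrates why the restriction does not fit a search that currently resolves all 70 lookups in under a second.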

Either way, Amazon has done a great thing by creating this new web service. I hope that other companies follow it in that direction.