As soon as I finished and released the latest version of Talk Digger, I started to work on a new prototype module that tries to extract the topics of the results returned by search engines. What are these topics? They are the topics that emerge in a conversation (a conversation being the set of articles, returned by different search engines, that link to a specific URL).
I am releasing these preliminary results because I find them somewhat interesting, so they might interest someone else too. In fact, I read this blog post by Anjo Anjewierden yesterday while I was developing these tests, so I thought I could write a little something on the subject.
These results are based on the results returned by Technorati for a search on a recent BBC article. The set of raw texts returned is defined by:
DOCUMENT 3
URL: http://www.jnoelbell.me.uk/2005/12/21/so-much-news-so-little-time/
be resolved shortly. I shudder to think about the people trying to get to the airports in a day or two. secondly, thank god people are coming to their senses . “intelligent design” is just a pretext for promoting religion, which has no place in the public schools. you don’t like it?
[…]
DOCUMENT 7
URL: http://godcountryyale.blogspot.com/2005/12/suck-it-fox-news.html
wild glory days when I got linked to from Not Even Wrong… Anyway, big news today is that the “intelligent design” case in Dover got struck down ( BBC , CNN), an event that was made all the merrier due to the fact that I first heard about it on Fox News while channel surfing. If you
[…]
Click here to see the whole set
The next step is to apply some lexical analysis techniques to ‘purify’ the raw text. The resulting set of purified texts is:
DOCUMENT 3
resolve shudder think people try airport sense design pretext religion place public school
[…]
DOCUMENT 7
wild glory day link wrong today design case dove down event fact first while channel surf
[…]
Click here to see the whole set
As you can see, only nouns remain. The reason is simple: I assume that the words that carry the most semantic meaning for describing topics are nouns. In the next steps, verbs, adverbs and adjectives may be added to these sets because of their possible semantic relations with these nouns in other conceptual domains.
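The purification step could be sketched roughly as follows. This is only an illustration: the stop-word list and the naive plural stripping are assumptions of mine, the function name `purify` is hypothetical, and the sketch does no part-of-speech tagging, which a real noun filter would need.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "just", "for", "in", "of", "to", "no", "has", "which"}


def purify(raw_text):
    """Lower-case the text, keep alphabetic tokens, drop stop words,
    and strip a trailing plural "s" as a crude stand-in for lemmatization."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    purified = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        # Naive plural stripping: "schools" -> "school", but "class" stays intact.
        if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            token = token[:-1]
        purified.append(token)
    return purified


words = purify("intelligent design is just a pretext for religion in the public schools")
print(" ".join(words))
```

A real module would replace both the stop-word list and the suffix rule with a proper lexical analyzer; the point here is only the shape of the transformation from raw text to a bag of candidate words.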
The 10 most frequent words of this set will create the set of possible topics of the conversation. The set is defined by:
there [Frequence: (2) Tag count: (0)]
teach [Frequence: (3) Tag count: (0)]
today [Frequence: (3) Tag count: (13)]
federal [Frequence: (3) Tag count: (0)]
class [Frequence: (3) Tag count: (190)]
school [Frequence: (3) Tag count: (108)]
sense [Frequence: (3) Tag count: (8)]
judge [Frequence: (3) Tag count: (3)]
pretext [Frequence: (3) Tag count: (0)]
design [Frequence: (11) Tag count: (13)]
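Selecting the most frequent words can be sketched with a simple counter. The two documents below are the purified excerpts shown earlier; the function name `most_frequent` is hypothetical, and the counts will not match the list above because only two of the documents are included.

```python
from collections import Counter


def most_frequent(purified_docs, n=10):
    """Count every word across all purified documents and return the n most frequent."""
    counts = Counter()
    for words in purified_docs:
        counts.update(words)
    return counts.most_common(n)


docs = [
    "resolve shudder think people try airport sense design pretext religion place public school".split(),
    "wild glory day link wrong today design case dove down event fact first while channel surf".split(),
]

for word, frequency in most_frequent(docs, n=3):
    print(word, frequency)
```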
An interesting metric I make explicit in these results is the “tag count”. The tag count is the number of times the word appears in the Brown Corpus. It tells us how “popular” the word is. So if I have to choose between two words with the same meaning, I will choose the one with the greatest tag count, because it is the one most used in English literature.
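Choosing between words of the same meaning by tag count might look like the sketch below. The tag counts are the ones reported in this post; the dictionary is an illustrative stand-in for a real corpus lookup, and `pick_most_popular` is a hypothetical helper name.

```python
# Tag counts as reported in this post; a real module would query the corpus instead.
TAG_COUNTS = {"class": 190, "school": 108, "plan": 43, "design": 13, "pattern": 9, "sense": 8}


def pick_most_popular(candidates, tag_counts):
    """Return the candidate with the highest tag count (unknown words count as 0)."""
    return max(candidates, key=lambda word: tag_counts.get(word, 0))


print(pick_most_popular(["design", "plan", "pattern"], TAG_COUNTS))
```

Words missing from the table are treated as having a tag count of zero, which matches the zeros shown for words such as “pretext” in the lists above.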
The next step is to try to find new topics that have semantic relations with the existing candidates, or to strengthen the current ones.
If you check a lexicon, you will see that each word can be defined by one or more sets of synonyms. In the current example I assume that a word is defined by all of its synonym sets (an assumption I make to keep things simple; in the real world, I would have to find which of the synonym sets defines the word according to its meaning in that context). My guess is that other words from the same article and from the other articles belonging to the conversation will smooth out the effect of this error on the results.
So, here is the set of possible topics augmented by the synonym sets of each word belonging to it.
social_class, socio-economic_class, course_of_instruction, course_of_study, course, category, family, division, year, grade, form, civilise, civilize, schooling, schoolhouse, feel, signified, shoal, schooltime, cultivate, train, educate, school_day, classify, sort, learn, Blackbeard, Edward_Thatch, Thatch, instruct, at_that_place, on_that_point, in_that_respect, thither, in_that_location, Edward_Teach, Teach, Fed, separate, sort_out, assort, federal_official, Federal_soldier, now, nowadays, Union, Union_soldier, sensory_faculty, sentiency, project, aim, intention, guise, pretense, evaluator, stalking-horse, pretence, intent, purpose, invention, figure, designing, innovation, excogitation, blueprint, conception, justice, contrive, jurist, common_sense, try, gumption, horse_sense, sentience, sensation, mother_wit, adjudicate, good_sense, estimate, approximate, guess, magistrate, label, pronounce, gauge, pattern, Federal, plan, pretext, teach, there, federal, today, judge, class, sense, school, design
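The augmentation above can be sketched with a toy lexicon. The words are taken from the set just shown, but the way they are grouped into synonym sets here is my own illustrative assumption, as are the names `TOY_LEXICON` and `expand_with_synonyms`; a real lexicon lookup would take the dictionary's place.

```python
# Toy lexicon: word -> list of synonym sets. The grouping is an assumption for illustration.
TOY_LEXICON = {
    "design": [
        ["plan", "design"],
        ["blueprint", "design", "pattern"],
        ["purpose", "intent", "intention", "aim", "design"],
    ],
    "judge": [
        ["judge", "justice", "jurist", "magistrate"],
        ["evaluator", "judge"],
    ],
}


def expand_with_synonyms(topics, lexicon):
    """Add every word of every synonym set of each topic, ignoring word sense."""
    expanded = set(topics)
    for word in topics:
        for synset in lexicon.get(word, []):
            expanded.update(synset)
    return expanded


print(sorted(expand_with_synonyms(["design", "judge", "pretext"], TOY_LEXICON)))
```

Taking all synonym sets without sense disambiguation is exactly the simplification described above, which is why unrelated senses can slip into the result.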
The most interesting words of this set are:
pattern [Frequence: (2) Tag count: (9)]
Federal [Frequence: (3) Tag count: (0)]
plan [Frequence: (3) Tag count: (43)]
pretext [Frequence: (5) Tag count: (0)]
teach [Frequence: (5) Tag count: (0)]
there [Frequence: (6) Tag count: (0)]
federal [Frequence: (6) Tag count: (0)]
today [Frequence: (7) Tag count: (13)]
judge [Frequence: (10) Tag count: (3)]
class [Frequence: (12) Tag count: (190)]
sense [Frequence: (12) Tag count: (8)]
school [Frequence: (13) Tag count: (108)]
design [Frequence: (25) Tag count: (13)]
NOTE: If you check, you may think that the frequencies are wrong. The reason is that I added the frequencies from the previous set to those from the synonym sets.
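One possible reading of this note is sketched below. The exact formula is not given in this post, so the rule used here (previous frequency plus one for each appearance of the word in a synonym set) is an assumption, and both the synonym sets and the name `accumulate` are hypothetical.

```python
from collections import Counter


def accumulate(previous_counts, synsets):
    """Start from the previous frequencies and add 1 for every appearance
    of a word in a synonym set (assumed rule, not confirmed by the post)."""
    total = Counter(previous_counts)
    for synset in synsets:
        total.update(synset)
    return total


previous = {"design": 11, "pretext": 3}
synsets = [
    ["plan", "design"],
    ["blueprint", "design", "pattern"],
    ["pretext", "guise", "pretense"],
]

for word, frequency in accumulate(previous, synsets).most_common(3):
    print(word, frequency)
```

Under this reading, a word that is already frequent and also shows up in many synonym sets climbs faster than a word that only appears in its own, which is consistent with the ordering seen in the list above.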
There are three interesting facts: (1) the appearance of the concept “plan”; (2) the promotion of the concept “school”, driven by its semantic links with the synonym sets of the other words in the set; and (3) the demotion of the concept “pretext”.
The current set of possible topics is now defined by the 10 most frequent nouns we extracted and the synonym sets of each of these words.
The final step performed to find the topics of a conversation is to augment the set of possible topics with the words that describe the same concepts as the ones in the set (the sister concepts). The resulting set is defined by:
Texas_Independence_Day, February_22, March_2, Washington’s_Birthday, St_Patrick’s_Day, April_Fools’, March_17, Saint_Patrick’s_Day, February_14, St_Valentine’s_Day, February_2, Groundhog_Day, holiday, Tet, Lincoln’s_Birthday, February_12, Saint_Valentine’s_Day, Valentine’s_Day, Valentine_Day, April_Fools’_day, All_Fools’_day, Father’s_Day, June_14, Flag_Day, June_3, Citizenship_Day, September_17, October_24, United_Nations_Day, American_Indian_Day, Davis’_Birthday, Jefferson_Davis’_Birthday, Patriot’s_Day, April_14, Pan_American_Day, May_Day, First_of_May, Armed_Forces_Day, Mother’s_Day, May_1, January_19, Robert_E_Lee_Day, old_age, middle_age, adulthood, salad_days, geezerhood, deathbed, commencement_day, Arbor_Day, Admission_Day, bloom_of_youth, mid-nineties, golden_years, mid-sixties, sixties, seventies, mid-seventies, nineties, mid-eighties, eighties, degree_day, November_5, market_day, ides, election_day, polling_day, Walpurgis_Night, New_Year’s_Eve, Halloween, Robert_E_Lee’s_Birthday, December_31, payday, red-letter_day, leap_day,
[…]
measure, time, estimate, dull, strike, age, point, gauge, dissolve, denature, label, indicate, intention, order, acquaint, obscure, resolve, get, sensitize, moderate, sensitise, blunt, blur, division, contrive, take, draw, purpose, tame, report, course, try, construct, pattern, run, bring, touch, season, think, life, activate, break, grade, set, shift, feel, loosen, sense, year, night, project, convert, plan, judge, school, turn, figure, separate, train, develop, aim, transform, class, make, form, design
Click here to see the whole set
As you can see, there is a small exponential explosion. This is a problem, and it is the reason why I should make decisions, at each step, to keep only the best words that could describe the topics of a conversation.
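A minimal sketch of sister-concept expansion followed by pruning is given below. The child-to-parent map is a hypothetical toy taxonomy (treating sister concepts as words that share a parent concept is my assumption), and the helper names are invented for the example.

```python
from collections import Counter

# Hypothetical toy taxonomy: word -> parent concept.
TOY_PARENT = {
    "today": "day",
    "holiday": "day",
    "payday": "day",
    "school": "educational_institution",
    "college": "educational_institution",
}


def sister_terms(word, parent_of):
    """Return the other words that share the same parent concept."""
    parent = parent_of.get(word)
    if parent is None:
        return set()
    return {other for other, p in parent_of.items() if p == parent and other != word}


def expand_and_prune(counts, parent_of, keep=10):
    """Add each topic's sisters with a count of 1, then keep only the top words."""
    expanded = Counter(counts)
    for word in counts:
        for sister in sister_terms(word, parent_of):
            expanded[sister] += 1
    return expanded.most_common(keep)


print(expand_and_prune({"design": 25, "school": 13, "today": 7}, TOY_PARENT, keep=4))
```

Pruning with `keep` at every step is one simple way to contain the explosion: the candidate set never grows beyond a fixed size, at the cost of possibly discarding a word that would have been reinforced later.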
The most interesting words in this new set are:
think [Frequence: (19) Tag count: (0)]
life [Frequence: (19) Tag count: (107)]
activate [Frequence: (20) Tag count: (2)]
break [Frequence: (20) Tag count: (0)]
grade [Frequence: (20) Tag count: (17)]
set [Frequence: (21) Tag count: (24)]
shift [Frequence: (22) Tag count: (1)]
feel [Frequence: (22) Tag count: (5)]
loosen [Frequence: (22) Tag count: (0)]
sense [Frequence: (22) Tag count: (8)]
year [Frequence: (23) Tag count: (5)]
night [Frequence: (24) Tag count: (736)]
project [Frequence: (24) Tag count: (1)]
convert [Frequence: (25) Tag count: (0)]
plan [Frequence: (25) Tag count: (43)]
judge [Frequence: (25) Tag count: (3)]
school [Frequence: (26) Tag count: (108)]
turn [Frequence: (26) Tag count: (4)]
figure [Frequence: (26) Tag count: (0)]
separate [Frequence: (27) Tag count: (3)]
train [Frequence: (29) Tag count: (5)]
develop [Frequence: (30) Tag count: (45)]
aim [Frequence: (31) Tag count: (4)]
transform [Frequence: (32) Tag count: (3)]
class [Frequence: (32) Tag count: (190)]
make [Frequence: (40) Tag count: (34)]
form [Frequence: (40) Tag count: (1)]
design [Frequence: (50) Tag count: (13)]
Some interesting new words appeared, and some less interesting ones appeared too. This is just an example of the impact of adding the sets of words describing the sister concepts of the previous set of possible topics. We could do the same thing by adding the set of more general concepts related to our current set of concepts (Hypernymification) or by adding the set of more specific concepts related to our current set of concepts (Hyponymification).
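The two directions mentioned above can be sketched on a toy taxonomy. The child-to-parent map below is hypothetical, and so are the helper names; a real lexicon would supply these relations.

```python
# Hypothetical toy taxonomy: word -> more general concept.
TOY_TAXONOMY = {
    "day_school": "school",
    "school": "educational_institution",
    "magistrate": "judge",
    "judge": "official",
}


def hypernyms_of(word, parent_of):
    """Walk up the taxonomy and collect every more general concept."""
    chain = []
    seen = {word}
    current = parent_of.get(word)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


def hyponyms_of(word, parent_of):
    """Collect the direct, more specific concepts of a word."""
    return {child for child, parent in parent_of.items() if parent == word}


print(hypernyms_of("day_school", TOY_TAXONOMY))
print(hyponyms_of("judge", TOY_TAXONOMY))
```

Going up the hierarchy tends to merge topics into broader ones, while going down multiplies the candidates, so the same pruning concern applies to both.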
This first test, made with a real-world example, is quite interesting and even promising. So, what will I do with it? Keep checking talkdigger.com over the next month.
Technorati: npl | semantic | natural | text | analysis | topics | conversation | talkdigger |