Clojure, Cognonto, Semantic Web

Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two-layer artificial neural network that processes text to learn the relationships between the words of a text corpus and to build a model of those relationships. The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus.

In this article I will show you how Cognonto's knowledge base can be used to automatically create highly accurate, domain-specific training corpuses that word2vec can use to generate word relationship models. Note, however, that what is discussed here is not only applicable to word2vec, but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).

It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a word’s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.

Training Corpus

A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec; the only thing it does is learn the relationships between the words that exist in the text. However, not all training corpuses are equal. Training corpuses are often dirty, biased and ambiguous. Depending on the task at hand, that may be exactly what is required, but more often than not, their errors need to be fixed. Cognonto has the advantage of starting with clean text.

When we want to create a new training corpus, the first step is to find a source of text that can supply it. The second step is to select the text we want to add to it. The third step is to pre-process that text: removing HTML elements, removing punctuation, normalizing text, detecting named entities, etc. The final step is to train word2vec to generate the model.

word2vec is somewhat dumb. It only learns what exists in the training corpus. It does not do anything other than “reading” the text and analyzing the relationships between the words (which are really just groups of characters separated by spaces). The word2vec process is highly subject to the Garbage In, Garbage Out principle: if the training set is dirty, biased and ambiguous, then the learned relationships will end up being of little or no value.

Domain-specific Training Corpus

A domain-specific training corpus is a specialized training corpus whose text relates to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may discuss totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.

As we will see in this article, a domain-specific training corpus can be quite useful, and much more powerful than a general one, if the task at hand relates to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if you want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.

This is the kind of problem that Cognonto helps to resolve. Cognonto and KBpedia, its knowledge base, form a structure of ~39,000 reference concepts with ~138,000 links to the schemas of external data sources such as Wikipedia, Wikidata and USPTO. It is that structure and these links to external data sources that we use to create domain-specific training corpuses on the fly. We leverage the reference concept structure to select the concepts that should be part of the domain being defined. Then we use Cognonto's inference capabilities to infer the hundreds or thousands of other concepts that define the full scope of the domain. Then we analyze the concepts we selected that way to get all of their links to external data sources. Finally, we use these references to create the training corpus. All of this is done automatically once the initial few concepts that define the domain have been selected. The workflow looks like this:

cognonto-workflow
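As a rough sketch, this workflow boils down to a small pipeline. The helper names infer-domain-concepts and get-linked-entities below are hypothetical placeholders for the KBpedia services called through the web service API later in this article, while get-entity-content is the same helper used in the aggregation code below.

;; High-level sketch of the corpus-creation workflow (hypothetical helpers)
(defn build-domain-corpus
  [seed-concepts corpus-file]
  (doseq [entity (->> seed-concepts
                      (mapcat infer-domain-concepts)  ;; expand the seeds to the full domain scope
                      (mapcat get-linked-entities))]  ;; follow the links to external data sources
    ;; append the text of each linked entity to the corpus, one entity per line
    (spit corpus-file (str (get-entity-content entity) "\n") :append true)))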

The Process

To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will compare it against the Google News word2vec model created by Google, which was trained on a corpus of about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. I will use the Google News model as the general model to compare the results/performance of a domain-specific model against a general one.

Determining the Domain

The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained, to keep the training corpus from getting too large for demo purposes. The domain I have chosen is musicians. This domain is related to people and bands that play music. It is also related to musical genres, instruments, the music industry, etc.

To create my domain, I select a single KBpedia reference concept: Musician. If I wanted to broaden the scope of the domain, I could have included other concepts such as: Music, Musical Group, Musical Instrument, etc.
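In code, the scope of the domain is nothing more than the set of selected KBpedia reference concept URIs; broadening the domain simply means adding more URIs to the seed set. The extra URIs in the comment below are illustrative assumptions, not taken from the article.

;; Seed reference concept(s) that define the "musicians" domain
(def musician-domain
  #{"http://kbpedia.org/kko/rc/Musician"
    ;; to broaden the domain, add more reference concepts here, e.g. (hypothetical URIs):
    ;; "http://kbpedia.org/kko/rc/Music"
    ;; "http://kbpedia.org/kko/rc/MusicalInstrument"
    })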

Aggregating the Domain-specific Training Corpus

Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. The end result of this operation is to create a training corpus with text that is only related to the scope of the domain we defined.

(defn create-domain-specific-training-set
  [target-kbpedia-class corpus-file]
  (let [step 1000
        entities-dataset "http://kbpedia.org/knowledge-base/"
        kbpedia-dataset "http://kbpedia.org/kko/"
        nb-entities (get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset)]
    (loop [nb 0]
      (when (< nb nb-entities)
        ;; fetch the next slice of entities linked to the target class and
        ;; append the text of each entity to the corpus file, one per line
        (doseq [entity (get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset :limit step :offset nb)]
          (spit corpus-file (str (get-entity-content entity) "\n") :append true))
        (println (str (min (+ nb step) nb-entities) "/" nb-entities))
        (recur (+ nb step))))))

(create-domain-specific-training-set "http://kbpedia.org/kko/rc/Musician" "resources/musicians-corpus.txt")

What this code does is query the KBpedia knowledge base to get all the named entities that are linked to it within the scope of the domain we defined. The text related to each entity is then appended to a text file, where each line is the text of a single entity.

Given the scope of the current demo, the musicians training corpus is composed of 47,263 documents. This is the crux of the demo. With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function leverages 8 years of hard work).

Normalizing Text

The next step is a natural one in any NLP pipeline: before learning from the training corpus, we should clean and normalize its raw text.

(defn normalize-proper-name
  [name]
  (-> name
      (string/replace #" " "_")      
      (string/lower-case)))

(defn pre-process-line
  [line]  
  (-> (let [line (-> line
                     ;; 1. remove all underscores
                     (string/replace "_" " "))]
        ;; 2. detect named entities and replace them with their underscore form, like: Fred Giasson -> fred_giasson
        (loop [entities (into [] (re-seq #"[\p{Lu}]([\p{Ll}]+|\.)(?:\s+[\p{Lu}]([\p{Ll}]+|\.))*(?:\s+[\p{Ll}][\p{Ll}\-]{1,3}){0,1}\s+[\p{Lu}]([\p{Ll}]+|\.)" line))
               line line]
          (if (empty? entities)
            line
            (let [entity (first (first entities))]
              (recur (rest entities)                     
                     (string/replace line entity (normalize-proper-name entity)))))))
      ;; 3. remove the stop words defined in the stop-list regular expression
      (string/replace (re-pattern stop-list) " ")
      ;; 4. remove everything between brackets like: [1] [edit] [show]
      (string/replace #"\[.*\]" " ")
      ;; 5. remove punctuation characters except the dot and the single quote: (),[]-={}/\~!?%$@&*+:;<>
      (string/replace #"[\^\(\)\,\[\]\=\{\}\/\\\~\!\?\%\$\@\&\*\+:\;\<\>\"\p{Pd}]" " ")
      ;; 6. remove all numbers
      (string/replace #"[0-9]" " ")
      ;; 7. remove all words with 2 characters or less
      (string/replace #"\b[\p{L}]{0,2}\b" " ")
      ;; 8. normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 9. remove spaces before dots
      (string/replace #"\s\." ".")
      ;; 10. collapse sequences of dots
      (string/replace #"\.{1,}" ".")
      ;; 11. collapse sequences of underscores
      (string/replace #"\_{1,}" "_")
      ;; 12. remove standalone single quotes
      (string/replace " ' " " ")
      ;; 13. re-normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 14. put everything in lowercase
      (string/lower-case)

      (str "\n")))

(defn pre-process-corpus
  [in-file out-file]
  (spit out-file "" :append true)
  (with-open [file (clojure.java.io/reader in-file)]
    (doseq [line (line-seq file)]
      (spit out-file (pre-process-line line) :append true))))

(pre-process-corpus "resources/musicians-corpus.txt" "resources/musicians-corpus.clean.txt")

We remove all of the characters that may cause issues for the tokenizer used by the word2vec implementation. We also remove unnecessary words, along with words that appear too often or that add nothing to the model we want to generate (like the names of days and months). Finally, we drop all numbers.
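The stop-list referenced in pre-process-line is not shown above. Here is a minimal sketch of what it could look like; the actual word list used for the demo is an assumption on my part.

;; Hypothetical stop-list: a regex alternation of stop words and noisy terms
;; (day and month names, etc.) that add nothing to the model.
(def stop-list
  (str "(?i)\\b("
       (string/join "|" ["the" "and" "for" "with" "this" "that"
                         "monday" "tuesday" "wednesday"
                         "january" "february" "march"])
       ")\\b"))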

Training word2vec

The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the DL4J (Deep Learning for Java) library, which provides a Java implementation of the word2vec algorithm. Training word2vec is as simple as using the DL4J API like this:

(defn train
  [training-set-file model-file]
  (let [sentence-iterator (new LineSentenceIterator (clojure.java.io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        vec (.. (new Word2Vec$Builder)
                (minWordFrequency 1)       ;; keep every word that appears at least once
                (windowSize 5)             ;; context window of 5 words around the target word
                (layerSize 100)            ;; dimensionality of the generated word vectors
                (iterate sentence-iterator)
                (tokenizerFactory tokenizer)
                build)]
    (.fit vec)
    (SerializationUtils/saveObject vec (io/file model-file))
    vec))

(def musicians-model (train "resources/musicians-corpus.clean.txt" "resources/musicians-corpus.model"))

What is important to notice here is the number of parameters that can be defined when training word2vec on a corpus. In fact, the algorithm can be quite sensitive to its parametrization.
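For completeness, here is the kind of namespace declaration these snippets assume. I am inferring the DL4J packages from the class names used above; the exact packages may differ between DL4J versions, so treat this as an assumption rather than the article's original setup.

;; Hypothetical namespace declaration for the snippets in this article
(ns musicians-demo.core
  (:require [clojure.java.io :as io]
            [clojure.string :as string])
  (:import [org.deeplearning4j.models.word2vec Word2Vec Word2Vec$Builder]
           [org.deeplearning4j.models.embeddings.loader WordVectorSerializer]
           [org.deeplearning4j.text.sentenceiterator LineSentenceIterator]
           [org.deeplearning4j.text.tokenization.tokenizerfactory DefaultTokenizerFactory]
           [org.deeplearning4j.util SerializationUtils]))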

Importing the General Model

The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose is the Google News model, which is composed of billions of words but is highly general. DL4J can import that model without our having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):

(defn import-google-news-model
  []
  (org.deeplearning4j.models.embeddings.loader.WordVectorSerializer/loadGoogleModel (clojure.java.io/file "GoogleNews-vectors-negative300.bin.gz") true))

(def google-news-model (import-google-news-model))

Playing With Models

Now that we have a domain-specific model related to musicians and a general model related to news processed by Google, let’s start playing with both to see how they perform on different tasks. In the following examples, we will always compare the domain-specific training corpus with the general one.

Ambiguous Words

A characteristic of words is that their surface form can be ambiguous; they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not have any shared meaning. But all of this depends on the context. In a general context, this situation happens more often than we might think and will impact the similarity scores of these ambiguous terms. However, as we will see, this phenomenon is greatly diminished when we use domain-specific models.

Similarity Between Piano, Organ and Violin

What we want to check is the relationship between three different musical instruments: piano, organ and violin.

(.similarity musicians-model "piano" "violin")
0.8422856330871582
(.similarity musicians-model "piano" "organ")
0.8573281764984131

As we can see, both pairs have a high likelihood of co-occurrence, which suggests that the terms of each pair are highly related. In this case, it is probably because violins are often played along with a piano, and because an organ looks like a piano (at least it has a keyboard).

Now let’s take a look at what the general model has to say about that:

(.similarity google-news-model "piano" "violin")
0.8228187561035156
(.similarity google-news-model "piano" "organ")
0.46168726682662964

The surprising fact here is the apparent dissimilarity between piano and organ compared with the results we got from the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results make sense. In fact, organ is an ambiguous word in a general context: an organ can be a musical instrument, but it can also be a part of an anatomy. This means that the word organ will co-occur alongside piano, but also alongside all kinds of other words related to human and animal biology. This is why the two words are less similar in the general model than in the domain-specific one.

Similarity Between Album and Track

Now let's look at another similarity example between two other words, album and track, where track is an ambiguous word depending on the context.

(.similarity musicians-model "album" "track")
0.7943775653839111
(.similarity google-news-model "album" "track")
0.18461623787879944

As expected, because track is ambiguous, there is a big difference in co-occurrence probabilities depending on the context (domain-specific or general).

Similarity Between Pianist and Violinist

However, do domain-specific and general models always differ like this? Let's take a look at two words that are domain-specific and unambiguous: pianist and violinist.

(.similarity musicians-model "pianist" "violinist")
0.8430571556091309
(.similarity google-news-model "pianist" "violinist")
0.8616064190864563

In this case, the similarity score between the two terms is almost the same. In both contexts (general and domain-specific), their co-occurrence is similar.

Nearest Words

Now let's move from pairwise similarity to nearest words: let's take a few words and see which other words occur most often with them in each model.

Music

(.wordsNearest musicians-model ["music"] [] 7)
music revol samoilovich bunin musical amalgamating assam. voice dance.
(.wordsNearest google-news-model ["music"] [] 8)
music classical music jazz Music Without Donny Kirshner songs musicians tunes

One observation we can make is that the terms from the musicians model are more general than the ones from the general model.

Track

(.wordsNearest musicians-model ["track"] [] 10)
track released. album latest entitled released debut year. titled positive
(.wordsNearest google-news-model ["track"] [] 5)
track tracks Track racetrack horseshoe shaped section

As we know, track is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians’ domain-specific model. But in the general model, it is really going in all directions.

Year

Now let’s take a look at a really general word: year

(.wordsNearest musicians-model ["year"] [] 11)
year ghantous. he was grammy naacap grammy award for best luces del alma year. grammy award grammy for best sitorai sol nominated
(.wordsNearest google-news-model ["year"] [] 10)
year month week months decade years summer year.The September weeks

This one is quite interesting too. Both groups of words make sense, but only in their respective contexts. With the musicians' model, year is mostly related to awards (like the Grammy Awards 2016), categories like “song of the year”, etc.

In the context of the general model, year is really related to time concepts: months, seasons, etc.

Playing With Co-Occurrences Vectors

Finally, we will play with the co-occurrence vectors by manipulating them directly. A really popular word2vec equation is king - man + woman = queen. What happens under the hood with this equation is that we add and subtract the co-occurrence vectors of these words, and then check which words are nearest to the resulting vector.
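In DL4J, this equation maps directly onto the wordsNearest call used below: the first collection lists the vectors to add, the second the vectors to subtract. Whether queen actually comes back depends on the model used; this is just an illustration of the call.

;; king - man + woman ≈ queen (the actual result depends on the model)
(.wordsNearest google-news-model ["king" "woman"] ["man"] 1)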

Now, let’s take a look at a few of these equations.

Pianist + Renowned = ?

(.wordsNearest musicians-model ["pianist" "renowned"] [] 9)
pianist renowned teacher. composer. prolific virtuoso teacher leading educator.
(.wordsNearest google-news-model ["pianist" "renowned"] [] 7)
renowned pianist pianist composer jazz pianist classical pianists composer pianist virtuoso pianist

These kinds of operations are quite interesting. If we add the two co-occurrence vectors for pianist and renowned, then we find that a teacher, an educator, a composer or a virtuoso is a renowned pianist.

For unambiguous surface forms like pianist, the two models score quite well. The difference between the two examples comes from the way the general training corpus was created (pre-processed) compared to the musicians corpus.

Metal + Death = ?

(.wordsNearest musicians-model ["metal" "death"] [] 10)
metal death thrash deathcore melodic doom grindcore metalcore mathcore heavy
(.wordsNearest google-news-model ["metal" "death"] [] 5)
death metal Tunstallbled steel Death

This example uses two quite general words with no apparent relationship between them. The results from the musicians' model are all highly similar genres of music like thrash metal, deathcore, etc.

However with the general model, it is a mix of multiple unrelated concepts.

Metal – Death + Smooth = ?

Let’s play some more with these equations. What if we want some kind of smooth metal?

(.wordsNearest musicians-model ["metal" "smooth"] ["death"] 5)
smooth fusion funk hard neo

This one is quite interesting. We subtracted the death co-occurrence vector from the metal one, and then we added the smooth vector. What we end up with is a bunch of music genres that are much smoother than death metal.

(.wordsNearest google-news-model ["metal" "smooth"] ["death"] 5)
smooth metal Brushed aluminum durable polycarbonate chromed steel

In the case of the general model, we end up with “smooth metal”. The removal of the death vector has no effect on the results, probably because these are three ambiguous and really general terms.

What Is Next

The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what the other possibilities are. One possibility is to link your own private datasets to KBpedia. That way, these private datasets would become usable, in exactly the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text like local text documents, or semi-structured text like a set of HTML web pages, and tag each document with KBpedia reference concepts using the Cognonto topics analyzer. Then we could use the KBpedia structure in exactly the same way to choose which of these documents we want to include in the domain-specific training corpus.

Conclusion

As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results, making them much more meaningful within the scope of that domain. Another advantage of domain-specific training corpuses is that they produce much smaller models. This is quite an interesting characteristic, since smaller models are faster to generate, faster to download/upload, faster to query, consume less memory, etc.

Of the concepts in KBpedia, roughly 33,000 correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.

Cognonto, Semantic Web

Cognonto

I am proud to announce the start of a new venture called Cognonto. I am particularly proud of it because even if it is just starting, it is in fact more than eight years old. It is the embodiment of eight years of research, of experimentation, of a great deal of frustration and of great joy with my long-time partner Mike.

Eight years ago, we set a 5-to-10-year vision for our work as partners. We defined an initial series of technological goals for which we outlined a series of yearly milestones. The goals were related to helping solve decades-old problems with data integration and interoperability using a completely new research field (at the time): the Semantic Web.

And here we are eight years later, after working an endless number of hours to create all kinds of different projects and services to pay for the research and the pieces of technology we developed for these purposes. Cognonto is the embodiment of that effort, but the effort also produced a series of other purposeful projects such as Structured Dynamics, UMBEL, the Open Semantic Framework and a series of other open source collaterals.

We spent eight years creating, sanitizing, making coherent and consistent, generating and regenerating a conceptual structure of now 38,930 reference concepts with 138,868 mapping links to 27 external schemas, vocabularies and datasets. This led to the creation of KBpedia, the knowledge graph that drives Cognonto. The full statistics are available here.

I can't thank Mike enough for this long and wonderful journey that led to the creation of Cognonto. I sent him an endless number of concept lists that he diligently screened, assessed and mapped. We spent hundreds of hours discussing the nuts and bolts of the structure, arguing about its core concepts and how they should be defined and used. It was not without pain, but I believe that the result is truly astonishing.

I won't copy/paste the Cognonto press release here; a link will suffice. It is just not possible for me to write a better introduction than the two-pager that Mike wrote for the press release. I would also suggest that you read his Cognonto introduction blog post: Cognonto is on the Hunt for Big AI Game.

In the coming weeks, I will write a lot about Cognonto: what it is, how it can be used, what its use cases are, how the information presented in the demo and knowledge graph sections should be interpreted, and what these pages tell you.

Open Semantic Framework, OSF for Drupal, Planet Drupal, Semantic Web

Winnipeg City’s NOW [Data] Portal

The Winnipeg City's NOW (Neighbourhoods Of Winnipeg) Portal is an initiative to create a complete neighbourhood web portal for its citizens. At the core of the project we have a set of about 47 fully linked, integrated and structured datasets of things of interest to Winnipegers. The focal point of the portal is Winnipeg's 236 neighbourhoods, which define the main structure of the portal. The portal has six main sections: topics of interest, maps, history, census, images and economic development. The portal is meant to be used by citizens to find things of interest in their neighbourhood, to learn its history, to see images of the things of interest, to find tools to help economic development, etc.

The NOW portal is not new; Structured Dynamics was also its main technical contractor for its first release in 2013. However, we just finished helping Winnipeg City's NOW team migrate their older NOW portal from OSF 1.x to OSF 3.x and from Drupal 6 to Drupal 7; we also trained them on the new system. Major improvements accompany this upgrade, but the user interface design is essentially the same.

First I will introduce each major section of the portal and explain its main features. Then I will discuss the new improvements to the portal.

Datasets

A NOW portal user won't notice any of this, but the main feature of the portal is the data it uses. The portal manages 47 (and growing) fully structured, integrated and linked datasets of things of interest to Winnipegers. What the portal does is manage entities. Each kind of entity (swimming pools, parks, places, images, addresses, streets, etc.) is defined with multiple properties and values. Several of the entities reference other entities in other datasets (for example, an assessment parcel from the Assessment Parcels dataset references neighbourhood entities and property address entities from their respective datasets).

The fact that these datasets are fully structured and integrated means that we can leverage these characteristics to create a powerful search experience: filtering the information on any of the properties, biasing the searches depending on where a keyword match occurs, etc.

Here is the list of all the 47 datasets that currently exist in the portal:

  1. Aboriginal Service Providers
  2. Arenas
  3. Neighbourhoods of Winnipeg City
  4. Streets
  5. Economic Development Images
  6. Recreation & Leisure Images
  7. Neighbourhoods Images
  8. Volunteer Images
  9. Library Images
  10. Parks Images
  11. Census 2006
  12. Census 2001
  13. Winnipeg Internal Websites
  14. Winnipeg External Websites
  15. Heritage Buildings and Resources
  16. NOW Local Content Dataset
  17. Outdoor Swimming Pools
  18. Zoning Parcels
  19. School Divisions
  20. Property Addresses
  21. Wading Pools
  22. Electoral wards of Winnipeg City
  23. Assessment Parcels
  24. Libraries
  25. Community Centres
  26. Police Service Centers
  27. Community Gardens
  28. Leisure Centres
  29. Parks and Open Spaces
  30. Community Committee
  31. Commercial real estates
  32. Sports and Recreation Facilities
  33. Community Characterization Areas
  34. Indoor Swimming Pools
  35. Neighbourhood Clusters
  36. Fire and Paramedic Stations
  37. Bus Stops
  38. Fire and Paramedic Service Images
  39. Animal Services Images
  40. Skateboard Parks
  41. Daycare Nurseries
  42. Indoor Soccer Fields
  43. Schools
  44. Truck Routes
  45. Fire Stations
  46. Paramedic Stations
  47. Spray Parks Pads

Structured Search

The most useful feature of the portal, to me, is its full-text search engine. It is simple, clean and quite effective. The search engine is configured to try to give the most relevant results a NOW portal user may be searching for. For example, it will positively bias results that come from specific datasets, or matches that occur in specific property values. The goal of this biasing is to improve the quality of the returned results. This is somewhat easy to do since the context of the portal is well known and we can easily boost the scoring of search results since everything is fully structured.

Another major gain is that all the search results are fully templated. The search results do not simply return a title and some description. The system templates all the information it has about the matched results and displays the most relevant information to the users directly in the search results.

For example, if I search for an indoor swimming pool, in most cases it may be to call the front desk to get some information about the pool. This is why different key pieces of information are displayed directly in the search results. That way, most users won't even have to click on a result to get the information they were looking for; it is right there in the search results page.

Here is an example of a search for the keywords main street. As you can see, you get different kinds of results. Each result is templated to show the core information about these entities. You have the possibility to focus on particular kinds of entities, or to filter by their location in specific neighbourhoods.

now--search-1

Templated Search Results

Now let's see some of the kinds of entities that can be searched on the portal and how they are presented to the users.

Here is an example of an assessment parcel that is located in the St. John's neighbourhood. The address, the value, the type and the location of the parcel on a map are displayed directly in the search results.

now--template-search-assessment-pacels

Another kind of entity that can be searched is the property address. These are located on a map, and the value of the parcel, the building and the zoning of the address are displayed. The property is also linked to its assessment parcel entity, which can be clicked to get additional information about the parcel.

now--template-search-property-address

Another interesting type of entity that can be searched is the street. What is interesting in this case is that you get the complete outline of the street directly on a map. That way you know where it starts, where it ends and where it is located in the city.

now--template-search-street

There are more than a thousand geo-localized images of all kinds of things in the city that can be searched. A thumbnail of the image and the location of its subject appear in the search results.

now--template-search-heritage-building-image

If you were searching for a nursery for your newborn child, you can quickly see the name, the location on a map and the phone number of the nursery directly in the search result.

now--template-search-nurseries

These are just a few examples of the fifty different kinds of entities that can appear like this in the search results.

Mapping

The mapping tool is another powerful feature of the portal. You can search as if you were using the full-text search engine (the top search box on the portal), but you will only get results that can be geo-localized on a map. You can also simply browse entities from a dataset, or filter entities by their properties/values. You can persist entities you find on the map and save the map for future reference.

The example below shows that someone searched for a street (main street) and persisted it on the map. Then he searched for other things like nurseries and selected the ones that are near the street he persisted, etc. That way he can visualize the different known entities in the portal on a map to better understand where things are located in the city, what exists near a certain location, within a neighbourhood, etc.

now--map

Census Analysis

Census information is vital to the good development of a city. It is necessary to understand the trends of a sector, who populates it, etc., so that the city and other organizations may properly plan their projects to have as much impact as possible.

These are some of the reasons why one of the main sections of the site is dedicated to census data. Key census indicators have been configured in the portal. Users can select different kinds of regions (neighbourhood clusters, community areas and electoral wards) to get the numbers for each of these indicators. They can then select multiple regions to compare them with each other. A chart view and a table view are available for presenting the census data.

now--census

History, Images & Points of Interest

The City took the time to write the history of each of its neighbourhoods. In addition to that, they hired professional photographers to photograph the points of interest of the city, geo-localize them and write a description for each of these photos. Because of this dedication, users of the portal can learn a lot about the city in general and the neighbourhood they live in. This is what the History and Images sections of the website are about.

now--history

Historic buildings are displayed on a map and they can be browsed from there.

now--history-heritage-buildings

Images of points of interest in the neighbourhood are also located on a map.

now--history-heritage-resources

Find Your Neighbourhood

Ever wondered which neighbourhood you live in? No problem: go to the home page, put your address in the Find your Neighbourhood section and you will know it right away. From there you can learn more about your neighbourhood, like its history, its points of interest, etc.

now--find-your-neighbourhood

Your address will be located on a map, and your neighbourhood will be outlined around it. Not only will you know which neighbourhood you live in, but you will also know where you live within it. From there you can click on the name of the neighbourhood to get to the neighbourhood's page and start learning more about it, like its history, or see photos of the points of interest that exist in your neighbourhood, etc.

now--find-your-neighbourhood-result

Browsing Content by Topic

Because all the content of the portal is fully structured, it is easy to browse it using a well-defined topic structure. The city developed its own ontology, which helps users browse the content of the portal by topics of interest. In the example below, I clicked the Economic Development node and then the Land use topic. Finally, I clicked the Map button to display things that are related to land use: in this case, zoning and assessment parcels are displayed to the user.

This is another way to find meaningful and interesting content from the portal.

now--topics

Depending on the topic you choose, and the kind of information related to that topic, you may end up with different options like a map, a list of links to documents related to that topic, etc.

Export Content

Now that I have given an overview of each of the main features of the portal, let's get back to the geeky things. The first thing I said about this portal is that, at its core, all the information it manages is fully structured, integrated and linked data. If you get to the page of an entity, you have the possibility to see the underlying data that exists about it in the system. You simply have to click the Export tab at the top of the entity's page. Then you will have access to the description of that entity in multiple different formats.

now--export-entity

In the future, the City should (or at least I hope will) make the whole set of datasets fully downloadable. Right now you only have access to that information via the per-entity export feature. I say “hope” because this NOW portal is fully disconnected from another initiative by the city, data.winnipeg.ca, which uses Socrata. The problem is that barely any of the NOW datasets are available on data.winnipeg.ca, and the ones that do appear are the raw versions (semi-structured, un-documented, un-integrated and non-linked): none of the normalization, integration and linkage work done by the NOW team has been leveraged to improve the data.winnipeg.ca dataset catalog.

New with the upgrades

Those who are familiar with the NOW portal will notice a few changes. The user interface did not change that much, but multiple little things got improved in the process. I will cover the most notable of these changes.

The major changes happened in the backend of the portal. The data management in OSF for Drupal 7 is incompatible with what was available in Drupal 6. The management of the entities became easier, and the configuration of OSF networks became a breeze. A revisioning system has been added, the user interface is more intuitive, etc. There is no comparison possible. However, portal users won't notice any of this, since these are all site administrator functions.

The first thing users will notice is the completely new full-text search engine. The underlying search engine is almost the same, but the presentation is far better. Every entity type now has its own template, so each is displayed in its own way in the search results. Most of the time results should be much more relevant, and filtering is easier and cleaner. The search experience is much better in my view.

The overall site performance is much better since different caching strategies have been put in place in OSF 3.x and OSF for Drupal. This means that most of the features of the portal should react more swiftly.

Every type of entity managed by the portal is now templated: its webpage is templated in a specific way to optimize the information it conveys to users, along with its search result “mini page” when it gets returned as the result of a search query.

Multilinguality is now fully supported by the portal; however, not everything is currently templated. Expect a fully translated French version of the NOW portal in the future.

Creating a Network of Portals

One of the most interesting features that comes with this upgrade is that the NOW portal is now in a position to participate in a network of OSF instances. What does that mean? Well, it means that the NOW portal could create partnerships with other local (regional, national or international) organizations to share datasets (and their maintenance costs).

Are there other organizations that use this kind of system? Well, there is at least one other right in Winnipeg City: MyPeg.ca, also developed by Structured Dynamics. MyPeg uses RDF to model its information and uses OSF to manage it. MyPeg is a non-profit organization that uses census (and other indicator) data to do studies on the well-being of Winnipegers. The team behind MyPeg.ca are research experts in indicator data. Their indicator datasets (which include census data) are top notch.

Let's hypothesize that there would be interest between the two groups to start collaborating. Let's say that the NOW portal would like to use MyPeg's census datasets instead of its own, since they are more complete, accurate and include a larger number of important indicators. What they basically want is to outsource the creation and maintenance of the census/indicator data to a local, dedicated and highly professional organization. The only things they would need to do are to:

  1. Formalize their relationship by signing a usage agreement
  2. The NOW portal would need to configure the MyPeg.ca OSF network into their OSF for Drupal instance
  3. The NOW portal would need to register the datasets it wants to use from MyPeg.ca.

Once these 3 steps are done, taking no more than a couple of minutes, the system administrators of the NOW portal could start using the MyPeg.ca indicator datasets as if they existed on their own network. (The reverse could also be true for MyPeg.) Everything would be transparent to them. From then on, all the fixes and updates performed by MyPeg.ca on their indicator datasets would immediately appear on the NOW portal and be accessible to its users.

This is one way to collaborate. Another possibility would be to simply share the serialized datasets on a routine basis (every month, every six months, every year) so that the NOW portal can re-import the datasets from the files shared by MyPeg.ca. This is also possible since both organizations use the same ontology to describe the indicator data. This means that no modification is required by the City to take the new information into account; they only have to import it and update their local datasets. This is the beauty of ontologies.

Conclusion

The new NOW portal is a great service for citizens of Winnipeg City. It is also a really good example of a web portal that leverages fully structured, integrated and linked data. To me, the NOW portal is a really good example of the features that should go along with a municipal data portal.