Introducing ReadNext: A Personal Papers Recommender

Every day, approximately 500 new papers are published in the cs category on arXiv, with tens of new papers in cs.AI alone. Amidst the recent craze around Generative AI, I found it increasingly challenging to keep up with the rapid influx of papers. Distilling the ones that were most relevant to my work and my employer’s interests became a daunting task.

ReadNext is born out of my need to have a command-line tool that gets the most recent papers from arXiv, and feed the most relevants ones to my current interests into Zotero.

The key focus is to recommend papers that align with my evolving interests and research objectives, which may change on a daily basis and need to be continuously accounted for.

Why ReadNext?

  • Command-line Tool: ReadNext can be executed directly or scheduled as a cron job for easy access.
  • ReadNext fetches the latest papers from arXiv, ensuring you’re informed about your current interests
  • ReadNext integrates with Zotero, allowing you to manage your research library and organize recommended papers.
  • The core focus of ReadNext is to provide personalized paper recommendations based on your research interests, directly in your personal papers management tool.

How to Install

Getting started with ReadNext is simple. Install it using pip:

pip install readnext

Requirements

ReadNext relies on two fundamental external services to enhance its functionality:

  • Zotero: Zotero serves as the primary papers management tool, playing a pivotal role in ReadNext’s workflow. To configure ReadNext on your local computer, you have to create a Zotero account. If you do not already have one, you will have to create one for yourself, please refer to the section below.
  • Cohere: ReadNext leverages Cohere’s services for generating paper embeddings and summaries. These embeddings and summaries are essential components for providing personalized and relevant paper recommendations. It is necessary to create an account with Cohere. We will be expending support for additional embeddings and summarization services in the future, offering increased flexibility.

By integrating these services, ReadNext helps in discovering papers that align with your research interests and focus.

Read more about how to properly configure ReadNext here.

How Does ReadNext Work?

  1. As a Zotero user, I will create one or multiple “Focus” collections in my Zotero library. Those are the collections where I will add the papers that are the most interesting to my current research. It is expected that the content of those collections will change over time as my research focus and interests evolves.
  2. On a daily basis, I will run readnext in my terminal, or I will create a cron job to run it automatically for me.
    1. ReadNext will fetch the latest papers from arXiv
    2. ReadNext will identify the papers that are relevant to your research focus, as defined in Zotero
    3. ReadNext will propose the relevant papers to me and add them to Zotero in a dedicated collection where proposed papers are saved
  3. I will go in Zotero, start to read the proposed papers, and if any are of a particular interest I will add them to one of the “Focus” collections
  4. ReadNext will learn from your feedback to improve the quality of the proposed papers

How to Use ReadNext?

Using ReadNext is easy. Here are the main commands you’ll use:

Help

To get contextual help for any command, run:

readnext --help 
readnext personalized-papers --help

Get New Paper Proposals

The following command will propose 3 papers from the cs.AI caterory, based on the Readnext-Focus-LLMcollection in my Zotero library, save them in Zotero in the Readnext-Propositions-LLM with all related artifacts:

readnext personalized-papers cs.AI Readnext-Focus-LLM --proposals-collection=Readnext-Propositions-LLM --with-artifacts --nb-proposals=3

Full documentation of how to use the command line tool is available here.

Future Work and Contributions

Future work includes adding an abstraction layer for multiple embedding services, expanding paper sources, enhancing test coverage, providing interactive configuration, and refining the paper selection process.

Contributions to ReadNext are welcome! Follow the steps outlined in the README file of the project to contribute.

A Machine Learning Workflow

I am giving a talk (in French) at the 85th edition of the ACFAS congress, May 9. I will discuss the engineering aspects of doing machine learning. But more importantly, I will discuss how Semantic Web techniques, technologies and specifications can help solving the engineering problems and how they can be leveraged and integrated in a machine learning workflow.

The focus of my talk is based on my work in the field of the semantic web in the last 15 years and my more recent work creating the KBpedia Knowledge Graph at Cognonto and how they influenced our work to develop different machine learning solutions to integrate data, to extend knowledge structure, to tag and disambiguate concepts and entities in corpuses of texts, etc.

One thing we experienced is that most of the work involved in such project is not directly related to machine learning problems (or at least related to the usage of machine learning algorithms). And then I recently read a survey conducted by CrowdFlower in 2016 that support what we experienced. They surveyed about 80 data scientists to probe them to find out “where they feel their profession is going, [and] what their day-to-day job is like” To the question: “What data scientists spend the most time doing”, they answered:

Continue reading “A Machine Learning Workflow”

Disambiguating KBpedia Knowledge Graph Concepts

One of the most important natural language processing tasks is to “tag” concepts in text. Tagging a concept means determining whether words or phrases in a text document matches any of the concepts that exist in some kind of a knowledge structure (such as a knowledge graph, an ontology, a taxonomy, a vocabulary, etc.). (BTW, a similar definition and process applies to tagging an entity.) What is usually performed is that the input text is parsed and normalized in some manner. Then all of the surface forms of the concepts within the input knowledge structure (based on their preferred and alternative labels) are matched against the words within the text. “Tagging” is when a match occurs between a concept in the knowledge structure and one of its surface forms in the input text.

But here is the problem. Given the ambiguous world we live in, often this surface form, which after all is only a word or phrase, may be associated with multiple different concepts. When we identify the surface form of “bank”, does that term refer to a financial institution, the shore of a river, a plane turning, or a pool shot? identical surface forms may refer to totally different concepts. Further, sometimes a single concept will be identified but it won’t be the right concept, possibly because the right one is missing from the knowledge structure or other issues.

A good way to view this problem of ambiguity is to analyze a random Web page using the Cognonto Demo online application. The demo usea the Cognonto Concepts Tagger service to tag all of the existing KBpedia knowledge graph concepts found in the target Web page. Often, when you analyze what has been tagged by the demo, you may see some of these amgibuities or wrongly tagged concepts yourself. For instance, check out this example. If you mouse over the tagged concepts, you will notice that many of the individual “tagged” terms refer to multiple KBpedia concepts. Clearly, in its basic form, this Cognonto demo is not disambiguating the concepts.

The purpose of this article is thus to explain how we can “disambiguate” (that is, suggest the proper concept from an ambiguous list) the concepts that have been tagged. What we will do is to show how we can leverage the KBpedia knowledge graph structure as-is to perform this disambiguation. What we will do is to create graph embeddings for each of the KBpedia concepts using the DeepWalk algorithm. Then we will perform simple linear algebra equations on the graph embeddings to determine if the tagged concept(s) is the right one given its context or not. We will test multiple different algorithms and strategies to analyze the impact on the overall disambiguation performance of the system.

Continue reading “Disambiguating KBpedia Knowledge Graph Concepts”

Extended KBpedia With Wikipedia Categories

A knowledge graph is an ever evolving structure. It needs to be extended to be able to cope with new kinds of knowledge; it needs to be fixed and improved in all kinds of different ways. It also needs to be linked to other sources of data and to other knowledge representations such as schemas, ontologies and vocabularies. One of the core tasks related to knowledge graphs is to extend its scope. This idea seems simple enough, but how can we extend a general knowledge graph that has nearly 40,000 concepts with potentially multiple thousands more? How can we do this while keeping it consistent, coherent and meaningful? How can we do this without spending undue effort on such a task? These are the questions we will try to answer with the methods we cover in this article.

The methods we are presenting in this article are how we can extend Cognonto‘s KBpedia Knowledge Graph using an external source of knowledge, one which has a completely different structure than KBpedia and one which has been built completely differently with a different purpose in mind than KBpedia. In this use case, this external resource is the Wikipedia categories structure. What we will show in this article is how we may automatically select the right Wikipedia categories that could lead to new KBpedia concepts. These selections are made using a SVM classifier trained over graph embedding vectors generated by a DeepWalk model based on the KBpedia Knowledge Graph structure linked to the Wikipedia categories. Once appropriate candidate categories are selected using this model, the results are then inspected by a human to take the final selection decisions. This semi-automated process takes 5% of the time it would normally take to conduct this task by comparable manual means.

Continue reading “Extended KBpedia With Wikipedia Categories”

Leveraging KBpedia Aspects To Generate Training Sets Automatically

In previous articles I have covered multiple ways to create training corpuses for unsupervised learning and positive and negative training sets for supervised learning 1 , 2 , 3 using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets. Each of these corpuses or sets may yield different predictive powers depending on the task at hand.

So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:

  1. Using the links that exist between each KBpedia reference concept and their related Wikipedia pages
  2. Using the linkages between KBpedia reference concepts and external vocabularies to create training corpuses out of
    named entities.

Now we will introduce a third way to create a different kind of training corpus:

  1. Using the KBpedia aspects linkages.

Aspects are aggregations of entities that are grouped according to their characteristics different from their direct types. Aspects help to group related entities by situation, and not by identity nor definition. It is another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not all aspects relate to a given entity.

[extoc]

Continue reading “Leveraging KBpedia Aspects To Generate Training Sets Automatically”