How to Deploy Hugging Face Models in a Docker Container

In this short tutorial, we will explore how Hugging Face models can be deployed in a Docker Container and exposed as a web service endpoint.

The service it exposes is a translation service from English to French and French to English.

Why someone would like to do that? Other than to learn about those specific technologies, it is a very convenient way to try and test the thousands of models that exists on Hugging Face, in a clean and isolated environment that can easily be replicated, shared or deployed elsewhere than on your local computer.

In this tutorial, you will learn how to use Docker to create a container with all the necessary code and artifacts to load Hugging Face models and to expose them as web service endpoints using Flask.

All code and configurations used to write this blog post are available in this GitHub Repository. You simply have to clone it and to run the commands listed in this tutorial to replicate the service on your local machine.

Installing Docker

The first step is to install Docker. The easiest way is by simply installing Docker Desktop which is available on MacOS, Windows and Linux.

Creating the Dockerfile

The next step is to create a new Git repository where you will create a Dockerfile. The Dockerfile is where all instructions are written that tells Docker how to create the container.

I would also strongly encourage you to install and use hadolint, which is a really good Docker linter that helps people to follow Docker best practices. There is also a plugin for VS Code if this is what you use as you development IDE.

Base image and key installs

The first thing you define in a Dockerfile is the base image to use to initialize the container. For this tutorial, we will use Ubuntu’s latest LTS:

# Use Ubuntu's current LTS
FROM ubuntu:jammy-20230804

Since we are working to create a Python web service that expose the predictions of a ML model, the next step is to add they key pieces required for the Python service. Let’s make sure that you only include what is necessary to minimize the size, and complexity, of the container as much as possible:

# Make sure to not install recommends and to clean the 
# install to minimize the size of the container as much as possible.
RUN apt-get update && \
    apt-get install --no-install-recommends -y python3=3.10.6-1~22.04 && \
    apt-get install --no-install-recommends -y python3-pip=22.0.2+dfsg-1ubuntu0.3 && \
    apt-get install --no-install-recommends -y python3-venv=3.10.6-1~22.04 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

This instruct Docker to install Python3, pip and venv. It also ensures that apt get cleaned of cached files, that nothing more is installed and that we define the exact version of the package we want to install. That is to ensure that we minimize the size of the container, while making sure that the container can easily be reproduced, with the exact same codebase, any time in the future.

Another thing to note: we run multiple commands with a single RUN instruction by piping them together with &&. This is to minimize the number of layers created by Docker for the container, and this is a best practice to follow when creating containers. If you don’t do this and run hadolint, then you will get warning suggesting you to refactor your Dockerfile accordingly.

Copy required files

Now that the base operating system is installed, the next step is to install all the requirements of the Python project we want to deploy in the container:

# Set the working directory within the container
WORKDIR /app

# Copy necessary files to the container
COPY requirements.txt .
COPY main.py .
COPY download_models.py .

First we define the working directory with the WORKDIR instruction. From now on, every other instruction will run from that directory in the container. We copy the local files: requirements.txt, main.py and download_models.py to the working directory.

Create virtual environment

Before doing anything with those files, we are better creating a virtual environment where to install all those dependencies. Some people may wonder why we create an environment within an environment? It is further isolation between the container and the Python application to make sure that there is no possibility of dependencies clashes. This is a good best practice to adopt.

# Create a virtual environment in the container
RUN python3 -m venv .venv

# Activate the virtual environment
ENV PATH="/app/.venv/bin:$PATH"

Install application requirements

Once the virtual environment is created and activated in the container, the next step is to install all the required dependencies in that new environment:

    # Install Python dependencies from the requirements file
RUN pip install --no-cache-dir -r requirements.txt && \
    # Get the models from Hugging Face to bake into the container
    python3 download_models.py

It runs pip install to install all the dependencies listed in requirements.txt. The dependencies are:

transformers==4.30.2
flask==2.3.3
torch==2.0.1
sacremoses==0.0.53
sentencepiece==0.1.99

Just like the Ubuntu package version, we should (have to!) pin (specify) the exact version of each dependency. This is the best way to ensure that we can reproduce this environment any time in the future and to prevent unexpected crashes because code changed in some downstream dependencies that causes issues with the code.

Downloading all models in the container

As you can see in the previous RUN command, the next step is to download all models and tokenizers in the working directory such that we bake the model’s artifacts directly in the container. That will ensures that we minimize the time it takes to initialize a container. We spend the time to download all those artifacts at build time instead of run time. The downside is that the containers will be much bigger depending on the models that are required.

The download_models.py file is a utility file used to download the Hugging Face models used by the service directly into the container. The code simply download the models and tokenizer files from Hugging Face and save them locally (in the working directory of the container):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import os

def download_model(model_path, model_name):
    """Download a Hugging Face model and tokenizer to the specified directory"""
    # Check if the directory already exists
    if not os.path.exists(model_path):
        # Create the directory
        os.makedirs(model_path)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Save the model and tokenizer to the specified directory
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)

# For this demo, download the English-French and French-English models
download_model('models/en_fr/', 'Helsinki-NLP/opus-mt-en-fr')
download_model('models/fr_en/', 'Helsinki-NLP/opus-mt-fr-en')

Creating the Flask translation web service endpoint

The last thing we have to do with the Dockerfile is to expose the port where the web service will be available and to tell the container what to run when it starts:

# Make port 6000 available to the world outside this container
EXPOSE 6000

ENTRYPOINT [ "python3" ]

# Run main.py when the container launches
CMD [ "main.py" ]

We expose the port 6000 to the outside world, and we tell Docker to run the python3 command with main.py. The main.py file is a very simple file that register the web service’s path using Flask, and that makes the predictions (translations in this case):

from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def get_model(model_path):
    """Load a Hugging Face model and tokenizer from the specified directory"""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    return model, tokenizer

# Load the models and tokenizers for each supported language
en_fr_model, en_fr_tokenizer = get_model('models/en_fr/')
fr_en_model, fr_en_tokenizer = get_model('models/fr_en/')

app = Flask(__name__)

def is_translation_supported(from_lang, to_lang):
    """Check if the specified translation is supported"""
    supported_translations = ['en_fr', 'fr_en']
    return f'{from_lang}_{to_lang}' in supported_translations

@app.route('/translate/<from_lang>/<to_lang>/', methods=['POST'])
def translate_endpoint(from_lang, to_lang):
    """Translate text from one language to another. This function is 
    called when a POST request is sent to /translate/<from_lang>/<to_lang>/"""
    if not is_translation_supported(from_lang, to_lang):
        return jsonify({'error': 'Translation not supported'}), 400

    data = request.get_json()
    from_text = data.get(f'{from_lang}_text', '')

    if from_text:
        model = None
        tokenizer = None

        match from_lang:
            case 'en':        
                model = en_fr_model
                tokenizer = en_fr_tokenizer
            case 'fr':
                model = fr_en_model
                tokenizer = fr_en_tokenizer

        to_text = tokenizer.decode(model.generate(tokenizer.encode(from_text, return_tensors='pt')).squeeze(), skip_special_tokens=True)

        return jsonify({f'{to_lang}_text': to_text})
    else:
        return jsonify({'error': 'Text to translate not provided'}), 400
    
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=6000, debug=True)

Building the container

Now that the Dockerfile is completed, the next step is to use it to have Docker to build the actual image of the container. This is done using this command in the terminal:

docker build -t localbuild:en_fr_translation_service .

Note that we specified a tag to make it easier to manage it in between all the other images that may exists in the environment. The output of the terminal will show every step defined in the Dockerfile, and the processing for each of those step. The final output looks like:

Running and Querying the service

Now that we have a brand new image, the next step is to test it. In this section, I will use Docker Desktop’s user interface to show how we can easily do this, but all those step can easily be done (and automated) using the docker command line application.

After you built the image, it will automatically appear in the images section of Docker Desktop:

You can see the tag of the image, its size, when it was created, etc. To start the container from that image, we simply have to click the play arrow in the Actions column. That will start running a new container using that image.

Docker Desktop will enable you to add some more parameter to start the container with the following window:

The most important thing to define here is to Host port. If you leave it empty, then the port 6000 we exposed in the Docker file will become unbound and we won’t be able to reach the service running in the container.

Once you click the Run button, the container will appear in the Containers section:

And if you click on it’s name’s link, you will have access to the internal of the container (the files it contains, the execution logs, etc.:

Now that the container is running, we can query the endpoint like this:

curl http://localhost:6000/translate/en/fr/ POST -H "Content-Type: application/json" -v -d '{"en_text": "Towards Certification of Machine Learning-Based Distributed Systems Behavior"}'

It returns:

{
  "fr_text": "Vers la certification des syst\u00e8mes distribu\u00e9s fond\u00e9s sur l'apprentissage automatique"
}

And then for the French to English translation:

curl http://localhost:6000/translate/fr/en/ POST -H "Content-Type: application/json" -v -d '{"fr_text": "Ce qu'\''il y a d'\''admirable dans le bonheur des autres, c'\''est qu'\''on y croit."}'

It returns:

{
  "en_text": "What is admirable in the happiness of others is that one believes in it."
}

Conclusion

As we can see, it is pretty straightforward to create simple Docker containers that turns pretty much any Hugging Face pre-trained models into a web service endpoint.

Introducing ReadNext: A Personal Papers Recommender

Every day, approximately 500 new papers are published in the cs category on arXiv, with tens of new papers in cs.AI alone. Amidst the recent craze around Generative AI, I found it increasingly challenging to keep up with the rapid influx of papers. Distilling the ones that were most relevant to my work and my employer’s interests became a daunting task.

ReadNext is born out of my need to have a command-line tool that gets the most recent papers from arXiv, and feed the most relevants ones to my current interests into Zotero.

The key focus is to recommend papers that align with my evolving interests and research objectives, which may change on a daily basis and need to be continuously accounted for.

Why ReadNext?

  • Command-line Tool: ReadNext can be executed directly or scheduled as a cron job for easy access.
  • ReadNext fetches the latest papers from arXiv, ensuring you’re informed about your current interests
  • ReadNext integrates with Zotero, allowing you to manage your research library and organize recommended papers.
  • The core focus of ReadNext is to provide personalized paper recommendations based on your research interests, directly in your personal papers management tool.

How to Install

Getting started with ReadNext is simple. Install it using pip:

pip install readnext

Requirements

ReadNext relies on two fundamental external services to enhance its functionality:

  • Zotero: Zotero serves as the primary papers management tool, playing a pivotal role in ReadNext’s workflow. To configure ReadNext on your local computer, you have to create a Zotero account. If you do not already have one, you will have to create one for yourself, please refer to the section below.
  • Cohere: ReadNext leverages Cohere’s services for generating paper embeddings and summaries. These embeddings and summaries are essential components for providing personalized and relevant paper recommendations. It is necessary to create an account with Cohere. We will be expending support for additional embeddings and summarization services in the future, offering increased flexibility.

By integrating these services, ReadNext helps in discovering papers that align with your research interests and focus.

Read more about how to properly configure ReadNext here.

How Does ReadNext Work?

  1. As a Zotero user, I will create one or multiple “Focus” collections in my Zotero library. Those are the collections where I will add the papers that are the most interesting to my current research. It is expected that the content of those collections will change over time as my research focus and interests evolves.
  2. On a daily basis, I will run readnext in my terminal, or I will create a cron job to run it automatically for me.
    1. ReadNext will fetch the latest papers from arXiv
    2. ReadNext will identify the papers that are relevant to your research focus, as defined in Zotero
    3. ReadNext will propose the relevant papers to me and add them to Zotero in a dedicated collection where proposed papers are saved
  3. I will go in Zotero, start to read the proposed papers, and if any are of a particular interest I will add them to one of the “Focus” collections
  4. ReadNext will learn from your feedback to improve the quality of the proposed papers

How to Use ReadNext?

Using ReadNext is easy. Here are the main commands you’ll use:

Help

To get contextual help for any command, run:

readnext --help 
readnext personalized-papers --help

Get New Paper Proposals

The following command will propose 3 papers from the cs.AI caterory, based on the Readnext-Focus-LLMcollection in my Zotero library, save them in Zotero in the Readnext-Propositions-LLM with all related artifacts:

readnext personalized-papers cs.AI Readnext-Focus-LLM --proposals-collection=Readnext-Propositions-LLM --with-artifacts --nb-proposals=3

Full documentation of how to use the command line tool is available here.

Future Work and Contributions

Future work includes adding an abstraction layer for multiple embedding services, expanding paper sources, enhancing test coverage, providing interactive configuration, and refining the paper selection process.

Contributions to ReadNext are welcome! Follow the steps outlined in the README file of the project to contribute.

After More Than 10 years In Business

I delayed this blog post for far too long.

Almost exactly one year ago I had to take a heartbreaking decision for myself and for my long term business partner and friend Mike Bergman. I had to stop working on our business projects, Structured Dynamics and Cognonto such that can restart bringing incoming for my family.

For more than ten years Mike and I had the good fortune to be able to spend all our time working together on all kind of interesting projects and doing really mind challenging research into the field of the Semantic Web, and more recently, its applications to the field of the Artificial Intelligence.

Our last business project was a company called Cognonto, and more precisely its huge knowledge graph called KBpedia. This project, just like anything we did prior to it, was everything about research and development, trying to push ideas, concepts and principles to the market. We put all our thoughts, energy, time and [monetary] resources in pursuing our goals.

The problem is that we have never been able to monetize this new endeavor unlike the other projects we created in the previous decade.

After more than a year and a new baby boy, after spending all the resources I had available for the project I had to take a decision for me and my growing family.  I had to seek for a new job.

That is why I have been silent for so long. I had to reorganize my time after being self employed for about fifteen years and dealing with two young boys.

However, I had the good fortune to be contacted by Curbside about at the same time that I took that decision to seek for a new job. I now lead  the creation, design and development of their internal machine learning environment.

Good fortune being what it is, I really do enjoy my new work, the importance of it, the design and research I am putting in it and the wonderful team that I am part of.

We dissolved the Structure Dynamics company a few months ago. However not everything is done. Mike is working on a really personal and important project related to KBpedia for which we have important announcements to do in the coming months.

Lastly, I would like to take the time to say thank you to Mike Bergman for all the time we spend together working on those wonderful semantic web and knowledge base representation projects together. All those daily chats and calls we had discussing and arguing  to advance our ideas. And for all those things you taught me about research methodology, business and life. I owe you much my friend!

A Machine Learning Workflow

I am giving a talk (in French) at the 85th edition of the ACFAS congress, May 9. I will discuss the engineering aspects of doing machine learning. But more importantly, I will discuss how Semantic Web techniques, technologies and specifications can help solving the engineering problems and how they can be leveraged and integrated in a machine learning workflow.

The focus of my talk is based on my work in the field of the semantic web in the last 15 years and my more recent work creating the KBpedia Knowledge Graph at Cognonto and how they influenced our work to develop different machine learning solutions to integrate data, to extend knowledge structure, to tag and disambiguate concepts and entities in corpuses of texts, etc.

One thing we experienced is that most of the work involved in such project is not directly related to machine learning problems (or at least related to the usage of machine learning algorithms). And then I recently read a survey conducted by CrowdFlower in 2016 that support what we experienced. They surveyed about 80 data scientists to probe them to find out “where they feel their profession is going, [and] what their day-to-day job is like” To the question: “What data scientists spend the most time doing”, they answered:

Continue reading “A Machine Learning Workflow”

KBpedia Knowledge Graph 1.40: Extended Using Machine Learning

I am proud to announce the immediate release of the KBpedia Knowledge Graph version 1.40. This new version of the knowledge graph includes 53,739 concepts which is 14,687 more than with the previous version. It also includes 251,848 new alternative labels for 20,538 previously existing concepts in the version 1.20, and 542 new definitions.

This new version of KBpedia will have an impact on multiple different knowledge graph related tasks such as concepts and entities tagging and most of the existing Cognonto use cases. I will be discussing these updates and their effects on the use cases in a forthcoming series of blog posts.

But the key topic of this current blog post is this: How have we been able to increase the coverage of the KBpedia Knowledge Graph by 37.6% while keeping it consistent (that is, there are no contradictory facts) and satisfiable (that is, checks to see if the candidate addition violates any existing class disjointness assertions), all within roughly a single month of FTE effort?

Continue reading “KBpedia Knowledge Graph 1.40: Extended Using Machine Learning”