
Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper

There are many situations where we want to link named entities across two different datasets, or to find and remove duplicate entities within a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we also want to use such a linkage system to save time when creating gold standards for named entity recognition tasks.

There exist multiple data linkage & deduplication frameworks developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.

Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. Then they have complex configuration options that let the user define all kinds of comparators that will try to match the values of different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) Then the comparison is made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
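
To make that general pattern concrete, here is a minimal, hypothetical sketch in Clojure of how per-property comparators can be combined into a single entity-similarity score with a threshold. The comparators, weights and threshold below are illustrative assumptions only, not the internals of any particular framework.

(require '[clojure.string :as string]
         '[clojure.set :as set])

;; Each comparator returns a similarity between 0.0 and 1.0 for one property.
(defn pref-label-comparator
  [a b]
  (if (= (string/lower-case (:pref-label a))
         (string/lower-case (:pref-label b)))
    1.0
    0.0))

(defn alt-label-comparator
  [a b]
  (if (seq (set/intersection (set (:alt-labels a)) (set (:alt-labels b))))
    1.0
    0.0))

;; Weighted combination of the comparator scores; the weights are illustrative.
(def comparators
  [[pref-label-comparator 0.7]
   [alt-label-comparator  0.3]])

(defn similarity
  [a b]
  (reduce + (map (fn [[comparator weight]] (* weight (comparator a b))) comparators)))

;; Two records are considered the same entity when the score crosses a threshold.
(defn same-entity?
  [a b]
  (>= (similarity a b) 0.85))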

The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to help disambiguate two given entities (or classes or vocabulary terms) based on the analysis of their types within the KBpedia Knowledge Ontology. When we perform a deduplication or a linkage task between two large datasets of entities, it is often the case that two entities will be considered a nearly perfect match based on properties like names, alternative names and other common properties, even if they are two completely different things. This happens because entities are often ambiguous when only these basic properties are considered. The SuperType Comparator’s role is to disambiguate the entities based on their type(s) by leveraging the disjointedness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage tasks in order to determine the final mappings.

We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.

Usages Of The Cognonto Mapper

When should the Cognonto Mapper, or other deduplication and mapping services, be used? While there are many tasks that warrant the usage of such a system, let’s focus for now on some use cases related to Cognonto and machine learning in general.

Mapping Across Schema

One of Cognonto’s most important use cases is to use the Mapper to link new vocabularies, schemas or ontologies to the KBpedia Knowledge Ontology (KKO). This is exactly what we did for the 24 external ontologies and schemas that we have integrated into KBpedia. Creating such a mapping can be a long and painstaking process. The Mapper greatly helps link similar concepts by narrowing the initial pool of candidate mappings, thereby increasing the efficiency of the analyst charged with selecting the final mappings between the two ontologies.

Creating ‘Gold Standards’

In my last article, I created a gold standard of 511 random web pages for which I determined the publisher of each page by hand. That gold standard was used to measure the performance of a named entity recognition task. However, to create the actual gold standard, I had to check in each dataset (5 of them, with millions of entities) whether that publisher already existed. Performing such a task by hand means sending at least 2,555 search queries to try to find a matching entity. Even if I am fast, and can write a query, send it, look at the results, and copy/paste the URI of the right entity into the gold standard within 30 seconds, completing the task would still take roughly 21 hours. Since no sane person can do that for 8 hours per day, 3 days straight, the task would probably take at least a week to complete.

This is why automating this mapping process is really important, and this is what the Cognonto Mapper does. The only thing needed is to configure 5 mapper sessions, one per dataset. Each session tries to map the entities I identified by hand from the 511 web pages to one of the other datasets. Then I only need to run the mapper for each dataset, review the matches, find the missing ones by hand, and merge the results into the final gold standard.
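
As a rough illustration of how those sessions could be driven programmatically, here is a hypothetical sketch; `run-mapper-session`, the dataset names and the data shapes are all assumptions, not the actual Cognonto Mapper API.

;; Hypothetical sketch: run one mapper session per target dataset and collect
;; the candidate mappings for manual review and merging.
(def target-datasets ["dataset-1" "dataset-2" "dataset-3" "dataset-4" "dataset-5"])

(defn run-all-sessions
  "Maps the hand-identified publisher entities against each dataset.
   `run-mapper-session` is supplied by the caller: (fn [publishers dataset] -> mappings)."
  [publishers run-mapper-session]
  (into {}
        (for [dataset target-datasets]
          [dataset (run-mapper-session publishers dataset)])))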

Curating Unknown Entities

In Cognonto, we have an unknown entities tagger that detects possible publisher organizations that do not currently exist in the KBpedia knowledge base. In some cases, we want to save these detected unknown entities in an unknown entities dataset. This dataset is then reviewed so that the detected entities can be added to the KBpedia knowledge base (such that they become known). In the review workflow, one of the steps should be to search for similar entities, to make sure that what the tagger detected is a totally new entity and not just a new surface form of an existing one (in which case it would become an alternative label for that entity rather than an entirely new entity). Such a checkup in the review workflow would be performed by the Cognonto Mapper.

How Does the SuperType Comparator Work?

As I mentioned in the introduction, the Cognonto Mapper is yet another linkage & deduplication framework. However, it has a special twist: its SuperType Comparator and its leveraging of the KBpedia Knowledge Ontology. Good, but how does it work? There is no better way to understand it than to study how two entities can be disambiguated based on their type. So, let’s do this.

Let’s consider this use case. We want to map two datasets together: Wikipedia and Musicbrainz. One of the Musicbrainz entities we want to map to Wikipedia is a music group called Attila, with Billy Joel and Jon Small. Attila also exists in Wikipedia, but it is highly ambiguous and may refer to multiple different things. If we set up our linkage task to only work on the preferred and possible alternative labels, then we would have a match between the name of that band and multiple other things in Wikipedia, with matching likelihoods that are probably nearly identical. How could we update the configuration to solve this issue? We have no choice: we will have to use the Cognonto Mapper’s SuperType Comparator.

Musicbrainz RDF dumps normally map a Musicbrainz group to a mo:MusicGroup. In the Wikipedia RDF dump the Attila rock band has a type dbo:Band. Both of these classes are linked to the KBpedia reference concept kbpedia:Band-MusicGroup. This means that the entities of both of these datasets are well connected into KBpedia.

Let’s say that the Cognonto Mapper detects that the Attila entity in the Musicbrainz dataset has 4 candidates in Wikipedia:

  1. Attila, the rock band
  2. Attila, the bird
  3. Attila, the film
  4. Attila, the album

If the comparison is only based on the preferred label, the likelihood will be the same for all these entities. However, what happens when we start using the SuperType Comparator and the KBpedia Knowledge Ontology?

First we have to understand the context of each type. Using KBpedia, we can determine that rock bands are disjoint from birds, albums and films according to their respective SuperTypes: kko:Organizations versus kko:Animals, kko:AudioInfo and kko:VisualInfo.
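
As an illustration of the kind of check involved, here is a hypothetical sketch of a disjointness test over SuperTypes; the `super-types-of` lookup and the `disjoint-pairs` set are stand-ins for what would actually be queried from the KBpedia Knowledge Ontology.

;; Stand-in data for this example: which SuperTypes are declared disjoint, and
;; the SuperTypes of the classes involved. Note that kko:AudioInfo and
;; kko:VisualInfo are not declared disjoint, which matters in the second example below.
(def disjoint-pairs
  #{#{:kko/Organizations :kko/Animals}
    #{:kko/Organizations :kko/AudioInfo}
    #{:kko/Organizations :kko/VisualInfo}
    #{:kko/Animals :kko/AudioInfo}
    #{:kko/Animals :kko/VisualInfo}})

(def super-types-of
  {:dbo/Band  #{:kko/Organizations}
   :dbo/Bird  #{:kko/Animals}
   :dbo/Album #{:kko/AudioInfo}
   :dbo/Film  #{:kko/VisualInfo}})

(defn disjoint-types?
  "True when at least one pair of SuperTypes of the two classes is declared disjoint."
  [class-a class-b]
  (boolean
    (some disjoint-pairs
          (for [a (super-types-of class-a)
                b (super-types-of class-b)]
            #{a b}))))

;; (disjoint-types? :dbo/Band :dbo/Bird)  ;; => true
;; (disjoint-types? :dbo/Band :dbo/Album) ;; => true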

Now that we understand each of the entities the system is trying to link together, and their context within the KBpedia Knowledge Ontology, let’s see how the Cognonto Mapper will score each of these entities based on their type to help disambiguate where labels are identical.

 "mo:MusicGroup -> dbo:Band"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Bird"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Film"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Album"
 (.compare stc-ex-compare "" ""))
| Classes                    | Similarity |
| mo:MusicGroup -> dbo:Band  | 1.0        |
| mo:MusicGroup -> dbo:Bird  | 0.2        |
| mo:MusicGroup -> dbo:Film  | 0.2        |
| mo:MusicGroup -> dbo:Album | 0.2        |

In these cases, the SuperType Comparator assigned a similarity of 1.0 to the mo:MusicGroup and dbo:Band classes since those two classes are equivalent. All the other checks return 0.20. When the comparator finds two entities that have disjoint SuperTypes, it assigns them a similarity value of 0.20. Why not 0.00 if they are disjoint? Well, there may be errors in the knowledge base, so by setting the comparator score to a very low level, the candidate is still available for evaluation, even though its score is much reduced.
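
Here is a small sketch of the scoring rules just described; `equivalent-classes?`, `disjoint-types?` and `structural-similarity` are hypothetical helpers standing in for the Mapper's actual internals.

(defn supertype-score
  "Scores two classes following the rules described above."
  [class-a class-b {:keys [equivalent-classes? disjoint-types? structural-similarity]}]
  (cond
    ;; equivalent classes, e.g. mo:MusicGroup and dbo:Band -> perfect match
    (equivalent-classes? class-a class-b) 1.0
    ;; disjoint SuperTypes -> 0.20 floor instead of 0.0, so the candidate can
    ;; still be evaluated in case the knowledge base contains errors
    (disjoint-types? class-a class-b) 0.2
    ;; otherwise fall back to comparing their sets of super classes (next example)
    :else (structural-similarity class-a class-b)))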

In this case the matching is unambiguous and the selection of the right linkage to perform is obvious. However, as you will see below, it is often not that simple to make such a clear selection.


Now let’s say that the next entity to match from the Musicbrainz dataset is another entity called Attila, but this time it refers to Attila, the album by Mina. Since the basis of the comparison has changed (we are comparing the Musicbrainz Attila album instead of the band), the entire process will yield different results. The main difference is that the album will be compared to a film and an album from the Wikipedia dataset. As the similarity scores below show, these two candidates belong to the SuperTypes kko:VisualInfo and kko:AudioInfo, which are not disjoint.

 "mo:MusicalWork -> dbo:Band"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Bird"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Film"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Album"
 (.compare stc-ex-compare "" ""))
| Classes                     | Similarity         |
| mo:MusicalWork -> dbo:Band  | 0.2                |
| mo:MusicalWork -> dbo:Bird  | 0.2                |
| mo:MusicalWork -> dbo:Film  | 0.8762886597938144 |
| mo:MusicalWork -> dbo:Album | 0.9555555555555556 |

As you can see, the main difference is that we don’t have a perfect match between the entities; we thus need to compare their types, and two of the candidates are ambiguous based on their SuperTypes (their SuperTypes are non-disjoint). In this case, the SuperType Comparator checks the set of super classes of both entities and calculates a similarity measure between the two sets of classes. This is why we get 0.8762 for one and 0.9555 for the other.
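
The exact similarity measure the Mapper uses over these sets is not shown here; as a hedged illustration, a simple Jaccard index over the two sets of super classes behaves the way described (more shared super classes yields a score closer to 1.0):

(require '[clojure.set :as set])

(defn superclass-set-similarity
  "Illustrative set-based similarity (Jaccard index) between the sets of
   super classes of two classes; not necessarily the measure Cognonto uses."
  [supers-a supers-b]
  (let [inter (count (set/intersection supers-a supers-b))
        union (count (set/union supers-a supers-b))]
    (if (zero? union)
      0.0
      (double (/ inter union)))))

;; e.g. two classes sharing 3 of 4 super classes overall:
;; (superclass-set-similarity #{:Work :AudioInfo :CreativeWork}
;;                            #{:Work :AudioInfo :CreativeWork :Album})
;; => 0.75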

A musical work and an album are two nearly identical concepts. In fact, a musical work is the conceptual work behind an album (a record). A musical work is also strongly related to films, since films include musical works. However, the relationship between a musical work and an album is stronger than with a film, and this is what the similarity measure shows.


In this case, even though we have two ambiguous candidates, an album and a film, whose SuperTypes are not disjoint, we are still able to determine which one to choose to create the mapping based on the calculated similarity measure.


As we saw, there are multiple reasons why we would want to leverage the KBpedia Knowledge Ontology to help mapping and deduplication frameworks such as the Cognonto Mapper disambiguate possible entity matches. KBpedia is not only good for mapping datasets together; it is also quite effective for helping with some machine learning tasks such as creating gold standards or curating detected unknown entities. In the context of Cognonto, it is quite effective for mapping external ontologies, schemas or vocabularies to the KBpedia Knowledge Ontology. It is an essential tool for extending KBpedia to domain- and enterprise-specific needs.

In this article I focused on the SuperType Comparator, which leverages the type structure of the KBpedia Knowledge Ontology. However, we can also use other structural features in KBpedia (such as an Aspects Comparator based on the aspects structure of KBpedia), singly or in combination, to achieve other mapping or disambiguation objectives.


Improving Machine Learning Tasks By Integrating Private Datasets

In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. Invaluable public datasets like Wikipedia, Wikidata, Open Corporates and others have been created and leveraged by organizations world-wide. However, as great as they are, most organizations still rely on private datasets of their own curated data.

In this article, I want to demonstrate how high-value private datasets may be integrated into Cognonto’s KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo supported by a “gold standard” of 511 web pages taken at random, for which we have tagged the organization that published each page. This demo is related to the publisher analysis portion of the Cognonto demo. We will use this gold standard to calculate the performance metrics of the publisher analyzer; more precisely, we will analyze the performance of the analyzer depending on the datasets it has access to when making its predictions.

Cognonto Publisher’s Analyzer

The Cognonto publisher’s analyzer is a portion of the overall Cognonto demo that tries to determine the publisher of a web page by analyzing the web page’s content. There are multiple moving parts to this analyzer, but its general internal workflow is as follows:

  1. It crawls a given webpage URL
  2. It extracts the page’s content and extracts its meta-data
  3. It tags all of the organizations (anything that is considered an organization in KBpedia) across the extracted content using the organization entities that exist in the knowledge base
  4. It tries to detect unknown entities that will eventually be added to the knowledge base after curation
  5. It performs an in-depth analysis of the organization entities (known or unknown) that got tagged in the content of the web page, and analyzes which of these is the most likely to be the publisher of the web page.

Such a machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the Cognonto analyzer is its knowledge base. We leverage Cognonto to detect known organization entities. We use the knowledge in the KB for each of these entities to improve the analysis process. We constrain the analysis to certain types (by inference) of named entities, etc. The special sauce of this entire process is the fully integrated set of datasets that compose the Cognonto knowledge base, and the KBpedia conceptual reference structure composed of roughly ~39,000 reference concepts.
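
To tie the five steps above together, here is a hypothetical sketch of the analyzer's workflow; every helper (`crawl-page`, `extract-content`, `tag-organizations`, `detect-unknown-entities`, `most-likely-publisher`) is an illustrative stand-in rather than the actual Cognonto API.

(defn analyze-publisher
  "Runs the five-step workflow described above against a single URL.
   The helper functions are supplied by the caller."
  [url {:keys [crawl-page extract-content tag-organizations
               detect-unknown-entities most-likely-publisher]}]
  (let [html       (crawl-page url)                        ;; 1. crawl the web page
        content    (extract-content html)                  ;; 2. extract content + meta-data
        known      (tag-organizations content)             ;; 3. tag known organizations
        unknown    (detect-unknown-entities content known) ;; 4. detect unknown entities
        candidates (concat known unknown)]
    ;; 5. analyze which tagged organization is most likely the publisher
    (most-likely-publisher content candidates)))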

Given the central role of the knowledge base in such an analysis process, we want to have a better idea of the impact of the datasets in the performance of such a system.

For this demo, I use three public datasets already in KBpedia and used by the Cognonto demo: Wikipedia (via DBpedia), Freebase and USPTO. Then I add two private datasets of high-quality, highly curated and domain-related information to augment the listing of potential organizations. What I will do is run the Cognonto publisher analyzer on each of these 511 web pages, check which pages get properly identified against the gold standard, and finally calculate different performance metrics to see the impact of including or excluding a certain dataset.

Gold Standard

The gold standard is composed of 511 randomly selected web pages that were crawled and cached. When we run the tests below, the cached version of the HTML pages is used to make sure that we get the same HTML for each page in each test. When the pages are crawled, we execute any JavaScript code that the pages may contain before caching the HTML code of the page. That way, if some information in the page was injected by JavaScript code, that additional information is cached as well.

The gold standard is really simple. For each of the URLs in the standard, we determine the publishing organization manually. Once the organization is determined, we search each dataset to see if the entity already exists. If it does, we add the URI (unique identifier) of the entity in the knowledge base to the gold standard. It is this URI reference that is used to determine whether the publisher analyzer properly detects the actual publisher of the web page.

We also manually add a set of 10 web pages for which we are sure that no publisher can be determined. These are the 10 True Negative (see below) instances of the gold standard.

The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.


The goal of this analysis is to determine how well the analyzer performs the task (detecting the organization that published a web page on the Web). To do so, we use a set of metrics that help us understand the performance of the system. The metrics calculation is based on the confusion matrix.


The True Positive, False Positive, True Negative and False Negative counts (see Type I and type II errors for definitions) should be interpreted this way in the context of a named entity recognition task:

  1. True Positive (TP): test identifies the same entity as in the gold standard
  2. False Positive (FP): test identifies a different entity than what is in the gold standard
  3. True Negative (TN): test identifies no entity; gold standard has no entity
  4. False Negative (FN): test identifies no entity, but gold standard has one

Then we have a series of metrics that can be used to measure the performance of the system:

  1. Precision: the proportion of correct publisher predictions amongst all of the predictions that have been made (TP / (TP + FP))
  2. Recall: the proportion of the publishers that exist in the gold standard that were properly predicted (TP / (TP + FN))
  3. Accuracy: the proportion of correctly classified test instances, i.e., the publishers correctly identified by the system plus the web pages correctly identified as having no identifiable publisher ((TP + TN) / (TP + TN + FP + FN))
  4. f1: the test’s equally weighted combination of precision and recall
  5. f2: the test’s weighted combination of precision and recall, with a preference for recall
  6. f0.5: the test’s weighted combination of precision and recall, with a preference for precision.

The F-scores test the accuracy of the overall prediction system. An F-score combines precision and recall; f1 is their harmonic mean. The f2 measure weighs recall higher than precision (by placing more emphasis on false negatives), and the f0.5 measure weighs recall lower than precision (by attenuating the influence of false negatives). Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall.
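
As a minimal sketch of how these numbers are derived from the four confusion-matrix counts (using the standard formulas rather than the actual Cognonto reporting code):

(defn f-beta
  "Generic F-measure: beta > 1 favours recall, beta < 1 favours precision."
  [beta precision recall]
  (let [b2 (* beta beta)]
    (/ (* (+ 1 b2) precision recall)
       (+ (* b2 precision) recall))))

(defn metrics
  [{:keys [true-pos false-pos true-neg false-neg]}]
  (let [precision (/ true-pos (+ true-pos false-pos))
        recall    (/ true-pos (+ true-pos false-neg))
        accuracy  (/ (+ true-pos true-neg)
                     (+ true-pos true-neg false-pos false-neg))]
    {:precision (double precision)
     :recall    (double recall)
     :accuracy  (double accuracy)
     :f1        (double (f-beta 1   precision recall))
     :f2        (double (f-beta 2   precision recall))
     :f0.5      (double (f-beta 0.5 precision recall))}))

;; e.g. with the Wikipedia-only counts reported below:
;; (metrics {:true-pos 121 :false-pos 57 :true-neg 19 :false-neg 314})
;; => {:precision 0.679..., :recall 0.278..., :accuracy 0.273..., ...}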

In general, I think that the metric that best reflects the overall performance of this named entity recognition system is accuracy. I emphasize those test results below.

Running The Tests

The goal with these tests is to run the gold standard calculation procedure with different datasets that exist in the Cognonto knowledge base to see the impact of including/excluding these datasets on the gold standard metrics.

Baseline: No Dataset

The first step is to create the starting basis that includes no dataset. Then we will add different datasets, and try different combinations, when computing against the gold standard such that we know the impact of each on the metrics.

(table (generate-stats :js :execute :datasets []))
True positives:  2
False positives:  5
True negatives:  19
False negatives:  485

| key          | value        |
| :precision   | 0.2857143    |
| :recall      | 0.0041067763 |
| :accuracy    | 0.04109589   |
| :f1          | 0.008097166  |
| :f2          | 0.0051150895 |
| :f0.5        | 0.019417476  |

One Dataset Only

Now, let’s see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indication of the inherent impact of each dataset on the prediction task.

Wikipedia (via DBpedia) Only

Let’s test the impact of adding a single general purpose, publicly available dataset: Wikipedia (via DBpedia):

(table (generate-stats :js :execute :datasets [""]))
True positives:  121
False positives:  57
True negatives:  19
False negatives:  314

| key          | value      |
| :precision   | 0.6797753  |
| :recall      | 0.27816093 |
| :accuracy    | 0.2739726  |
| :f1          | 0.39477977 |
| :f2          | 0.31543276 |
| :f0.5        | 0.52746296 |

Freebase Only

Now, let’s test the impact of adding another general purpose, publicly available dataset: Freebase:

(table (generate-stats :js :execute :datasets [""]))
True positives:  11
False positives:  14
True negatives:  19
False negatives:  467

| key          | value       |
| :precision   | 0.44        |
| :recall      | 0.023012552 |
| :accuracy    | 0.058708414 |
| :f1          | 0.043737575 |
| :f2          | 0.028394425 |
| :f0.5        | 0.09515571  |


USPTO Only

Now, let’s test the impact of adding a different, more specialized, publicly available dataset: USPTO:

(table (generate-stats :js :execute :datasets [""]))
True positives:  6
False positives:  13
True negatives:  19
False negatives:  473

| key          | value       |
| :precision   | 0.31578946  |
| :recall      | 0.012526096 |
| :accuracy    | 0.04892368  |
| :f1          | 0.024096385 |
| :f2          | 0.015503876 |
| :f0.5        | 0.054054055 |

Private Dataset #1

Now, let’s test the first private dataset:

(table (generate-stats :js :execute :datasets [""]))
True positives:  231
False positives:  109
True negatives:  19
False negatives:  152

| key          | value      |
| :precision   | 0.67941177 |
| :recall      | 0.60313314 |
| :accuracy    | 0.4892368  |
| :f1          | 0.6390042  |
| :f2          | 0.61698717 |
| :f0.5        | 0.6626506  |

Private Dataset #2

And, then, the second private dataset:

(table (generate-stats :js :execute :datasets [""]))
True positives:  24
False positives:  21
True negatives:  19
False negatives:  447

| key          | value       |
| :precision   | 0.53333336  |
| :recall      | 0.050955415 |
| :accuracy    | 0.08414873  |
| :f1          | 0.093023255 |
| :f2          | 0.0622084   |
| :f0.5        | 0.1843318   |

Combined Datasets – Public Only

A more realistic analysis is to use a combination of datasets. Let’s see what happens to the performance metrics if we start combining public datasets.

Wikipedia + Freebase

First, let’s start by combining Wikipedia and Freebase.

(table (generate-stats :js :execute :datasets ["" ""]))
True positives:  126
False positives:  60
True negatives:  19
False negatives:  306

| key          | value      |
| :precision   | 0.67741936 |
| :recall      | 0.29166666 |
| :accuracy    | 0.28375733 |
| :f1          | 0.407767   |
| :f2          | 0.3291536  |
| :f0.5        | 0.53571427 |

Adding the Freebase dataset to the DBpedia one had the following effects on the different metrics:

| metric    | Impact in % |
| precision | -0.03%      |
| recall    | +4.85%      |
| accuracy  | +3.57%      |
| f1        | +3.29%      |
| f2        | +4.34%      |
| f0.5      | +1.57%      |

As we can see, the impact of adding Freebase to the knowledge base is positive, even if not groundbreaking considering the size of the dataset.
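
For reference, the "Impact in %" rows in these tables are simply the relative change of each metric between the two runs, which can be computed with something like:

(defn impact-in-percent
  "Relative change, in percent, of each metric between a baseline run and a new run."
  [baseline-metrics new-metrics]
  (into {}
        (for [[metric baseline-value] baseline-metrics]
          [metric (* 100.0 (/ (- (get new-metrics metric) baseline-value)
                              baseline-value))])))

;; e.g. recall going from 0.27816093 (Wikipedia only) to 0.29166666
;; (Wikipedia + Freebase) gives roughly the +4.85% reported above.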

Wikipedia + USPTO

Let’s switch Freebase for the other specialized public dataset, USPTO.

(table (generate-stats :js :execute :datasets ["" ""]))
True positives:  122
False positives:  59
True negatives:  19
False negatives:  311

| key          | value      |
| :precision   | 0.67403316 |
| :recall      | 0.2817552  |
| :accuracy    | 0.27592954 |
| :f1          | 0.39739415 |
| :f2          | 0.31887087 |
| :f0.5        | 0.52722555 |

Adding the USPTO dataset to the DBpedia one, instead of Freebase, had the following effects on the different metrics:

| metric    | Impact in % |
| precision | -0.83%      |
| recall    | +1.29%      |
| accuracy  | +0.73%      |
| f1        | +0.65%      |
| f2        | +1.07%      |
| f0.5      | +0.03%      |

As we might have expected, the gains are smaller than with Freebase, perhaps partly because USPTO is smaller and more specialized. Because it is more specialized (enterprises that have patents registered in the US), the gold standard may not represent well the organizations covered by this dataset. In any case, these are still gains.

Wikipedia + Freebase + USPTO

Let’s continue and now include all three datasets.

(table (generate-stats :js :execute :datasets ["" "" ""]))
True positives:  127
False positives:  62
True negatives:  19
False negatives:  303

| key          | value      |
| :precision   | 0.6719577  |
| :recall      | 0.29534882 |
| :accuracy    | 0.2857143  |
| :f1          | 0.41033927 |
| :f2          | 0.3326349  |
| :f0.5        | 0.53541315 |

Now let’s see the impact of adding both Freebase and USPTO to the Wikipedia dataset:

| metric    | Impact in % |
| precision | +1.14%      |
| recall    | +6.18%      |
| accuracy  | +4.30%      |
| f1        | +3.95%      |
| f2        | +5.45%      |
| f0.5      | +1.51%      |

Now let’s see the impact of using highly curated, domain related, private datasets.

Combined Datasets – Public enhanced with private datasets

The next step is to add the private datasets of highly curated data that are specific to the domain of identifying web page publisher organizations. As the baseline, we will use the three public datasets: Wikipedia, Freebase and USPTO and then we will add the private datasets.

Wikipedia + Freebase + USPTO + PD #1

(table (generate-stats :js :execute :datasets ["" "" "" ""]))
True positives:  279
False positives:  102
True negatives:  19
False negatives:  111

| key          | value      |
| :precision   | 0.7322835  |
| :recall      | 0.7153846  |
| :accuracy    | 0.58317024 |
| :f1          | 0.7237354  |
| :f2          | 0.7187017  |
| :f0.5        | 0.7288401  |

Now, let’s see the impact of adding the private dataset #1 along with Wikipedia, Freebase and USPTO:

| metric    | Impact in % |
| precision | +8.97%      |
| recall    | +142.22%    |
| accuracy  | +104.09%    |
| f1        | +76.38%     |
| f2        | +116.08%    |
| f0.5      | +36.12%     |

Adding the highly curated and domain-specific private dataset #1 had a dramatic impact on all the metrics of the combined public datasets. Now let’s look at it the other way around: the impact of adding the public datasets, relative to using private dataset #1 alone:

| metric    | Impact in % |
| precision | +7.77%      |
| recall    | +18.60%     |
| accuracy  | +19.19%     |
| f1        | +13.25%     |
| f2        | +16.50%     |
| f0.5      | +9.99%      |

As we can see, the public datasets do significantly increase the performance of the highly curated and domain-specific private dataset #1.

Wikipedia + Freebase + USPTO + PD #2

(table (generate-stats :js :execute :datasets ["" "" "" ""]))
True positives:  138
False positives:  69
True negatives:  19
False negatives:  285

| key          | value      |
| :precision   | 0.6666667  |
| :recall      | 0.32624114 |
| :accuracy    | 0.3072407  |
| :f1          | 0.43809524 |
| :f2          | 0.36334914 |
| :f0.5        | 0.55155873 |

Not all of the private datasets have an equivalent impact. Let’s see the effect of adding private dataset #2 instead of #1:

| metric    | Impact in % |
| precision | -0.78%      |
| recall    | +10.46%     |
| accuracy  | +7.52%      |
| f1        | +6.75%      |
| f2        | +9.23%      |
| f0.5      | +3.00%      |

Wikipedia + Freebase + USPTO + PD #1 + PD #2

Now let’s see what happens when we use all the public and private datasets.

(table (generate-stats :js :execute :datasets ["" "" "" "" ""]))
True positives:  285
False positives:  102
True negatives:  19
False negatives:  105

| key          | value      |
| :precision   | 0.7364341  |
| :recall      | 0.7307692  |
| :accuracy    | 0.59491193 |
| :f1          | 0.7335907  |
| :f2          | 0.7318952  |
| :f0.5        | 0.7352941  |

Let’s see the impact of adding the private datasets #1 and #2 to the public datasets:

| metric    | Impact in % |
| precision | +9.60%      |
| recall    | +147.44%    |
| accuracy  | +108.22%    |
| f1        | +78.77%     |
| f2        | +120.02%    |
| f0.5      | +37.31%     |

Adding Unknown Entities Tagger

There is one last feature of the Cognonto publisher analyzer: it can identify unknown entities in the web page. (An “unknown entity” is identified as a likely organization entity, but one that does not already exist in the KB.) Sometimes, it is the unknown entity that is the publisher of the web page.

(table (generate-stats :js :execute :datasets :all))
True positives:  345
False positives:  104
True negatives:  19
False negatives:  43

| key          | value      |
| :precision   | 0.76837415 |
| :recall      | 0.88917524 |
| :accuracy    | 0.7123288  |
| :f1          | 0.82437277 |
| :f2          | 0.86206895 |
| :f0.5        | 0.78983516 |

As we can see, the overall accuracy improved by 19.73% when considering the unknown entities, compared to using the public and private datasets alone.

| metric    | Impact in % |
| precision | +4.33%      |
| recall    | +21.67%     |
| accuracy  | +19.73%     |
| f1        | +12.37%     |
| f2        | +17.79%     |
| f0.5      | +7.42%      |


When we first tested the system with single datasets, some of them scored better than others on most of the metrics. Does that mean we could use only those and be done with it? No. What this analysis tells us is that some datasets score better for this set of web pages because they cover more of the entities found in those pages. However, even if a dataset scores lower, it does not mean it is useless. That weaker dataset may cover a prediction area not covered by a better one, which means that by combining the two we can improve the general prediction power of the system. This is what we see when adding the private datasets to the public ones.

Even if the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still greatly benefits from the contribution of the public datasets, which significantly improve the accuracy of the system. We got a gain of 19.19% in accuracy by adding the public datasets to the better-scoring private dataset #1. Nearly 20% of improvement in such a predictive system is highly significant.

Another thing that this series of tests tends to demonstrate is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets doesn’t appear to lower the overall performance of the system (even if I am sure that some could), so generally the more the better (although more doesn’t necessarily produce significant accuracy increases).

Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improved the overall accuracy by another 19.73%. How is that possible? To understand this we have to consider the domain: random web pages that exist on the Web. A web page can be published by anybody and any organization, which means that the long tail of web page publishers is probably quite long. Considering this fact, it is normal that existing knowledge bases do not contain all of the obscure organizations that publish web pages. This is most likely why having a system that can detect and predict unknown entities as the publishers of web pages has a significant impact on the overall accuracy of the system. The flagging of such “unknown” entities also tells us where to focus efforts to add to the known database of existing publishers.


As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others but, overall, each dataset contributes to the overall improvement of the predictions.


Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two-layer artificial neural network that processes text to learn the relationships between the words within a text corpus, producing a model of those relationships. The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus.

In this article I will show you how Cognonto‘s knowledge base can be used to automatically create highly accurate domain-specific training corpuses that word2vec can use to generate word relationship models. Note, however, that what is being discussed here is not only applicable to word2vec, but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).

It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a word’s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.

Training Corpus

A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec; the only thing it does is learn the relationships between the words that exist in the text. However, not all training corpuses are equal: they are often dirty, biased and ambiguous. Depending on the task at hand, that may be exactly what is required, but more often than not, these errors need to be fixed. Cognonto has the advantage of starting with clean text.

When we want to create a new training corpus, the first step is to find a source of text that could work to create that corpus. The second step is to select the text we want to add to it. The third step is to pre-process that corpus of text to perform different operations on the text, such as: removing HTML elements; removing punctuation; normalizing text; detecting named entities; etc. The final step is to train word2vec to generate the model.

word2vec is somewhat dumb. It only learns what exists in the training corpus; it does nothing other than “read” the text and analyze the relationships between the words (which are really just groups of characters separated by spaces). The word2vec process is highly subject to the Garbage In, Garbage Out principle, which means that if the training set is dirty, biased and ambiguous, then the learned relationships will end up being of little or no value.

Domain-specific Training Corpus

A domain-specific training corpus is a specialized training corpus where its text is related to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may contain text that discusses totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.

As we will see in this article, a domain-specific training corpus can be quite useful, and much more powerful than a general one, if the task at hand relates to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if you want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.

This is the kind of problem that Cognonto helps to resolve. KBpedia, Cognonto’s knowledge base, is a set of ~39,000 reference concepts that have ~138,000 links to the schema of external data sources such as Wikipedia, Wikidata and USPTO. It is this structure and these links to external data sources that we use to create domain-specific training corpuses on the fly. We leverage the reference concept structure to select the concepts that should be part of the domain being defined. Then we use Cognonto’s inference capabilities to infer all the other hundreds or thousands of concepts that define the full scope of the domain. Next we analyze the hundreds or thousands of concepts we selected that way to get all of the links to external data sources. Finally, we use these references to create the training corpus. All of this is done automatically once the initial few concepts that define the domain are selected. The workflow looks like this: select the seed reference concepts, infer the full set of domain concepts, collect their links to external sources, and aggregate the linked text into the corpus.


The Process

To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will use the Google News word2vec model created by Google, which was trained on about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. I will use the Google News model as the general model to compare the results/performance between a domain-specific and a general model.

Determining the Domain

The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained to create a training corpus that is not too large for demo purposes. The domain I have chosen is musicians. This domain is related to people and bands that play music. It is also related to musical genres, instruments, music industry, etc.

To create my domain, I select a single KBpedia reference concept: Musician. If I wanted to broaden the scope of the domain, I could have included other concepts such as: Music, Musical Group, Musical Instrument, etc.

Aggregating the Domain-specific Training Corpus

Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. The end result of this operation is to create a training corpus with text that is only related to the scope of the domain we defined.

(defn create-domain-specific-training-set
  [target-kbpedia-class corpus-file]
  (let [step 1000
        entities-dataset ""
        kbpedia-dataset ""
        nb-entities (get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset)]
    (loop [nb 0]
      (when (< nb nb-entities)
        ;; page through the entities of the domain, `step` entities at a time
        (doseq [entity (get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset :limit step :offset nb)]
          ;; append the text of each entity as one line of the corpus file
          (spit corpus-file (str (get-entity-content entity) "\n") :append true))
        (println (str (min (+ nb step) nb-entities) "/" nb-entities))
        (recur (+ nb step))))))

(create-domain-specific-training-set "" "resources/musicians-corpus.txt")

What this code does is to query the KBpedia knowledge base to get all the named entities that are linked to it, for the scope of the domain we defined. Then the text related to each entity is appended to a text file where each line is the text of a single entity.

Given the scope of the current demo, the musicians training corpus is composed of 47,263 documents. This is the crux of the demo. With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function leverages 8 years of hard work).

Normalizing Text

The next step is natural to any NLP pipeline. Before learning from the training corpus, we should clean and normalize the raw text.

(require '[clojure.string :as string]
         '[clojure.java.io :as io])

(defn normalize-proper-name
  [name]
  (-> name
      (string/replace #" " "_")
      (string/lower-case)))

(defn pre-process-line
  [line]
  (-> (let [line (-> line
                     ;; 1. remove all underscores
                     (string/replace "_" " "))]
        ;; 2. detect named entities and replace them with their underscore form, like: Fred Giasson -> fred_giasson
        (loop [entities (into [] (re-seq #"[\p{Lu}]([\p{Ll}]+|\.)(?:\s+[\p{Lu}]([\p{Ll}]+|\.))*(?:\s+[\p{Ll}][\p{Ll}\-]{1,3}){0,1}\s+[\p{Lu}]([\p{Ll}]+|\.)" line))
               line line]
          (if (empty? entities)
            line
            (let [entity (first (first entities))]
              (recur (rest entities)
                     (string/replace line entity (normalize-proper-name entity)))))))
      ;; 3. remove the stop words (`stop-list` is a regex pattern string defined elsewhere)
      (string/replace (re-pattern stop-list) " ")
      ;; 4. remove everything between brackets like: [1] [edit] [show]
      (string/replace #"\[.*\]" " ")
      ;; 5. punctuation characters except the dot and the single quote, replace by nothing: (),[]-={}/\~!?%$@&*+:;<>
      (string/replace #"[\^\(\)\,\[\]\=\{\}\/\\\~\!\?\%\$\@\&\*\+:\;\<\>\"\p{Pd}]" " ")
      ;; 6. remove all numbers
      (string/replace #"[0-9]" " ")
      ;; 7. remove all words with 2 characters or less
      (string/replace #"\b[\p{L}]{0,2}\b" " ")
      ;; 10. normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 11. normalize dots with spaces
      (string/replace #"\s\." ".")
      ;; 12. normalize dots
      (string/replace #"\.{1,}" ".")
      ;; 13. normalize underscores
      (string/replace #"\_{1,}" "_")
      ;; 14. remove standalone single quotes
      (string/replace " ' " " ")
      ;; 15. re-normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 16. put everything lowercase
      (string/lower-case)
      (str "\n")))

(defn pre-process-corpus
  [in-file out-file]
  (spit out-file "" :append true)
  (with-open [file (io/reader in-file)]
    (doseq [line (line-seq file)]
      (spit out-file (pre-process-line line) :append true))))

(pre-process-corpus "resources/musicians-corpus.txt" "resources/musicians-corpus.clean.txt")

We remove all of the characters that may cause issues for the tokenizer used by the word2vec implementation. We also remove unnecessary words and other words that appear too often or that add nothing to the model we want to generate (like the listing of days and months). We also drop all numbers.

Training word2vec

The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the DL4J (Deep Learning for Java) library that is a Java implementation of the word2vec algorithm. Training word2vec is as simple as using the DL4J API like this:

;; assumes the relevant DL4J classes are imported: LineSentenceIterator,
;; DefaultTokenizerFactory, Word2Vec$Builder and SerializationUtils,
;; plus clojure.java.io required as io
(defn train
  [training-set-file model-file]
  (let [sentence-iterator (new LineSentenceIterator (io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        vec (.. (new Word2Vec$Builder)
                (minWordFrequency 1)
                (windowSize 5)
                (layerSize 100)
                (iterate sentence-iterator)
                (tokenizerFactory tokenizer)
                (build))]
    (.fit vec)
    (SerializationUtils/saveObject vec (io/file model-file))
    vec))

(def musicians-model (train "resources/musicians-corpus.clean.txt" "resources/musicians-corpus.model"))

What is important to notice here is the number of parameters that can be defined to train word2vec on a corpus. In fact, that algorithm can be sensitive to parametrization.
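
As an illustration only, here is a variant of the `train` function above that exposes the main hyper-parameters, making it easier to experiment with this parametrization; the default values mirror the ones used above and the imports are the same as for `train`.

(defn train-with-params
  "Same as `train`, but with the main word2vec hyper-parameters exposed."
  [training-set-file model-file {:keys [min-word-frequency window-size layer-size]
                                 :or   {min-word-frequency 1 window-size 5 layer-size 100}}]
  (let [sentence-iterator (new LineSentenceIterator (io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        model (.. (new Word2Vec$Builder)
                  (minWordFrequency min-word-frequency)
                  (windowSize window-size)
                  (layerSize layer-size)
                  (iterate sentence-iterator)
                  (tokenizerFactory tokenizer)
                  (build))]
    (.fit model)
    (SerializationUtils/saveObject model (io/file model-file))
    model))

;; e.g. a model with a wider context window and larger vectors:
;; (train-with-params "resources/musicians-corpus.clean.txt"
;;                    "resources/musicians-corpus-w8.model"
;;                    {:window-size 8 :layer-size 200})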

Importing the General Model

The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose is the Google News model, which is composed of billions of words but is highly general. DL4J can import that model without us having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):

(defn import-google-news-model
  []
  (org.deeplearning4j.models.embeddings.loader.WordVectorSerializer/loadGoogleModel
    (io/file "GoogleNews-vectors-negative300.bin.gz") true))

(def google-news-model (import-google-news-model))

Playing With Models

Now that we have a domain-specific model related to musicians and a general model related to news processed by Google, let’s start playing with both to see how they perform on different tasks. In the following examples, we will always compare the domain-specific training corpus with the general one.

Ambiguous Words

A characteristic of words is that their surface form can be ambiguous; they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not have any shared meaning. But all of this depends on the context. If we are in a general context, then this situation will happen more often than we think and will impact the similarity score of these ambiguous terms. However, as we will see, this phenomenon is greatly diminished when we use domain-specific models.

Similarity Between Piano, Organ and Violin

What we want to check is the relationship between 3 different musical instruments: piano, organ and violin.

(.similarity musicians-model "piano" "violin")
(.similarity musicians-model "piano" "organ")

As we can see, both tuples have a high likelihood of co-occurrence. This suggests that the terms of each tuple are probably highly related. In this case, it is probably because violins are often played along with a piano, and because an organ looks like a piano (at least it has a keyboard).

Now let’s take a look at what the general model has to say about that:

(.similarity google-news-model "piano" "violin")
(.similarity google-news-model "piano" "organ")

The surprising fact here is the apparent dissimilarity between piano and organ compared with the results we got with the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results make sense. In fact, organ is an ambiguous word in a general context: an organ can be a musical instrument, but it can also be a part of an anatomy. This means that the word organ will co-occur beside piano, but also beside all kinds of other words related to human and animal biology. This is why the two words are less similar in the general model than in the domain-specific one.

Similarity Between Album and Track

Now let’s look at another similarity example between two other words, album and track, where track is an ambiguous word depending on the context.

(.similarity musicians-model "album" "track")
(.similarity google-news-model "album" "track")

As expected, because track is ambiguous, there is a big difference in terms of co-occurrence probabilities depending on the context (domain-specific or general).

Similarity Between Pianist and Violinist

However, do the domain-specific and general models always differ like this? Let’s take a look at two words that are domain-specific and unambiguous: pianist and violinist.

(.similarity musicians-model "pianist" "violinist")
(.similarity google-news-model "pianist" "violinist")

In this case, the similarity score between the two terms is almost the same. In both contexts (general and domain-specific), their co-occurrence is similar.

Nearest Words

Now let’s look at nearest words in these two distinct contexts. Let’s take a few words and see which other words occur most often with them.


(.wordsNearest musicians-model ["music"] [] 7)
music revol samoilovich bunin musical amalgamating assam. voice dance.
(.wordsNearest google-news-model ["music"] [] 8)
music classical music jazz Music Without Donny Kirshner songs musicians tunes

One observation we can make is that the terms from the musicians model are more general than the ones from the general model.


(.wordsNearest musicians-model ["track"] [] 10)
track released. album latest entitled released debut year. titled positive
(.wordsNearest google-news-model ["track"] [] 5)
track tracks Track racetrack horseshoe shaped section

As we know, track is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians’ domain-specific model. But in the general model, it is really going in all directions.


Now let’s take a look at a really general word: year

(.wordsNearest musicians-model ["year"] [] 11)
year ghantous. he was grammy naacap grammy award for best luces del alma year. grammy award grammy for best sitorai sol nominated
(.wordsNearest google-news-model ["year"] [] 10)
year month week months decade years summer year.The September weeks

This one is quite interesting too. Both groups of words make sense, but only in their respective contexts. With the musicians’ model, year is mostly related to awards (like the Grammy Awards 2016), categories like “song of the year”, etc.

In the context of the general model, year is really related to time concepts: months, seasons, etc.

Playing With Co-Occurrences Vectors

Finally, we will play with the co-occurrence vectors by manipulating them directly. A really popular word2vec equation is king - man + woman = queen. What is happening under the hood with this equation is that we are adding and subtracting the co-occurrence vectors of these words, and then checking the nearest words of the resulting co-occurrence vector.
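
That classic equation can be expressed with the same `wordsNearest` call used in the examples below, where the first vector holds the words to add and the second the words to subtract; with the Google News model, queen is expected to show up among the results.

;; king - man + woman: positive words first, negative words second, then the
;; number of nearest words to return.
(.wordsNearest google-news-model ["king" "woman"] ["man"] 5)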

Now, let’s take a look at a few of these equations.

Pianist + Renowned = ?

(.wordsNearest musicians-model ["pianist" "renowned"] [] 9)
pianist renowned teacher. composer. prolific virtuoso teacher leading educator.
(.wordsNearest google-news-model ["pianist" "renowned"] [] 7)
renowned pianist pianist composer jazz pianist classical pianists composer pianist virtuoso pianist

These kinds of operations are interesting. If we add the two co-occurrence vectors for pianist and renowned, then we get that a teacher, an educator, a composer or a virtuoso is a renowned pianist.

For unambiguous surface forms like pianist, the two models both score quite well. The difference between the two examples comes from the way the general training corpus was created (pre-processed) compared to the musicians corpus.

Metal + Death = ?

(.wordsNearest musicians-model ["metal" "death"] [] 10)
metal death thrash deathcore melodic doom grindcore metalcore mathcore heavy
(.wordsNearest google-news-model ["metal" "death"] [] 5)
death metal Tunstallbled steel Death

This example uses two quite general words with no apparent relationship between them. The results with the musicians’ model are all highly similar genres of music like thrash metal, deathcore, etc.

However with the general model, it is a mix of multiple unrelated concepts.

Metal – Death + Smooth = ?

Let’s play some more with these equations. What if we want some kind of smooth metal?

(.wordsNearest musicians-model ["metal" "smooth"] ["death"] 5)
smooth fusion funk hard neo

This one is quite interesting. We subtracted the death co-occurrence vector from the metal one, and then we added the smooth vector. What we end up with is a bunch of music genres that are much smoother than death metal.

(.wordsNearest google-news-model ["metal" "smooth"] ["death"] 5)
smooth metal Brushed aluminum durable polycarbonate chromed steel

In the case of the general model, we end up with “smooth metal”. The removal of the death vector has no effect on the results, probably because these are three ambiguous and really general terms.

What Is Next

The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what the other possibilities are. One possibility is to link your own private datasets to KBpedia. That way, these private datasets would become usable, in exactly the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text like local text documents, or semi-structured text like a set of HTML web pages, and tag each document with KBpedia reference concepts using the Cognonto topics analyzer. Then we could use the KBpedia structure in exactly the same way to choose which of these documents we want to include in a domain-specific training corpus.


As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results, and on how much more meaningful the results can be within the scope of that domain. Another advantage of domain-specific training corpuses is that they create much smaller models. This is quite an interesting characteristic since smaller models are faster to generate, faster to download/upload, faster to query, consume less memory, etc.

Of the concepts in KBpedia, roughly 33,000 of them correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.