{"id":3947,"date":"2024-05-17T08:51:54","date_gmt":"2024-05-17T13:51:54","guid":{"rendered":"https:\/\/fgiasson.com\/blog\/?p=3947"},"modified":"2024-05-17T08:51:54","modified_gmt":"2024-05-17T13:51:54","slug":"data-reliability-engineering","status":"publish","type":"post","link":"https:\/\/fgiasson.com\/blog\/index.php\/2024\/05\/17\/data-reliability-engineering\/","title":{"rendered":"Data Reliability Engineering"},"content":{"rendered":"\n<p id=\"ember564\" class=\"ember-view reader-content-blocks__paragraph\">I am happy to be able to share one of the things that I have been up to since I started working at Dayforce. What is that thing?<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember565\" class=\"ember-view reader-content-blocks__paragraph\"><strong>Data Reliability Engineering<span class=\"white-space-pre\"> <\/span><\/strong><\/p>\n<p id=\"ember566\" class=\"ember-view reader-content-blocks__paragraph\">I had the opportunity to put in place a new functional area called Data Reliability Engineering. This may look good, but you may wonder what this thing is about.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember567\" class=\"ember-view reader-content-blocks__paragraph\">Data Reliability Engineering (DRE) can be seen as a child of Site Reliability Engineering (SRE). The foundation of DRE is SRE. Organizationally speaking, we embedded DRE in the SRE organization at Dayforce.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember568\" class=\"ember-view reader-content-blocks__paragraph\">DRE is SRE for Machine Learning and Data systems.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember569\" class=\"ember-view reader-content-blocks__paragraph\">A DRE team focuses on, and is responsible for, ensuring that data pipelines, storage, and retrieval systems are reliable, robust, and scalable. 
It borrows principles from software engineering, DevOps, and site reliability engineering (SRE) and applies them to data-intensive systems.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember570\" class=\"ember-view reader-content-blocks__paragraph\">The goal of the team is to ensure that data, which is a critical business asset, is consistently available, accurate, and timely for processes such as auditing, machine learning training, and analysis, and for stakeholders such as data scientists, ML engineers, and data analysts.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember571\" class=\"ember-view reader-content-blocks__paragraph\">A DRE team makes sure that the right Data Service-Level Indicators (DSLIs) are in place and that the Data Service-Level Objectives (DSLOs) and Agreements (DSLAs) are respected and constantly monitored. It also helps automate data movement, increases the observability of data pipelines and data systems, helps manage incidents that affect data availability, and supports teams with all of the above.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember572\" class=\"ember-view reader-content-blocks__paragraph\">Overall, it ensures that the data used to generate analytics reports, machine learning models or any Dayforce features is accurate, reliable, and available on time.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember573\" class=\"ember-view reader-content-blocks__paragraph\">A data reliability engineer (DRE) is a professional responsible for implementing and managing data reliability engineering principles. 
They act as the guardians of data integrity and availability within the organization.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember574\" class=\"ember-view reader-content-blocks__paragraph\">The DRE team acts as a trusted advisor for the company, actively participating in data platform infrastructure design and scalability considerations. Collectively, it carries the same responsibilities: implementing and managing data reliability engineering principles and guarding data integrity and availability within the organization.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember575\" class=\"ember-view reader-content-blocks__paragraph\"><strong>Move Fast by Reducing the Cost of Failure<span class=\"white-space-pre\"> <\/span><\/strong><\/p>\n<p id=\"ember576\" class=\"ember-view reader-content-blocks__paragraph\">DRE helps teams to move fast by reducing the cost of failure of Machine Learning and Data projects. Some will say that it makes for a slow start, but it pays off in the long run. We focus on development velocity over the long term, not on short bursts of work to ship features.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember577\" class=\"ember-view reader-content-blocks__paragraph\">DRE (and SRE) help improve product development output.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember578\" class=\"ember-view reader-content-blocks__paragraph\">How? By reducing the MTTR (Mean Time To Repair) of incidents. That way, developers do not have to waste time cleaning up after them. 
The further down the road we discover bugs,<span class=\"white-space-pre\"> <\/span><a class=\"app-aware-link \" target=\"_self\" href=\"http:\/\/agilemodeling.com\/essays\/costOfChange.htm\" data-test-app-aware-link=\"\" rel=\"noopener\">the more expensive they are to fix<\/a>.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember579\" class=\"ember-view reader-content-blocks__paragraph\">Reliability teams are not here to slow projects down. Quite the opposite: they are here to improve the long-term velocity of projects while increasing their reliability.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember580\" class=\"ember-view reader-content-blocks__paragraph\"><strong>Data Engineer vs. Data Reliability Engineer<span class=\"white-space-pre\"> <\/span><\/strong><\/p>\n<p id=\"ember581\" class=\"ember-view reader-content-blocks__paragraph\">Data Engineers are responsible for developing data pipelines and appropriately testing their code.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember582\" class=\"ember-view reader-content-blocks__paragraph\">Data Reliability Engineers are responsible for supporting the pipelines in production by monitoring the infrastructure and data quality.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember583\" class=\"ember-view reader-content-blocks__paragraph\">In other words, Data Engineering teams usually perform unit and regression tests that address known or predictable data issues before the code goes to production. 
DRE teams instrument the production environment to detect unknown problems before they impact end users.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember584\" class=\"ember-view reader-content-blocks__paragraph\"><strong>What do we do?<span class=\"white-space-pre\"> <\/span><\/strong><\/p>\n<p id=\"ember585\" class=\"ember-view reader-content-blocks__paragraph\">DRE teams aim to set and maintain standards for the accuracy and reliability of production data, while enabling velocity for data, analytics, and machine learning engineers. The DRE team does more than react to machine learning and data outages: it is in charge of preemptively identifying and fixing potential problems, building automated ways of testing and validating data, automatically detecting PII (Personally Identifiable Information) in different areas of the ecosystem, and so on.<span class=\"white-space-pre\"> <\/span><\/p>\n<p id=\"ember586\" class=\"ember-view reader-content-blocks__paragraph\">Areas that DREs have purview over include:<span class=\"white-space-pre\"> <\/span><\/p>\n<ul>\n<li>Data lifecycle procedures (e.g., when and how data gets deprecated)<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Data SLI (Service-Level Indicator), Data SLO (Service-Level Objective), and Data SLA (Service-Level Agreement) definition and documentation<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Data observability strategy and implementation<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Data pipeline code review and testing<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Automation of data movement<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Management of data incidents<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Data outage triage and response process<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Automating data-related processes in the infrastructure to constantly remove toil<span 
class=\"white-space-pre\"> <\/span><\/li>\n<li>Data ownership strategy and documentation<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Education and culture-building (e.g., internal roadshow to explain data SLAs)<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Developing guardrails around data processes to increase data reliability, availability, and privacy<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Monitoring costs of data activities (pipelines, storage, compute, network, etc.)<\/li>\n<li>Track the lineage of the data<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Perform change management when data tooling changes<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Ensure cross-team communication regarding data activities<span class=\"white-space-pre\"> <\/span><\/li>\n<\/ul>\n<ul>\n<li>Ensure PII (Personally Identifiable Information) is properly handled in the data ecosystem<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Ensure the business is compliant with all regulations regarding data (e.g., GDPR)<span class=\"white-space-pre\"> <\/span><\/li>\n<li>Ensure that the Machine Learning models are versioned, reproducible, evaluated, monitored, and compliant with overall software engineering best practices<span class=\"white-space-pre\"> <\/span><\/li>\n<\/ul>\n<p id=\"ember589\" class=\"ember-view reader-content-blocks__paragraph\">DRE teams do not just put out fires. They put the guardrails in place to prevent the fires from happening. They enable agility for ML engineers, analytics engineers, and data scientists, keeping them moving quickly, knowing that safeguards are in place to prevent changes to the data model from impacting production. Data teams are always balancing speed with reliability. 
The Data Reliability Engineer owns the strategies for achieving that balance.<span class=\"white-space-pre\"> <\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am happy to be able to share one of the things that I have been up to since I started working at Dayforce. What is that thing? Data Reliability Engineering I had the opportunity to put in place a new functional area called Data Reliability Engineering. This may look good, but you may [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[293,309],"tags":[327,326],"class_list":["post-3947","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","category-mlops","tag-data-reliability-engineering","tag-dre"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3947","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=3947"}],"version-history":[{"count":1,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3947\/revisions"}],"predecessor-version":[{"id":3948,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3947\/revisions\/3948"}],"wp:attachment":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=3947"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/w
p\/v2\/categories?post=3947"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=3947"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}