Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.

Why do ML Projects Fail?

The thesis of today’s paper (from Weights and Biases) is that the hard parts of doing ML are (i) enabling people to succeed and (ii) having well-designed processes in place. Of course, (iii) the technology platform matters, but to a lesser degree.

Let’s talk through some of the main failure modes to avoid.


To be honest, you probably already know how to help your people succeed at delivering ML. That doesn’t stop most teams from making some common mistakes.

First, the authors observe that people perform better when their roles are clearly defined and they sit in the right part of the org (whether that’s embedded in the business unit or in a separate team). That sounds pretty obvious, yet it’s still common for everyone on the team to cover every part of ML, from data cleaning to modeling to platform engineering.

Second, the authors assert that ML requires a different approach to project management that views ML delivery as R&D and not business-as-usual. That’s because model performance requirements can be unknown, progress non-linear, and breakthroughs hard to predict, making estimation of timelines and business value hard.

For those of us in ML day-to-day that won’t come as a surprise, but many teams still fail by focusing on low-risk (and low-ROI) tasks, and by using management techniques better suited to software projects, such as agile and scrum.


Like project planning and team organization, processes are the fruit and vegetables of successful ML delivery. You already know you’re supposed to be doing these things, but it’s hard to be consistent.

For example, ML teams need to understand the business value a project can achieve. So the authors recommend doing a scoping and opportunity sizing process before starting the project.

The paper includes a framework, originally provided by, that covers the business drivers to consider upfront:

The paper also points out that “ML is often used to optimize operational decisions. However, models only provide predictions, not business decisions.” So the authors recommend doing a “decision optimization” process to calibrate model outputs to business decisions.

This might be as straightforward as recognizing that the cost of false positives for a classification model is high, and tweaking the decision threshold accordingly; or more involved, for example when a model output is used to inform a profit curve that underpins decision making.
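To make the simple case concrete, here is a minimal sketch of cost-based threshold selection. It is not taken from the paper: the function names, the toy scores and labels, and the cost values are all assumptions chosen for illustration. The idea is to pick the threshold that minimizes total business cost on a labeled validation set, rather than defaulting to 0.5.

```python
from typing import Sequence


def total_cost(scores: Sequence[float], labels: Sequence[int],
               threshold: float, fp_cost: float, fn_cost: float) -> float:
    """Business cost of acting on predictions at the given threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            cost += fp_cost  # acted when we should not have
        elif not predicted_positive and label == 1:
            cost += fn_cost  # failed to act when we should have
    return cost


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   fp_cost: float, fn_cost: float) -> float:
    """Return the candidate threshold with the lowest total cost."""
    # Only the observed scores (plus "predict nothing") can change the outcome.
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost))


# Hypothetical validation data: model scores and true labels.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]

# Assumed costs: a false positive is five times as costly as a false negative.
print(best_threshold(scores, labels, fp_cost=5.0, fn_cost=1.0))
# Reversing the costs pushes the threshold down.
print(best_threshold(scores, labels, fp_cost=1.0, fn_cost=5.0))
```

With expensive false positives the chosen threshold moves up, and with expensive false negatives it moves down. A profit curve follows the same pattern, with the cost function replaced by expected profit at each threshold.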

Another interesting process suggestion is to use governance to find and address potential sources of bias. Drawing on another paper, A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle, the authors illustrate seven types of machine learning bias that may negatively impact a model:

Each type of bias has different causes and different potential mitigations, so it’s helpful to have a governance process that is focused on the project as a whole, not any one particular piece (like model evaluation).


Though the authors believe that people and processes are the most effective lever for accelerating ML teams, infrastructure plays a role as well.

When selecting tools for an end-to-end platform, keep in mind that there’s no one-size-fits-all solution. Factors that affect the right tool stack include the skill sets in your org (for example, if Kubernetes is required, does the team know it well enough?), the existing infrastructure, and whether a set of specialized point solutions is better than a single monolithic one.

As an editorial note, I’d add that the product you’re trying to build with ML has a bigger impact on the right tools than any of the factors outlined above. There is no monolithic MLOps stack — the right tools for building a lead segmentation model are completely different than those needed to build a chatbot.

The paper also highlights the need to identify the right level of abstractions for data scientists, so that they can focus on high-value activities. This is tricky because of a “fundamental mismatch between how much infrastructure is needed at different parts of the stack and what data scientists care about”, as illustrated by the chart below:

Data scientists tend to care less about capabilities that require more technological infrastructure. Source: Effective Data Science Infrastructure

Generally speaking, data scientists want to spend more time on model development and feature engineering, which means choosing tooling at a higher level of abstraction in other parts of the stack.

So what?

MLOps has been great for the industry because it has led to increased focus on getting models out of the lab and into production. But MLOps is also misguided, because it puts the focus on tools and infrastructure rather than people and products.

This paper’s contribution is moving beyond the usual focus on the technology and tools required to support MLOps, to areas such as organization design, managing projects, and determining business value.

The paper can be found here.