Production ML Papers to Know

This is a continuation of Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

This blog post started life as part of our weekly newsletter in the first week of September, so you may have caught the material already.

Fixes that Fail: Self-Defeating Improvements in Machine Learning Systems

Machine learning is undergoing a modularity revolution.

ML purists are trained to appreciate end-to-end models, like a self-driving car that maps raw sensor inputs directly to motor commands and is trained directly to get from point A to point B, avoid collisions, etc.

However, in the real world, ML systems are increasingly composed of several (or even thousands of) models working together.

A modular approach to self-driving might involve 3 models just for obstacle detection.

  • Depth estimation model: convert raw images from the cameras to point clouds
  • Car detection: use the point cloud to find the location of all cars in the scene
  • Pedestrian detection: use both the point cloud and the raw image to locate all pedestrians in the scene

Modularity has some huge advantages.

  • Cost savings. You can use off-the-shelf components from providers like HuggingFace or OpenAI rather than rolling your own
  • Easier debugging. Let's say the car makes a wrong turn. What's easier to debug: an end-to-end neural network that directly maps pixels to actions, or a pipeline of models with human-understandable outputs like point clouds or the presence of a pedestrian?
  • Separation of concerns. It's hard to scale up the number of people working on a single model. With modularity, different teams can work on different models
  • Flexible deployment. You don't need to redeploy the whole system to update one model. E.g., language models may not need to be updated as frequently as a model that describes user preferences

Modularity also has a big disadvantage, as first observed in the High Interest Credit Card paper: a change to one model can quietly degrade the models that depend on it.

A common example is seeing performance degrade because retraining a model changes its output distribution.

For example, say you're building a recommendation pipeline that depends on a user embedding model. Retraining that model can break downstream recommendations, even if it's better, because the downstream models are trained expecting the distribution of embeddings produced by the old model.

What if you just retrain all downstream models whenever an upstream model changes? It may not be the most efficient solution, but it will avoid many of these kinds of errors.
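To make the failure and the fix concrete, here is a minimal sketch in NumPy. Everything in it is hypothetical: the synthetic "users", the two embedding functions `embed_v1` and `embed_v2`, and the least-squares classifier standing in for a downstream recommender. The retrained embedding carries exactly the same information as the old one, just in a flipped coordinate system, which is one simple way an output distribution can shift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2-D latent user preferences and a binary "clicked" label.
n_users = 2000
latent = rng.normal(size=(n_users, 2))
clicked = (latent[:, 0] > 0).astype(float)

def embed_v1(z):
    # Old upstream model: embeddings are the latent features themselves.
    return z

def embed_v2(z):
    # "Retrained" upstream model: same information, but every axis is flipped.
    return -z

def fit_linear(x, y):
    # Least-squares linear classifier standing in for the downstream model.
    x1 = np.column_stack([x, np.ones(len(x))])
    w, *_ = np.linalg.lstsq(x1, y, rcond=None)
    return w

def accuracy(w, x, y):
    x1 = np.column_stack([x, np.ones(len(x))])
    return float(np.mean((x1 @ w > 0.5) == (y > 0.5)))

train, test = slice(0, 1000), slice(1000, 2000)

# Downstream model trained on the old embeddings.
w_old = fit_linear(embed_v1(latent[train]), clicked[train])
acc_before = accuracy(w_old, embed_v1(latent[test]), clicked[test])

# Upstream model swapped, downstream model left untouched.
acc_stale = accuracy(w_old, embed_v2(latent[test]), clicked[test])

# Downstream model retrained on the new embeddings.
w_new = fit_linear(embed_v2(latent[train]), clicked[train])
acc_retrained = accuracy(w_new, embed_v2(latent[test]), clicked[test])

print(f"before swap: {acc_before:.2f}")
print(f"stale downstream: {acc_stale:.2f}")
print(f"retrained downstream: {acc_retrained:.2f}")
```

In this toy setup the stale downstream model does far worse than chance, and retraining it on the new embeddings recovers the original accuracy. Real embedding shifts are usually subtler, but the mechanism is the same.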

Self-Defeating Improvements in ML Systems

The paper we're considering this week characterizes situations where improving a model makes the system worse, even if you retrain all models.

The authors decompose the error of a two-model pipeline into three parts:

  • Upstream error
  • Downstream approximation error
  • Downstream estimation error

Upstream error occurs when upstream improvements obscure information needed downstream.

For example, using a squared error metric for depth estimation will heavily penalize errors for distant objects. Improving the squared error might improve predictions for distant objects at the expense of close ones. If the downstream model needs good depth estimations up close to perform its task well, we have an example of loss mismatch.
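A quick back-of-the-envelope calculation shows why squared error is dominated by distant objects. This sketch assumes, purely for illustration, that the depth model is off by the same relative amount (10%) at every distance; the function name and the specific distances are made up.

```python
def squared_error(true_depth_m, relative_error):
    # Assumption: the predicted depth is off by a fixed fraction of the true depth.
    predicted = true_depth_m * (1.0 + relative_error)
    return (predicted - true_depth_m) ** 2

near = squared_error(5.0, 0.10)   # 10% error on an object 5 m away
far = squared_error(50.0, 0.10)   # 10% error on an object 50 m away
ratio = far / near

print(f"near: {near:.2f} m^2, far: {far:.2f} m^2, ratio: {ratio:.0f}x")
```

Under that assumption, an object ten times farther away contributes roughly a hundred times more to the loss, so an optimizer minimizing squared error has a strong incentive to trade nearby accuracy for distant accuracy.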

Downstream approximation error occurs when the downstream model's architecture can't take advantage of better upstream predictions.

The figure below shows a toy example of how this could make the overall performance worse.

Finally, downstream estimation error occurs when the real world gets in the way: a better downstream model exists, but you can't find it in practice with finite data and an imperfect optimizer.

When you have more than two models, there are even more failure modes.

But does any of this occur in practice?


The authors demonstrate that it can, by building the pedestrian and car detection pipeline described above.

They show that improving the depth estimation model leads to worse detection performance. Ouch!

In this case, they conclude that the degradation in car detection performance is likely due to upstream error. This makes sense: I suspect upstream error is the most likely of the error types to occur in practice with modern deep-learning-based pipelines.

Surprisingly though, they find that the degradation in pedestrian detection performance is likely due to one of the other error types.

The upshot

What does this mean for us as builders of ML-powered products?

  • Remember that modularity is powerful but adds complexity. Modularity in traditional software is an imperfect analogy, because in ML, Changing Anything Changes Everything
  • A simple but imperfect fix to avoid many of the most common errors in ML pipelines is to retrain downstream models when upstream models change
  • As your pipelines grow in complexity, it's critical to also invest in measurement. Good measurement can help you detect the kind of issues described in this paper, as well as the many other kinds you will discover as your models interact with end-users
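One lightweight way to act on the retraining advice above is to record which upstream versions each model was trained against and flag any model whose upstream has since moved on. The sketch below is an assumption about how that bookkeeping might look, not something from the paper; `ModelRecord`, `stale_models`, and the registry contents are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    name: str
    version: int
    # Maps upstream model name -> upstream version seen at training time.
    trained_against: dict = field(default_factory=dict)

def stale_models(registry):
    """Return names of models whose upstream versions changed since training."""
    stale = []
    for record in registry.values():
        for upstream, seen_version in record.trained_against.items():
            if registry[upstream].version != seen_version:
                stale.append(record.name)
                break
    return stale

# Hypothetical registry: the depth model was just bumped to version 2.
registry = {
    "depth": ModelRecord("depth", version=2),
    "car_detector": ModelRecord("car_detector", 1, {"depth": 1}),
    "pedestrian_detector": ModelRecord("pedestrian_detector", 1, {"depth": 2}),
}

needs_retraining = stale_models(registry)
print(needs_retraining)
```

Here only the car detector is flagged, because it was trained against the previous depth model. As the paper shows, retraining alone does not rule out self-defeating improvements, so a check like this complements end-to-end measurement rather than replacing it.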

Check out the paper here: