Production ML Papers to Know
Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.
The challenge of changing user preferences
One of the biggest challenges in recommender systems is non-stationarity. Users' tastes and behaviors change (often as a result of the predictions your model makes!), and as they do, the data distribution changes, degrading the performance of your model.
The key innovation in Monolith, TikTok’s large-scale recommendation system, is that it can respond quickly to changes in preferences by using an online training system.
Here’s how they do it.
First, both features and user actions are needed to train the model, but they arrive at different times from different parts of the system. Monolith handles this by logging each to separate Kafka queues, and joining them using an online joiner module written in Flink.
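To make the joining step concrete, here is a minimal in-memory sketch in Python. It is not Monolith's Flink joiner: the `OnlineJoiner` class, the `request_id` join key, and the `max_pending` eviction limit are all assumptions made for illustration. It only shows the idea of holding features until the matching user action arrives.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TrainingExample:
    request_id: str
    features: Dict[str, float]
    label: int


class OnlineJoiner:
    """Pairs feature records with user actions that arrive later.

    Hypothetical sketch: records are matched on a request ID, and the
    oldest unmatched features are dropped once the buffer is full.
    """

    def __init__(self, max_pending: int = 10_000) -> None:
        self.max_pending = max_pending
        self._pending: "OrderedDict[str, Dict[str, float]]" = OrderedDict()

    def on_features(self, request_id: str, features: Dict[str, float]) -> None:
        # Buffer the features until the user's action shows up.
        self._pending[request_id] = features
        if len(self._pending) > self.max_pending:
            self._pending.popitem(last=False)  # evict the oldest entry

    def on_action(self, request_id: str, label: int) -> Optional[TrainingExample]:
        # Emit a training example only if the features are still buffered.
        features = self._pending.pop(request_id, None)
        if features is None:
            return None
        return TrainingExample(request_id, features, label)
```

In a real deployment the two `on_*` methods would be driven by consumers of the feature and action queues, and features for late-arriving actions would need somewhere more durable than memory to wait.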
Next, a training worker picks up training examples and performs training. One of the clever parts of the architecture is that it always uses the same training worker, whether you’re building a new model with a batch of historical data or you’re updating an existing model online.
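The "same worker for batch and online" idea can be sketched as a training function that only sees a stream of examples and does not care where they come from. The toy linear model, the `train` and `drain` names, and the use of Python's `queue.Queue` as a stand-in for a live stream are all assumptions for illustration, not details from the paper.

```python
import queue
from typing import Dict, Iterable, Iterator, Tuple

Example = Tuple[Dict[str, float], float]  # (features, label)


def train(examples: Iterable[Example], weights: Dict[str, float], lr: float = 0.1) -> int:
    """Run SGD steps for a toy linear model over whatever stream it is given."""
    steps = 0
    for features, label in examples:
        pred = sum(weights.get(name, 0.0) * value for name, value in features.items())
        err = pred - label
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) - lr * err * value
        steps += 1
    return steps


def drain(q: "queue.Queue[Example]") -> Iterator[Example]:
    """Yield the examples currently waiting in a live queue."""
    while True:
        try:
            yield q.get_nowait()
        except queue.Empty:
            return
```

Batch training would call `train(historical_examples, weights)` with a list read from storage, while online training would call `train(drain(live_queue), weights)` repeatedly on the same weights.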
As model parameters continue to change on-the-fly, they need to be periodically synchronized with the model server. This presents two technical challenges. First, there can’t be a gap in model serving, and second, they need to avoid transferring the multi-terabyte set of model parameters over the network for each update.
Monolith handles this by frequently pushing only the sparse parameters of the embedding tables, which make up a large part of the DNN, so each update sent across the network is relatively small. The dense parameters of the DNN weights are updated less frequently. According to the paper, this inconsistency between sparse and dense updates did not lead to a loss of model performance.
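One way to picture the small sparse update is a training-side store that remembers which embedding rows changed since the last sync and ships only those. This is a sketch under assumptions: the class names, the "touched ID" bookkeeping, and the plain-dict storage are illustrative and are not taken from Monolith's code.

```python
from typing import Dict, List, Set


class TrainingParameterServer:
    """Holds embeddings being trained and tracks rows changed since the last sync."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim
        self.embeddings: Dict[int, List[float]] = {}
        self._touched: Set[int] = set()

    def apply_gradient(self, feature_id: int, grad: List[float], lr: float = 0.1) -> None:
        row = self.embeddings.setdefault(feature_id, [0.0] * self.dim)
        for i, g in enumerate(grad):
            row[i] -= lr * g
        self._touched.add(feature_id)  # remember that this row changed

    def pop_delta(self) -> Dict[int, List[float]]:
        """Return only the rows changed since the previous call, then reset."""
        delta = {fid: list(self.embeddings[fid]) for fid in self._touched}
        self._touched.clear()
        return delta


class ServingParameterServer:
    """Keeps serving from its current table while deltas are merged in."""

    def __init__(self) -> None:
        self.embeddings: Dict[int, List[float]] = {}

    def apply_delta(self, delta: Dict[int, List[float]]) -> None:
        self.embeddings.update(delta)
```

A sync loop would call `serving.apply_delta(training.pop_delta())` on a short interval. The payload then scales with the number of recently updated IDs rather than with the full table.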
Monolith was tested in a series of experiments, which found that real-time online training consistently improved model quality, and that models with shorter parameter synchronization intervals performed better than those with longer intervals.
The paper also covers an interesting approach to the sparse, categorical and dynamic nature of user data, which can cause the embedding tables used to represent this data to become "enormous."
Hashing is typically used to solve this problem, but this can result in collisions that reduce model quality. Monolith addresses this through a collisionless hash table that has the elasticity to adjust as embeddings grow, and which was shown in the paper to consistently outperform models which use collision-based approaches.
The collisionless hash table is based on cuckoo hashing. The figure below illustrates how it works: it maintains two tables, 𝑇0 and 𝑇1, each with its own hash function, h0(𝑥) and h1(𝑥). An element can be stored in either table. A new element is first placed at its slot in 𝑇0. If that slot is already occupied, the existing occupant is evicted and reinserted at its slot in the other table, and this process continues until all elements are stabilized.
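The eviction loop is easier to follow in code. Below is a small, self-contained Python sketch of cuckoo hashing. It is not Monolith's implementation: the two hash functions, the `max_kicks` limit, and the choice to double the capacity and reinsert everything when an insertion appears to cycle are all assumptions made for illustration.

```python
from typing import Any, List, Optional, Tuple

Entry = Tuple[Any, Any]  # (key, value)


class CuckooHashTable:
    def __init__(self, capacity: int = 8, max_kicks: int = 32) -> None:
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables: List[List[Optional[Entry]]] = [[None] * capacity, [None] * capacity]

    def _slot(self, which: int, key: Any) -> int:
        # Two different hash functions, one per table (illustrative choices).
        if which == 0:
            return hash(key) % self.capacity
        return hash((0x9E3779B9, key)) % self.capacity

    def get(self, key: Any) -> Any:
        # A key can only live in one of two places, so lookup checks both.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key: Any, value: Any) -> None:
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            idx = self._slot(which, key)
            entry = self.tables[which][idx]
            if entry is not None and entry[0] == key:
                self.tables[which][idx] = (key, value)
                return
        entry = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            idx = self._slot(which, entry[0])
            evicted = self.tables[which][idx]
            self.tables[which][idx] = entry
            if evicted is None:
                return  # everything has settled
            entry = evicted       # the evicted element must find a new home
            which = 1 - which     # ... in the other table
        self._grow(entry)         # probable cycle: enlarge and reinsert

    def _grow(self, pending: Entry) -> None:
        old = [e for table in self.tables for e in table if e is not None]
        old.append(pending)
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)
```

Because every key owns its slot outright, two IDs never share an embedding, which is the property the paper relies on. The cost is the occasional chain of evictions or a resize on insert.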
There is also a focus on memory footprint reduction through ID filtering. IDs that appear only a handful of times, or have been inactive for a period of time, are filtered out, with the threshold for filtering treated as a tunable hyperparameter during model training.
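The two filtering rules can be sketched as a small bookkeeping class. The `IdFilter` name, the `min_count` and `max_idle_seconds` parameters, and their default values are hypothetical. The sketch only illustrates admitting an ID after it has been seen enough times and dropping IDs that have gone quiet.

```python
from typing import Dict, Set


class IdFilter:
    """Decides which IDs deserve an embedding row (illustrative sketch)."""

    def __init__(self, min_count: int = 5, max_idle_seconds: float = 7 * 24 * 3600) -> None:
        self.min_count = min_count                # occurrence threshold (tunable)
        self.max_idle_seconds = max_idle_seconds  # inactivity threshold (tunable)
        self._counts: Dict[int, int] = {}
        self._last_seen: Dict[int, float] = {}

    def observe(self, feature_id: int, now: float) -> bool:
        """Record one occurrence; return True once the ID has been seen often enough."""
        self._counts[feature_id] = self._counts.get(feature_id, 0) + 1
        self._last_seen[feature_id] = now
        return self._counts[feature_id] >= self.min_count

    def expire(self, now: float) -> Set[int]:
        """Forget IDs that have been inactive for too long and return them."""
        stale = {
            fid for fid, seen in self._last_seen.items()
            if now - seen > self.max_idle_seconds
        }
        for fid in stale:
            del self._counts[fid]
            del self._last_seen[fid]
        return stale
```

A training loop would create an embedding row only when `observe` returns `True`, and would delete the rows for any IDs returned by a periodic `expire` call.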
This paper provides a fascinating insight into how recommender systems operate at an industrial scale, and how companies like ByteDance are driving improvement in their operations.
You may not be operating at TikTok scale, but if you work with rapidly-changing user data, you might want to consider moving to something like their online training approach.
The paper is here.