An ML design task can broadly be broken down into the following components, in order of priority:

  • Defining success metrics
  • Data pipelines
  • Backend model design
  • Online model design
  • Model evaluation
  • Serving
  • Model monitoring

Defining success metrics

The most important aspect of building an ML system for a specific task is identifying the success criteria. For a recommendation system, this could be user click-through rate or engagement, while for a predictive system it could be as simple as mean absolute error. Equally important are the operational requirements of the model, such as throughput, latency, scalability, and availability.
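
As a concrete illustration, here is a minimal sketch of both kinds of metric in plain Python with NumPy (the function names are mine, not from any particular library):

```python
import numpy as np

def mean_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Offline success metric for a predictive system."""
    return float(np.mean(np.abs(y_true - y_pred)))

def click_through_rate(clicks: int, impressions: int) -> float:
    """Online success metric for a recommendation system."""
    return clicks / impressions if impressions > 0 else 0.0

# Predictions off by 0.5 on average -> MAE of 0.5
print(mean_absolute_error(np.array([1.0, 2.0]), np.array([1.5, 2.5])))
# 30 clicks out of 1000 impressions -> CTR of 3%
print(click_through_rate(30, 1000))
```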

Data pipelines

These define how data for training or inference flows in, how it is ingested, and what pipelines are set up for ETL. At this stage, we also need to think about the refresh rate, how the data is stored, the optimal way to fetch it, and so on. For example, to train a recommendation system one might need past click data, and useful features might include user-specific metadata (gender, ethnicity, occupation, income, age group, etc.) and environment-specific attributes (time, location, usage history, platform, etc.); each of these can be stored separately and have its own pipeline for data cleaning and normalization (a minimal sketch follows the notes below). Note that not all of these features will be available in the raw data, and trade-offs will have to be made to either collect these attributes or abandon them altogether.

  • If data is collected in batches and refreshed on some cadence, one also needs to think about data versioning.
  • Another important consideration is ethics, in particular data anonymity and privacy.
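
To make the per-feature pipelines concrete, here is a minimal sketch of a cleaning/normalization step, assuming the user metadata lands in a pandas DataFrame (the column names are hypothetical):

```python
import pandas as pd

def clean_user_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical ETL step for the user-metadata pipeline."""
    df = df.dropna(subset=["user_id"])                   # the join key must exist
    df["age_group"] = df["age_group"].fillna("unknown")  # tolerate missing attributes
    df["income"] = df["income"].fillna(df["income"].median())
    # z-score normalization for a numeric feature
    df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()
    return df
```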

For the usage history of a given user, we can keep an array of the 100 most recently watched/viewed/played items. Each item can itself be an embedding into a high-dimensional space. Each item can also be weighted by the delta between the user’s engagement with the item and the average user’s engagement. For example, if a user’s watch fraction for a certain video is 95% while the average user’s watch fraction is only 50%, that is a strong signal that the user likes this kind of video, and hence this particular array item should be given a higher weight.
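
A sketch of this weighting scheme, assuming each history item carries the user’s watch fraction and the item’s average watch fraction across all users (shapes and names are illustrative):

```python
import numpy as np

def weight_history(item_embeddings: np.ndarray,      # shape (100, d)
                   user_watch_fraction: np.ndarray,  # shape (100,)
                   avg_watch_fraction: np.ndarray    # shape (100,)
                   ) -> np.ndarray:
    """Up-weight items the user engaged with far more than the average user.
    E.g. watch fraction 0.95 vs. an average of 0.50 -> delta 0.45 -> weight 1.45."""
    weights = 1.0 + (user_watch_fraction - avg_watch_fraction)
    return item_embeddings * weights[:, None]
```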

Backend model design

If the task at hand involves user engagement and personalization, the ML task can be broken into two components: a backend model component that operates at a lower refresh rate (high latency) and an online model component that needs to have low latency. These latency requirements constrain the model architectures that can be used; for the online task, a small, lightweight model would be the right choice. The backend model will usually process data in batch, i.e. for thousands of users at a time (a minimal sketch follows the notes below).

  • It is always useful to have a simple model that can be treated as a baseline against which to measure any more complex model.
  • One also needs to think about developing some kind of model registry to keep track of model requirements and versioning.
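
A minimal sketch of what a backend batch pass might look like, with mean-pooling standing in as the simple baseline (the shapes and the pooling choice are my own assumptions):

```python
import numpy as np

def backend_batch_update(history_embeddings: np.ndarray) -> np.ndarray:
    """Reduce each user's weighted history (batch, 100, d) to one state
    vector (batch, d). A real system would use a learned model here;
    mean-pooling is the trivial baseline to measure it against."""
    return history_embeddings.mean(axis=1)

# Runs at a low refresh rate over thousands of users at a time:
states = backend_batch_update(np.random.rand(1_000, 100, 64))  # (1000, 64)
```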

Online model design

The online model only needs to process a single item (batch_size=1) at a time. At the start of each session, the current state is simply the output of the last run of the backend model, $A^{[l-1]}$. Each interaction can be thought of as a new data-point token $x^{[l]}$. The job of the online model is then to compute the new state from the previous state and the new data point in some fashion. In the simplest case, this could just be a linear model applied to the concatenation of the two: $A^{[l]} = W \, [A^{[l-1]}; x^{[l]}]$. The interaction data point is also backed up into a database or an object store of some kind, to be able to provide users a history of their interactions.
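
A minimal sketch of one online step under these assumptions (the dimension and random weights are placeholders for learned parameters):

```python
import numpy as np

d = 64                        # state / token embedding dimension (illustrative)
W = np.random.rand(d, 2 * d)  # learned weights in a real system

def online_update(prev_state: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One online step: A[l] = W [A[l-1]; x[l]], for batch_size = 1."""
    return W @ np.concatenate([prev_state, x])

state = np.zeros(d)           # A[l-1], from the last backend run
state = online_update(state, np.random.rand(d))
```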

Model evaluation

Segmenting the metrics by features such as country, age group, or new vs. returning users is required to make sure the model performs on par across user types.
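
With pandas, this kind of segmented evaluation is a one-liner once the predictions are in a DataFrame (the columns below are hypothetical):

```python
import pandas as pd

# One row per prediction, with the segmenting feature as a column.
df = pd.DataFrame({
    "country": ["US", "US", "IN", "IN"],
    "y_true":  [1.0, 2.0, 3.0, 4.0],
    "y_pred":  [1.5, 2.0, 2.0, 4.5],
})
df["abs_err"] = (df["y_true"] - df["y_pred"]).abs()

# MAE per segment; a large gap between segments flags uneven performance.
print(df.groupby("country")["abs_err"].mean())
```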

Serving

Model monitoring

Models deteriorate over time, whether due to covariate shift or biases. It is therefore important to track relevant monitoring metrics. This allows us to identify whether a retraining needs to be done or whether the model architecture as a whole needs to be updated.
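
One common way to monitor covariate shift is the population stability index (PSI) between a feature’s training-time distribution and live traffic; here is a sketch (the bin count and the ~0.2 rule of thumb are conventions, not hard rules):

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between training-time and live feature distributions.
    Values above ~0.2 are often read as significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))
```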

A useful approach to handle catastrophic model-quality deterioration is to keep a separate benchmark model that is also continuously trained. Once the primary model’s performance drops below a certain threshold, one can quickly switch over to the benchmark model.
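
The switch itself can be as simple as a threshold check in the serving path (the names and the higher-is-better metric convention are illustrative):

```python
def pick_model(primary_metric: float, primary_model, benchmark_model,
               threshold: float = 0.8):
    """Route traffic to the continuously trained benchmark model whenever
    the primary model's monitored metric drops below the threshold."""
    return primary_model if primary_metric >= threshold else benchmark_model
```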