12 Symptoms of Hidden Technical Debt in Your ML Project
About a year ago I stumbled upon a paper called “Machine Learning: The High-Interest Credit Card of Technical Debt”, written by brilliant engineers from Google in 2014 (it also has a newer version with practically identical content, “Hidden Technical Debt in Machine Learning Systems”). I found the ideas there very practical. Every time I come back to this paper, it offers me a fresh perspective on my current projects. So, I decided to rethink this paper in the form of a list of symptoms which can be used to assess the current or expected level of technical debt in an ML project and to create an action plan suitable for the corresponding situation.
I added some comments in italics and proposed several solutions (💡) which were not considered in the paper.
What is technical debt
Technical debt is a metaphor for the amount of extra work your team will have to do in the future because you chose a quick solution right here, right now.
Technical debt is not a curse, and it is absolutely normal to increase it at the early stages of a new project to deliver results faster. Yet, it is very useful to have some guidelines that help identify where technical debt can emerge, so you can make informed decisions.
Symptoms of ML-Project Technical Debt
- Your model relies on many data sources
- Your model affects its own input data
- Your prediction service has undeclared consumers
- Your model relies on an unstable data source
- You are using features with little to no performance impact
- You have a lot of “glue code” because of a specific ML-package
- Data preparation stages turned into pipeline jungles
- You are mixing dead experimental code with working code
- Your configuration files are very complex
- You have chosen a threshold for your model manually
- Your model relies on non-causal correlations
- Your ML-system monitoring and testing require improvements
❓ Your model relies on many data sources
❗ If one of the features changes its distribution, the prediction behavior might change drastically (the CACE principle: Changing Anything Changes Everything).
✅ Isolate models based on different sources and serve ensembles. In some cases this solution may scale poorly and add the cost of maintaining separate models. A case from my practice: my team used a combination of a linear model and a boosting algorithm to make predictions for out-of-sample objects.
✅ Gain a deep understanding of your data. For example, you may build several models on various slices of your data and inspect the metrics you receive. This is excellent advice in any situation: the better you understand your data and your model, the fewer surprises you are going to face.
✅ Add a regularization term that penalizes divergence from the prior model’s predictions. This increases the chances that your model will reach the same local minimum it converged to on the previous run.
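The anchoring idea above can be sketched as a simple penalty added to the task loss. This is a minimal illustration, assuming mean squared error as the task loss; all names are mine, not from the paper:

```python
def anchored_loss(y_true, y_pred, y_prior, lam=0.1):
    """Task loss (MSE) plus a penalty for diverging from the prior model's predictions."""
    n = len(y_true)
    task = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    anchor = sum((q - p) ** 2 for q, p in zip(y_prior, y_pred)) / n
    return task + lam * anchor
```

With `lam = 0` this reduces to the ordinary task loss; the larger `lam` is, the more the new model is pulled toward the previous model's behavior.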
❓ Your model affects its own input data
❗ This may make it difficult to analyse system performance and predict its behavior before release. In the worst case, this feedback loop can be hidden.
✅ Isolate certain parts of data from the influence of your model.
✅ Identify hidden input loops and get rid of them. In general, this requires understanding the origins of the data you use.
❓ Your prediction service has undeclared consumers (aka visibility debt)
❗ Any change to your model will probably break these silently dependent systems.
❗ It may create a hidden input loop if an undeclared consumer produces input data for your model.
✅ Use an automated feature management tool to annotate data sources and build dependency trees. I believe the authors were describing the feature store concept before it became widespread.
✅💡 Make your service private, so any consumer within your organization has to inform you of their intention to use your model’s output.
✅💡 Support the old version of your prediction service for some time and announce any changes long in advance.
❓ Your model relies on an unstable data source (e.g. another model)
❗ Changes in the input data source may cause unexpected behavior of your model.
✅ Create a versioned copy of the unstable input data and use it until the updated version has fully stabilized.
✅ Add more training data so the first ML model learns to handle your use case itself.
✅💡 Use the first model’s input features to train your own model, rather than consuming its output.
❓ You are using features with little to no performance impact
❗ The more features you have, the higher the risk that one of them will change and corrupt your model’s performance.
✅ Regularly evaluate the effect of removing individual features from a model.
✅ Develop cultural awareness of the lasting benefit of cleaning up underutilized dependencies.
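A leave-one-out ablation over features is easy to automate. Below is a toy sketch; `evaluate` stands for whatever trains and scores your model on a feature subset (a hypothetical callback, not a real API):

```python
def ablation_report(features, evaluate):
    """Score drop caused by removing each feature; near-zero drops are removal candidates."""
    baseline = evaluate(features)
    report = {}
    for feature in features:
        subset = [f for f in features if f != feature]
        # Positive value: the feature helps; ~0: little-to-no impact.
        report[feature] = baseline - evaluate(subset)
    return report
```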
❓ You have a lot of “glue code” because of a specific ML-package
❗ It may turn into a tax on innovation: switching to another machine learning package would become very expensive.
✅ Re-implement algorithms from a general-purpose package to satisfy your specific needs. This may look costly, but sometimes it is the easiest solution in terms of understanding, testing and maintaining your code. For example, my team implemented a common interface for all data transformers and rewrote code from general-purpose packages like sklearn to fit this interface.
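As an illustration of such a shared interface, here is a minimal sketch mirroring sklearn's fit/transform convention (the names are mine, not my team's actual code):

```python
class Transformer:
    """Common interface every data transformer implements."""
    def fit(self, data):
        return self
    def transform(self, data):
        raise NotImplementedError

class Standardizer(Transformer):
    """A re-implemented z-scoring step living behind the shared interface."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        variance = sum((x - self.mean) ** 2 for x in data) / len(data)
        self.std = variance ** 0.5 or 1.0  # guard against zero variance
        return self
    def transform(self, data):
        return [(x - self.mean) / self.std for x in data]
```

Because every step obeys the same two methods, pipelines can be composed and tested without caring which package a step originally came from.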
❓ Data preparation stages turned into pipeline jungles
❗ Complicated pipelines are difficult to test and maintain.
✅ Do not separate researchers and engineers; they should work together and ideally be the same person.
✅💡 Use data engineering/MLOps tools, which have become popular nowadays. My team uses Airflow and DVC in almost every project, which helps us easily manage our data pipelines.
❓ You are mixing dead experimental code with working code
❗ Unused code paths increase system complexity, causing a whole range of negative effects, from maintenance difficulties to unexpected system behavior.
✅ Build a healthy ML system which isolates experimental code well. E.g., DVC encourages using separate branches for separate experiments. By doing so, you nip this problem in the bud.
❓ Your configuration files are very complex
❗ Errors in configuration files are a common source of costly mistakes because configs are usually not tested properly and are treated lightly by engineers.
✅ Validate data passed via configs using assertions; e.g., pydantic may help with that.
✅ Carefully review changes in configuration files.
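A lightweight way to fail fast on bad configs, sketched with the standard library (pydantic gives you the same idea declaratively, plus type coercion). The field names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    learning_rate: float
    n_estimators: int
    threshold: float

    def __post_init__(self):
        # Catch nonsensical values at load time, not in production.
        assert self.learning_rate > 0, "learning_rate must be positive"
        assert self.n_estimators >= 1, "need at least one estimator"
        assert 0.0 <= self.threshold <= 1.0, "threshold must be a probability"
```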
❓ You have chosen a threshold for your model manually
❗ This threshold may become invalid once the model is retrained on new data.
✅ Let your ML system learn the threshold on holdout data.
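For a binary classifier, learning the threshold can be as simple as a grid search over holdout predictions. A pure-Python sketch, using F1 as an illustrative criterion (pick whatever metric matters for your task):

```python
def learn_threshold(scores, labels):
    """Return the score threshold that maximizes F1 on holdout data."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

Re-running this after every retraining keeps the threshold consistent with the new score distribution.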
❓ Your model relies on non-causal correlations
❗ Non-causal correlations may appear randomly or hold only temporarily, so it is extremely risky to rely on them.
✅ Avoid using illogically correlated features. Luckily, the field of causal and explainable ML is developing rapidly nowadays.
✅💡 Check that newly introduced features affect the results in a way you can explain; e.g., you may use a combination of domain knowledge and Shapley values for that.
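A crude but useful sanity check is to probe the direction of a feature's effect and compare it with domain expectations (Shapley values, e.g. via the `shap` package, give a more principled per-prediction attribution). The `model` callable interface here is an assumption:

```python
def effect_direction(model, row, feature, delta=1.0):
    """Return +1/-1/0: how the prediction moves when `feature` is nudged up by `delta`."""
    bumped = dict(row)
    bumped[feature] += delta
    diff = model(bumped) - model(row)
    return (diff > 0) - (diff < 0)
```

If the sign disagrees with what domain knowledge predicts, the feature deserves a closer look before it ships.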
❓ Your ML-system monitoring and testing require improvements
❗ Unit tests and end-to-end tests are unable to uncover changes in the external world that affect your model behavior.
✅ Monitor prediction bias, i.e. whether the distribution of predictions matches the distribution of observed labels (a symptom of concept drift).
✅ Add sanity checks, especially for systems allowed to perform actions in the real world.
✅💡 Monitor other useful metrics. Here are the things my team usually monitors in every project: input data distribution, prediction distribution, overall metrics, metrics on selected slices of data, feature importance, and assertions for edge cases (e.g. if all features are zero, we expect the prediction to be zero).
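For distribution monitoring specifically, the Population Stability Index is a common choice. A self-contained sketch (the usual 0.2 alert level is a rule of thumb, not a law; tune it for your case):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a fresh one."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Smooth empty bins to avoid log(0).
        return [(c + 0.5) / (len(xs) + 0.5 * bins) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computing this regularly for both input features and model outputs turns “the distribution changed” from a post-mortem finding into an alert.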
I hope you have found this article helpful! Do not let technical debt cut down the innovation rate of your ML projects!
If you notice an error or are just in the mood to say hello, please contact me via LinkedIn, Telegram or email. Every message counts; your feedback really motivates me to create new content! Also, you are very welcome to subscribe to my Telegram channel: @FuriousAI.