ML in real life
Technical debt in ML systems. Source: towardsdatascience.com
From Sculley, David, et al. “Hidden technical debt in machine learning systems.” Advances in neural information processing systems. 2015.
Code organization - 1
- Separate the data collection tool from the ML pipeline
- Make them two separate projects
- Your tools should have clear input parameters
- e.g., Path to the repository
- The command-line tool should fail immediately, with a clear error message, if input parameters are invalid
- Make config parameters very clear
- A config.py file where people can tune specific configs
- If you use environment variables, document them clearly
- Do not use hard-coded paths
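The points above can be sketched with the standard library's argparse. This is a minimal illustration, not a prescribed layout: the flag names (`--repo`, `--output-dir`) and the environment variable `COLLECTOR_OUTPUT_DIR` are hypothetical names chosen for the example.

```python
import argparse
import os
from pathlib import Path


def parse_args(argv=None):
    """Parse and validate the arguments of a hypothetical data collection tool."""
    parser = argparse.ArgumentParser(description="Collect data from a repository.")
    parser.add_argument("--repo", required=True, type=Path,
                        help="Path to the repository to analyze.")
    # Hypothetical environment variable; documented in the help text
    # so users do not have to read the code to discover it.
    parser.add_argument("--output-dir", type=Path,
                        default=Path(os.environ.get("COLLECTOR_OUTPUT_DIR", "output")),
                        help="Output directory (default: $COLLECTOR_OUTPUT_DIR or ./output).")
    args = parser.parse_args(argv)

    # Fail fast: parser.error() prints the usage message and exits with status 2.
    if not args.repo.is_dir():
        parser.error(f"--repo must be an existing directory: {args.repo}")
    return args


# Example call with an explicit argument list ("." always exists).
args = parse_args(["--repo", "."])
print(f"Collecting from {args.repo} into {args.output_dir}")
```

Nothing is hard-coded: the repository path is a required argument, and the output location can be overridden by a flag or by the documented environment variable.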
Code organization - 2
Write API documentation in the code (e.g., Javadoc comments or Python docstrings)
Add a ‘TEST’ config that runs your entire pipeline on a smaller scale
It should be easy to reproduce your entire pipeline
- Write documentation
- Provide Bash scripts and/or Makefiles
Write assertions to make sure your data is always consistent
- e.g., assert df.count() > 0, "Data frame is empty" (PySpark; in pandas, use len(df) > 0)
- Also assert the expected dimensions after each transformation
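A minimal sketch of a ‘TEST’ config combined with consistency assertions. Plain Python lists stand in for a real data frame, and every name here (`PipelineConfig`, `load_rows`, `add_squared_feature`, the row counts) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineConfig:
    max_rows: Optional[int]  # None means "use all rows"
    epochs: int


FULL = PipelineConfig(max_rows=None, epochs=50)
TEST = PipelineConfig(max_rows=100, epochs=1)  # small, fast end-to-end run


def load_rows(config):
    """Stand-in for real data loading: returns (feature, label) pairs."""
    rows = [(float(i), i % 2) for i in range(10_000)]
    return rows if config.max_rows is None else rows[:config.max_rows]


def add_squared_feature(rows):
    """Example transformation: adds one column and keeps the row count."""
    out = [(x, x * x, y) for x, y in rows]
    # Check dimensions after the transformation.
    assert len(out) == len(rows), "Transformation changed the number of rows"
    assert all(len(r) == 3 for r in out), "Unexpected number of columns"
    return out


def run_pipeline(config):
    rows = load_rows(config)
    assert len(rows) > 0, "Data set is empty"
    return add_squared_feature(rows)


result = run_pipeline(TEST)
print(len(result), "rows,", len(result[0]), "columns")
```

Running with `TEST` exercises the same code path as `FULL`, so a broken step shows up in seconds instead of after a long run.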
Code organization - Brainstorm 1
- Define a list of package/library requirements
- Think of performance!
- Make sure you use the right data types for your data, e.g., smaller types are usually faster and more memory-efficient.
- Free in-memory data as soon as it is no longer needed (e.g., tensors or dataframes, once the model has been trained)
- Use Git, and version both your code and your data.
- Use venv (virtual environments)
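To illustrate the data type and memory points, here is a small sketch using the standard library's array module so that it runs without extra dependencies. With pandas or NumPy the same idea is usually expressed by casting columns with `astype`. The values and sizes are made up for the example.

```python
from array import array

n = 100_000
values = [float(i % 100) for i in range(n)]

as_double = array("d", values)  # C double, typically 8 bytes per item
as_float = array("f", values)   # C float, typically 4 bytes per item

bytes_double = as_double.itemsize * len(as_double)
bytes_float = as_float.itemsize * len(as_float)
print(f"double: {bytes_double} bytes, float: {bytes_float} bytes")

# Drop references that are no longer needed so the memory can be reclaimed.
del values
del as_double
```

The smaller type trades precision for memory, so it is only appropriate when the data does not need the extra precision.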
Code organization - Brainstorm 2
- Store the global random seed in a single variable, to improve reproducibility.
- Use the same seed across different models to make the comparison more ‘fair’
- Think of logging: what do you want to monitor?
- Explore TensorBoard for tracking and visualizing metrics
- Think of how you are going to share your final model
- A Docker image, ready to be used, is a good choice
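A minimal sketch of a global seed plus basic logging, using only the standard library. The function `fake_training_run` is a hypothetical stand-in for real training. Libraries such as NumPy or PyTorch keep their own random state and need to be seeded separately.

```python
import logging
import random

SEED = 42  # single place to change the seed for every experiment

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("experiment")


def set_global_seed(seed):
    """Seed the stdlib RNG; seed any other libraries you use here as well."""
    random.seed(seed)
    log.info("Global seed set to %d", seed)


def fake_training_run(model_name):
    """Hypothetical stand-in for training: returns a pseudo-random score."""
    set_global_seed(SEED)
    score = random.random()
    log.info("model=%s score=%.4f", model_name, score)
    return score


# Same seed, same result: two runs are directly comparable.
first = fake_training_run("model_a")
second = fake_training_run("model_a")
assert first == second
```

Logging the seed together with the metrics makes it possible to tell later which run produced which numbers.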
Code organization - Brainstorm 3
- During development, consider storing intermediate steps.
- That will facilitate debugging and speed up the process.
- Understand the importance of the data you are passing to your model
- Pay attention to “garbage in, garbage out”
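Storing intermediate steps can be as simple as caching each step's result on disk. The sketch below uses JSON files. The helper `cached_step` and the file name are hypothetical, and the approach assumes the step's output is JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


def cached_step(cache_file, compute):
    """Return the cached result if present; otherwise compute and store it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result


def expensive_preprocessing():
    """Stand-in for a slow step such as feature extraction."""
    return [x * x for x in range(5)]


cache_dir = Path(tempfile.mkdtemp())
features = cached_step(cache_dir / "features.json", expensive_preprocessing)
# The second call reads the file instead of recomputing.
features_again = cached_step(cache_dir / "features.json", expensive_preprocessing)
assert features == features_again
```

The cached files can also be opened and inspected directly while debugging. Remember to delete them when the step's code or its input data changes.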