Building your ML pipeline

ML in real life

[Figure: Technical debt in ML systems. Source: towardsdatascience.com]

From Sculley, D., et al. "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 2015.

Code organization - 1

Separate the data collection tool from the ML pipeline
- Make them two separate projects
Your tools should have clear input parameters
- e.g., Path to the repository
- The command-line tool should refuse to run, with a clear error message, if input parameters are wrong
Make config parameters very clear
- A config.py file where people can tune specific configs
- If you use environment variables, document them clearly
- Do not use hard-coded paths
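
The points above about input parameters and configuration can be sketched as follows. This is a minimal illustration, not a prescribed layout: the argument name `repo_path`, the environment variable `PIPELINE_OUTPUT_DIR`, and its default value are all hypothetical.

```python
import argparse
import os
from pathlib import Path

# Hypothetical environment variable, documented in one place.
# Falls back to a relative default instead of a hard-coded absolute path.
OUTPUT_DIR = Path(os.environ.get("PIPELINE_OUTPUT_DIR", "output"))


def existing_dir(value: str) -> Path:
    """argparse type: accept only paths that point to an existing directory."""
    path = Path(value)
    if not path.is_dir():
        raise argparse.ArgumentTypeError(f"not a directory: {value}")
    return path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse command-line arguments; argparse exits with an error if they are invalid."""
    parser = argparse.ArgumentParser(description="Collect data from a repository.")
    parser.add_argument("repo_path", type=existing_dir, help="path to the repository")
    return parser.parse_args(argv)
```

Calling `parse_args()` at the start of the tool stops it with a usage message when the path is missing or is not a directory, so the tool never starts working on bad input.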

Code organization - 2

Write Javadoc/pydoc documentation for your code
Add a ‘TEST’ config which runs your entire pipeline on a smaller scale
It should be easy to reproduce your entire pipeline
- Write documentation
- Provide Bash scripts and/or Makefiles
Write assertions to make sure your data is always consistent
- e.g., assert df.count() > 0, "Data frame is empty" (this works where df.count() returns a row count, as in PySpark; in pandas, use len(df) > 0)
- Also check dimensions after each transformation
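
The consistency checks above can be sketched without any dataframe library. The helper `check_table` and the transformation `add_ratio_column` are hypothetical names used only for illustration; with a real dataframe the same assertions would be written against its row count and shape.

```python
def check_table(rows, expected_columns):
    """Fail loudly if the table is empty or any row has the wrong width."""
    assert len(rows) > 0, "Table is empty"
    for i, row in enumerate(rows):
        assert len(row) == expected_columns, (
            f"Row {i} has {len(row)} columns, expected {expected_columns}"
        )
    return rows


def add_ratio_column(rows):
    """Hypothetical transformation: append a / b as a third column."""
    return [[a, b, a / b] for a, b in rows]


# Check the data before and after the transformation.
raw = check_table([[1.0, 2.0], [3.0, 4.0]], expected_columns=2)
features = check_table(add_ratio_column(raw), expected_columns=3)
```

Note that Python skips `assert` statements when run with the `-O` flag, so checks that must always run in production are better written as explicit `if ...: raise` statements.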

Code organization - Brainstorm 1

Define a list of package/library requirements
- requirements.txt
Think of performance!
- Use the right data types for your data; smaller types are usually faster and more memory-efficient.
- Free in-memory data you no longer need (e.g., the training tensors/dataframes once the model has been trained)
Use Git, and version your code (and your data) properly.
- Use venv (virtual environments)
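
The two performance points above can be illustrated with the standard library alone. This is a rough sketch using the `array` module; with NumPy or pandas the same idea applies through the choice of dtype (for example float32 instead of float64). The variable names are illustrative.

```python
import gc
from array import array

values = [float(i) for i in range(100_000)]

# 'd' stores double-precision floats, 'f' stores single-precision floats.
as_double = array("d", values)
as_single = array("f", values)

bytes_double = as_double.itemsize * len(as_double)
bytes_single = as_single.itemsize * len(as_single)
print(f"double: {bytes_double} bytes, single: {bytes_single} bytes")

# Drop references to data that is no longer needed, then ask the
# garbage collector to reclaim anything left in reference cycles.
del values, as_double
gc.collect()
```

The smaller type trades precision for memory, so it is only the "right" type when the data does not need the extra precision.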

Code organization - Brainstorm 2

Store the global random seed in a single variable to improve reproducibility.
- Use the same seed across different models to make the comparison more ‘fair’
Think of logging: what do you want to monitor?
- Explore the ‘TensorBoard’ library
Think of how you are going to share your final model
- A Docker image, ready to be used, is a good choice
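
The seeding advice above might look like the following. This is a minimal sketch: the value `42` and the function name `set_global_seed` are arbitrary choices, and it only seeds Python's `random` module plus NumPy when NumPy is installed. Deep learning frameworks have their own seeding calls (for example `torch.manual_seed` in PyTorch) that would need to be added for a real pipeline.

```python
import random

SEED = 42  # hypothetical value; any fixed integer works


def set_global_seed(seed: int = SEED) -> None:
    """Seed the random number generators used by the pipeline."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing more to seed here
```

Calling `set_global_seed()` once at the start of every training run, with the same `SEED` for each model being compared, keeps data splits and initial random draws identical across runs.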

Code organization - Brainstorm 3

During development, consider storing intermediate steps.
- That will facilitate debugging and speed up the process.
Understand the importance of the data you are passing to your model
- Pay attention to “garbage in, garbage out”
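
Storing intermediate steps during development, as suggested above, can be as simple as caching each step's result on disk. The helper `cached_step` below is a hypothetical sketch using `pickle`; it assumes the result is picklable and that the cache files come from your own pipeline, since unpickling untrusted files is unsafe.

```python
import pickle
from pathlib import Path


def cached_step(cache_file, compute):
    """Return the cached result if present; otherwise compute and store it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A step is then wrapped as `cached_step("cache/features.pkl", build_features)`, where `build_features` stands for any function of yours that takes no arguments. The cache file has to be deleted whenever the step's code or its input data changes, otherwise stale results are reused.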