Unit-tests for machine learning pipelines

Data Scientists usually write code in notebooks, which is then converted into structured, robust modules. Testing that code is fundamental to minimizing disruptions and deploying new features in a more stable manner. How can Data Scientists write tests for their code efficiently?

Data Scientists usually write code in notebooks, which are an excellent interactive environment for the experimental work of creating a machine learning model but are not usually deployed to production. Instead, the code is typically converted into structured, well-tested and robust modules, and any change to the code base requires some development time before it is integrated into the production pipeline. In this setting, testing your code is fundamental to minimizing disruptions and deploying new features in a more stable manner. The first level of testing for any software is usually unit-tests: tests designed to check that a single piece of code works correctly and produces the desired results. These tests are usually executed automatically and are well known to software developers, but not to Data Scientists. In this talk, we will present guidelines on how a Data Scientist can use PyTest to develop unit-tests for their code, focusing on a standard machine learning pipeline structure (preprocessing, feature engineering, training, predicting). We will start with how to structure code in the experimental phase so that notebooks can be quickly converted into structured modules. We will then go into the details of the general steps of a unit-test and investigate the most common scenarios for a test in a data science pipeline.
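
To give a flavour of what such a unit-test can look like, here is a minimal sketch for the preprocessing stage of a pipeline. It is not taken from the talk: the functions `fill_missing` and `min_max_scale` are hypothetical, plain Python lists stand in for real datasets to keep the example dependency-free, and the arrange / act / assert layout is one common convention, assumed here for illustration.

```python
# Hypothetical preprocessing helpers, invented for this illustration only.
def fill_missing(values, fill_value=0.0):
    """Return a new list where None entries are replaced by fill_value."""
    return [fill_value if v is None else v for v in values]


def min_max_scale(values):
    """Scale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# PyTest collects functions whose names start with "test_".
def test_fill_missing_replaces_none():
    # Arrange: build a small, hand-checkable input.
    raw = [1.0, None, 3.0]
    # Act: run the single piece of code under test.
    result = fill_missing(raw, fill_value=0.0)
    # Assert: compare against the expected output.
    assert result == [1.0, 0.0, 3.0]
    # The input must not be modified in place.
    assert raw == [1.0, None, 3.0]


def test_min_max_scale_bounds():
    scaled = min_max_scale([2.0, 4.0, 6.0])
    assert scaled == [0.0, 0.5, 1.0]


def test_min_max_scale_constant_column():
    # Edge case: avoid a division by zero when all values are equal.
    assert min_max_scale([5.0, 5.0]) == [0.0, 0.0]
```

Saved in a file such as `test_preprocessing.py` (a hypothetical name), these tests can be run with the `pytest` command, which discovers files and functions prefixed with `test_`. Small hand-written inputs keep each test fast and make the expected output easy to verify by eye.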

Friday, May 26

11:00 - 11:30

Alessandro Garavaglia
