Talk

Unit-tests for machine learning pipelines

Friday, May 26

11:00 - 11:30
Room: Lasagna
Language: English
Audience level: Beginner
Elevator pitch

Data Scientists usually write code in notebooks, and that code is later converted into structured, robust modules. Testing this code is therefore fundamental to minimizing disruptions and deploying new features in a more stable manner. How can Data Scientists write tests for their code efficiently?

Abstract

Data Scientists usually write code in notebooks, which are an excellent interactive environment for the experimental work of creating a machine learning model but are not usually deployed to production. Instead, the code is typically converted into structured, well-tested and robust modules, and any change to the code base requires some development time before it is integrated into the production pipeline. In this setting, testing your code is fundamental to minimizing disruptions and deploying new features in a more stable manner. The first level of tests for any software is usually unit tests, which are designed to check that a single piece of code works correctly and produces the desired results. These tests are usually executed automatically and are well known to software developers, but not to Data Scientists. In this talk, we will present guidelines on how a Data Scientist can use PyTest to develop unit tests for their code, focusing on a standard machine learning pipeline structure (preprocessing, feature engineering, training, predicting). We will start with how to structure the code in the experimental phase so that notebooks can be quickly converted into structured modules. We will then go into detail on the general steps of a unit test, and we will investigate the most common scenarios for a test in a data science pipeline.
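As a rough illustration of the general steps of a unit test mentioned in the abstract, the sketch below tests a single preprocessing function. It is not taken from the talk: the function `fill_missing_with_mean` and both test names are hypothetical, and the arrange/act/assert layout is one common way to organize a test, assumed here for illustration. The tests use plain `assert` statements, so PyTest would collect them from any file named `test_*.py` without further setup.

```python
from typing import Optional


def fill_missing_with_mean(values: list[Optional[float]]) -> list[float]:
    """Hypothetical preprocessing step: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    # Build a new list so the caller's input is left untouched.
    return [mean if v is None else v for v in values]


def test_fill_missing_with_mean_replaces_none():
    # Arrange: a tiny, hand-written input whose expected output is easy to verify.
    raw = [1.0, None, 3.0]

    # Act: call only the unit under test.
    result = fill_missing_with_mean(raw)

    # Assert: check the result and that the input was not mutated.
    assert result == [1.0, 2.0, 3.0]
    assert raw == [1.0, None, 3.0]


def test_fill_missing_with_mean_rejects_empty_column():
    # Edge case: a column with no observed values should fail loudly.
    try:
        fill_missing_with_mean([None, None])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an all-missing column")
```

With PyTest installed, the second test could be written more compactly with the `pytest.raises` context manager; the `try`/`except` form is used here only to keep the sketch free of third-party imports.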

Tags: Machine-Learning, Test Driven Development (TDD)

Alessandro Garavaglia

My name is Alessandro Garavaglia. I have a Ph.D. in Applied Mathematics from Eindhoven University of Technology, with a dissertation about stochastic models for complex networks. I have worked in consulting both as a Data Scientist and as a Cloud Developer, with a focus on DevOps and MLOps development. I currently work as a Senior Machine Learning Engineer, as well as architect and administrator of a Data Science Platform.