During this talk, we will introduce and showcase great_expectations, a Python library to write unit tests and documentation for robust data pipelines.
We will explain the basic concepts behind the library with some practical coding examples. We will also discuss what it means to make the creation of data tests more efficient through data profiling, sharing the experience gained during the development of the StructuredDataProfiling library.
But why do we need to create data tests and data docs? Any modern software application heavily relies on data pipelines. These data pipelines usually undergo several transformations defined and implemented by different stakeholders.
Such complexity usually translates into technical debt, meaning it’s bound to cause technical headaches as projects scale up. How can we prevent this from happening? Practically, we can apply the same concepts we use in software development (such as CI/CD) to data pipelines. A big chunk of this work translates to applying two essential software practices: version control and automated testing. The former ensures that data pipelines are reproducible and that we can revert to the most recent working pipeline if something goes wrong. The latter ensures that we can continuously test data pipelines.
In this context, great_expectations’ objective is to define an open standard for data quality compatible with existing CI/CD pipelines. Expectations are declarative statements that a computer can evaluate and that are semantically meaningful to humans. An expectation could be, for example, ‘the sum of columns a and b should be equal to one’ or ‘the values in column c should be non-negative.’
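To make the idea concrete, here is a minimal, self-contained sketch of what a declarative, machine-evaluable expectation looks like. This is a toy illustration of the concept only, not great_expectations’ actual API; the function names are hypothetical.

```python
# Toy sketch of declarative expectations (hypothetical helpers,
# NOT the great_expectations API): each check is readable by humans
# and evaluable by a machine over tabular data (a list of row dicts).

def expect_column_values_to_be_non_negative(rows, column):
    """True if every value in `column` is >= 0."""
    return all(row[column] >= 0 for row in rows)

def expect_column_sum_to_equal(rows, columns, total, tol=1e-9):
    """True if, for every row, the listed columns sum to `total`."""
    return all(abs(sum(row[c] for c in columns) - total) <= tol
               for row in rows)

data = [
    {"a": 0.3, "b": 0.7, "c": 5},
    {"a": 0.5, "b": 0.5, "c": 0},
]

print(expect_column_values_to_be_non_negative(data, "c"))  # True
print(expect_column_sum_to_equal(data, ["a", "b"], 1.0))   # True
```

Such checks can run at any step of a pipeline, so a failing expectation pinpoints where the data stopped meeting its contract.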
Defining such expectations at different steps of a data pipeline lets us test it more efficiently while maintaining high-quality data documentation. The developers’ vision is to make great_expectations a reference standard for data quality.
In this talk, we will present some practical examples of data expectations, and we will also explore automatic data profiling, i.e., automatically discovering the expectations that characterize our data so that data quality can be measured consistently and quantitatively. We will finally discuss how we gathered several heuristic checks for data profiling and packaged them into an open-source project, StructuredDataProfiling.
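The profiling idea can be sketched in a few lines: scan a column’s observed values and propose candidate expectations from simple heuristics. The helper below is hypothetical and deliberately simplistic, not the StructuredDataProfiling API.

```python
# Toy sketch of heuristic data profiling (hypothetical helper, NOT the
# StructuredDataProfiling API): inspect observed values and propose
# candidate expectations as human-readable statements.

def profile_column(values):
    """Propose simple candidate expectations from observed values."""
    candidates = []
    if all(v is not None for v in values):
        candidates.append("values should not be null")
    numeric = [v for v in values if isinstance(v, (int, float))]
    if numeric and len(numeric) == len(values):
        # Observed range becomes a candidate bounds check.
        candidates.append(
            f"values should be between {min(numeric)} and {max(numeric)}"
        )
        if min(numeric) >= 0:
            candidates.append("values should be non-negative")
    return candidates

print(profile_column([3, 0, 7, 2]))
```

A real profiler would of course use many more heuristics (cardinality, uniqueness, inter-column relations), but the principle is the same: turn observed regularities into proposed tests that a human can then accept or refine.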
Luca is one of the co-founders and the CTO of Clearbox AI, a startup offering synthetic data solutions. He holds a PhD in computational mathematics from the Delft University of Technology and worked for several years as a scientific software developer in the Netherlands, serving clients in safety-critical industries in the United States, South Africa, and Germany. In his free time, he likes to hike and experiment with hydroponic farming.