This repository is an example of how the Great Expectations library can be integrated into a scikit-learn machine learning pipeline to ensure that data inputs, transformed data, and even model predictions conform to an expected standard.
- Background
- Project Folder Structure
- scikit-learn Pipeline
- Creating Expectations
- Running Your Analysis
- Demo Scenarios
## Background

In this example, assume that we want to build a machine learning pipeline to estimate the weight of two species of birds. Along with `species` we have other attributes such as `color`, `beak_ratio`, `claw_length`, and `wing_density`.
The raw data is available at `data/raw-data.csv`.

Please note that this is fake data that was generated using the script `lib/datagenerator.py`. The purpose of using a fake dataset was to have a small reproducible example.
The examples below assume you have installed Python along with the libraries `numpy`, `pandas`, `scikit-learn`, and `great_expectations`.
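If you are starting from scratch, the dependencies can typically be installed with `pip` (note that the API shown in this README corresponds to an older Great Expectations release, so you may need to pin an older version):

```
pip install numpy pandas scikit-learn great_expectations
```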
## Project Folder Structure

This project contains five folders, one README, and one file entitled `main.py`. In this structure we have a `data` folder and an `output` folder to physically separate the inputs and outputs.
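Sketched out, the layout looks roughly like this (an illustration based on the folders described here, not a verbatim listing):

```
.
├── data/                 # raw input data (data/raw-data.csv)
├── output/               # artifacts written by the analysis
├── lib/                  # global settings and helper functions
├── great_expectations/   # pipeline test configuration and artifacts
├── scenarios/            # artifacts that create the demo scenarios
├── main.py               # runs the entire analysis
└── README.md             # this file
```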
There are three other folders in the structure:

- `lib`: Folder for Python scripts that hold global settings and functions
- `great_expectations`: Folder for configurations and artifacts of pipeline tests
- `scenarios`: Folder for holding artifacts that create demo scenarios
You should also notice at the top level of the project folder there are two files:

- `main.py`: A single Python file that runs the main analysis for the project
- `README.md`: A file that explains what this project is about
Using this folder configuration we have a single Python file (`main.py`) that runs our entire analysis but interacts with the other folders to read and write artifacts of the analysis. The script contains code that validates data during the run against expectations to ensure the integrity of the analysis.
## scikit-learn Pipeline

In scikit-learn there is functionality to take multiple "transformers" and chain them together to preprocess data and model it. The `main.py` file is where these transformers preprocess the raw data from `data/raw-data.csv`.
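As a rough illustration (not the exact code in `main.py`), such a pipeline over the bird dataset might look like the sketch below; the column names come from the Background section, but the specific transformers and parameters are assumptions:

```python
# A sketch of a scikit-learn pipeline for the bird-weight problem.
# The transformers and parameters here are illustrative assumptions,
# not the exact configuration used in main.py.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

df = pd.read_csv("data/raw-data.csv")
X, y = df.drop(columns=["weight"]), df["weight"]

preprocess = ColumnTransformer([
    # One-hot encode the categorical columns
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["species", "color"]),
    # Bin the numeric columns into quantile-based buckets
    ("binned", KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile"),
     ["beak_ratio", "claw_length", "wing_density"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(X, y)
```

Chaining the preprocessing and the regressor in a single `Pipeline` gives one `fit`/`predict` interface for the whole analysis.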
Here we will introduce the Great Expectations (`ge`) library in three key areas to validate assumptions about data in our pipeline:

- Use `ge` to check the raw data
- Use `ge` to check the data after preprocessing (which is right before modeling)
- Use `ge` to check the difference between actuals and predictions on a holdout dataset
First, it is pretty obvious that Great Expectations can be used to validate the raw input data. The phrase "garbage in, garbage out" applies here: it does not matter how good our pipeline is if we put bad data into it, so we prevent that from happening.
Second, checking the preprocessed data is also important. It ensures that all of your transformers behaved as expected, catching cases where they encode data incorrectly or have changed in the environment you load them from. For example, another analyst could tweak a transformer parameter in `main.py`, which runs both the analysis and the pipeline tests; if the tests are written correctly, they should catch the change in transformer parameters. This is extremely helpful in ensuring the data you are modeling is what you expect.
Third, we check the model errors on a holdout set. This is also important in ensuring that there are no extremely large errors caused by a drift in the inputs, outliers, or a decline in model performance.
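As a hedged sketch of that third check, using Great Expectations' classic Pandas API (the file path is an assumption based on the `holdout-error-data` DataAsset described later; the bounds mirror the expectation shown in the demo scenarios below):

```python
# Sketch of validating holdout prediction errors with Great Expectations.
# The file path is an assumption about where the holdout error data is
# written; the -100/100 bounds mirror the demo-scenario expectation.
import great_expectations as ge

errors_ge = ge.read_csv("output/holdout-error-data.csv")

# Expect every prediction error to fall within +/- 100
result = errors_ge.expect_column_values_to_be_between(
    "error", min_value=-100, max_value=100
)
assert result["success"]
```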
## Creating Expectations

To implement the three types of validation described above, we started by initializing a Great Expectations project with the command below in the terminal.
NOTE: If you cloned this repo and are following along, you can delete the `great_expectations` folder and follow the instructions below. We assume you are setting this up in a project where Great Expectations has not already been initialized.
```
great_expectations init
```
In the initialization prompts we declined to add a DataSource. By declining the DataSource configuration we were able to quickly and consistently set up multiple DataSources by running the script `set-datasources.py`.
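For context, a script like `set-datasources.py` might register the two Pandas datasources roughly as follows. The `add_datasource` keyword arguments follow the 0.8.x-era Great Expectations API and vary across versions, so treat this as an assumption rather than the repository's actual code (the datasource names come from the listing output shown below):

```python
# Hedged sketch of registering the pandas datasources programmatically.
# The exact add_datasource signature depends on your Great Expectations
# version; the directory paths are assumptions.
import great_expectations as ge

context = ge.data_context.DataContext()

# One datasource for raw inputs, one for pipeline outputs
context.add_datasource("data__dir", class_name="PandasDatasource",
                       base_directory="./data")
context.add_datasource("output__dir", class_name="PandasDatasource",
                       base_directory="./output")
```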
You can always check which DataAssets are available from your DataSources by running the commands below in Python. This should show the three DataAssets from the `.csv` files: 1) `raw-data`, 2) `modeling-data`, and 3) `holdout-error-data`.
```python
import great_expectations as ge
import great_expectations.jupyter_ux

# Load the project's DataContext and list every available DataAsset
context = ge.data_context.DataContext()
ge.jupyter_ux.list_available_data_asset_names(context)
```
```
data_source: data__dir (pandas)
  generator_name: default (subdir_reader)
    generator_asset: raw-data
data_source: output__dir (pandas)
  generator_name: default (subdir_reader)
    generator_asset: modeling-data
    generator_asset: holdout-error-data
```
The next step is creating expectations for these three DataAssets. The expectations were created by the following scripts:

- Raw data: `./great_expectations/notebooks/create-raw-data-expectations.py`
- Modeling data: `./great_expectations/notebooks/create-modeling-data-expectations.py`
- Holdout error data: `./great_expectations/notebooks/create-holdout-error-data-expectations.py`
All of the expectation creation scripts follow a similar pattern: we first add the BasicProfiler suite as an expectation suite and then create our own expectations as the "default" suite for the DataAsset. Those default expectations are created by loading the data from the folder as a Batch; this only needs to be done once up front, or again whenever you update the expectations.
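As a rough illustration of that pattern, creating a default suite for the raw data might look like the sketch below. This uses the classic Pandas API (`ge.read_csv`) for brevity, whereas the actual scripts work through the project's DataContext; the expectations mirror ones shown in the demo scenarios, and the output filename is an assumption:

```python
# Hedged sketch of creating a "default" expectation suite for the raw data.
# The expectations mirror those shown in the missing-column demo scenario;
# the output path is an assumption, not the project's actual layout.
import great_expectations as ge

batch = ge.read_csv("data/raw-data.csv")

batch.expect_table_columns_to_match_ordered_list(
    ["species", "color", "beak_ratio", "claw_length", "wing_density", "weight"]
)
batch.expect_column_values_to_not_be_null("species")
batch.expect_column_values_to_be_in_set("species", ["avis", "ales"])

# Persist the expectations so future runs can validate against them
batch.save_expectation_suite("raw-data-default.json")
```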
The choice to use `.py` scripts instead of notebooks is purely a matter of personal preference. The scripts are stored in the notebooks folder `./great_expectations/notebooks` for reference, just like a notebook would be.
## Running Your Analysis

In the sections above we described the input data and how we created expectations to validate data at three different points of the analysis (before, during, and after modeling). The script `main.py` holds the full end-to-end analysis.
When running the script you should see something like:

```
$ python main.py
Successfully validated raw data.
Successfully validated modeling data.
Successfully validated holdout error data.
```
In this script there are sections which validate the data against the created expectations. At the end of each section there is an `assert` check that the validation run was successful. The script will raise an `AssertionError` if any of the expectations are not met.

If the run was successful you should see the validations of the run stored in `uncommitted/validations` of your Great Expectations folder.
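That validate-then-assert pattern looks roughly like the sketch below. The real script loads the saved "default" suite through the DataContext; here the checks are re-created inline for brevity (the variable name matches the traceback shown in the demo scenarios):

```python
# Hedged sketch of the validate-then-assert pattern used in main.py.
# main.py loads the saved suite through the DataContext; here the
# expectations are attached inline for a self-contained example.
import great_expectations as ge

batch = ge.read_csv("data/raw-data.csv")
batch.expect_column_values_to_not_be_null("species")
batch.expect_column_values_to_be_in_set("species", ["avis", "ales"])

# Run all attached expectations against the batch
validation_result_raw_dat = batch.validate()

# Halt the pipeline immediately if any expectation failed
assert validation_result_raw_dat["success"]
print("Successfully validated raw data.")
```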
## Demo Scenarios

The `main.py` script supports three scenarios to demonstrate how Great Expectations can identify changes in your pipeline. The first scenario is an example where the raw data is not what you expect because it is missing a column. You can run that scenario like this:
```
$ python main.py missing-column
The following raw data expectations failed:
{'expectation_type': 'expect_table_columns_to_match_ordered_list', 'kwargs': {'column_list': ['species', 'color', 'beak_ratio', 'claw_length', 'wing_density', 'weight']}}
{'expectation_type': 'expect_column_values_to_be_of_type', 'kwargs': {'column': 'species', 'type_': 'str'}}
{'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'species'}}
{'expectation_type': 'expect_column_values_to_be_in_set', 'kwargs': {'column': 'species', 'value_set': ['avis', 'ales']}}
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    assert validation_result_raw_dat["success"]
AssertionError
```
The second scenario is one where a pickled scikit-learn transformer is loaded. The transformer is supposed to have been created with 2 quantile bins; however, it was actually created with 4 quantile bins, so the preprocessed data takes values [0.0, 1.0, 2.0, 3.0] instead of the expected value set [0.0, 1.0].
```
$ python main.py different-transformer
Successfully validated raw data.
The following modeling data expectations failed:
{'expectation_type': 'expect_column_values_to_be_in_set', 'kwargs': {'column': 'V4', 'value_set': [0.0, 1.0]}}
Traceback (most recent call last):
  File "main.py", line 139, in <module>
    assert validation_result_modeling_dat["success"]
AssertionError
```
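To see where those value sets come from, here is a small self-contained illustration of how the number of quantile bins changes the encoded values (using scikit-learn's `KBinsDiscretizer` as an assumption about which transformer the pipeline relies on):

```python
# Illustration: the number of quantile bins determines the value set
# of the encoded column. KBinsDiscretizer is an assumption about the
# transformer used in the pipeline.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.default_rng(0).normal(size=(100, 1))

two_bins = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
four_bins = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")

print(np.unique(two_bins.fit_transform(X)))   # [0. 1.]
print(np.unique(four_bins.fit_transform(X)))  # [0. 1. 2. 3.]
```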
In the third scenario we make a small change to the holdout data, turning one observation into an extremely large outlier (999.99), which results in a prediction error of more than 100 and violates one of the expectations.
```
$ python main.py holdout-outlier
Successfully validated raw data.
Successfully validated modeling data.
The following holdout error data expectations failed:
{'expectation_type': 'expect_column_values_to_be_between', 'kwargs': {'column': 'error', 'min_value': -100, 'max_value': 100}}
Traceback (most recent call last):
  File "main.py", line 194, in <module>
    assert validation_result_holdout_error_dat["success"]
AssertionError
```