Data-Centric What-If Analysis for Native Machine Learning Pipelines.
This project uses the mlinspect project as a foundation, mainly for its plan extraction from native ML pipelines.
Prerequisite: Python 3.9
- Clone this repository (optionally with Git LFS, to also download the datasets for the scalability experiment)
- Set up the environment
cd mlwhatif
python -m venv venv
source venv/bin/activate
- If you want to use the visualisation functions we provide, install graphviz, which cannot be installed via pip
Linux:
apt-get install graphviz
macOS:
brew install graphviz
- Install pip dependencies
SETUPTOOLS_USE_DISTUTILS=stdlib pip install -e ."[dev]"
- To ensure everything works, you can run the tests (without graphviz, the visualisation test will fail)
python setup.py test
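Before installing, it can help to confirm the two prerequisites named in the steps above. The following is a minimal sketch and not part of mlwhatif: the helper name check_prerequisites is hypothetical, and it assumes that a graphviz installation puts its usual dot executable on the PATH.

```python
import shutil
import sys


def check_prerequisites():
    """Report whether the interpreter and graphviz look usable (hypothetical helper)."""
    return {
        # The setup instructions name Python 3.9 as the prerequisite.
        "python_3_9": sys.version_info[:2] == (3, 9),
        # Assumption: a graphviz install exposes the `dot` binary on the PATH.
        "graphviz_dot": shutil.which("dot") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If graphviz_dot is reported as missing, the installation itself may still succeed, but the visualisation test mentioned above is expected to fail.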
mlwhatif makes it easy to analyze your pipeline and automatically run what-if analyses.
from mlwhatif import PipelineAnalyzer
from mlwhatif.analysis import DataCleaning, ErrorType

IPYNB_PATH = ...

cleanlearn = DataCleaning({'category': ErrorType.CAT_MISSING_VALUES,
                           'vine': ErrorType.CAT_MISSING_VALUES,
                           'star_rating': ErrorType.NUM_MISSING_VALUES,
                           'total_votes': ErrorType.OUTLIERS,
                           'review_id': ErrorType.DUPLICATES,
                           None: ErrorType.MISLABEL
                           })

analysis_result = PipelineAnalyzer \
    .on_pipeline_from_ipynb_file(IPYNB_PATH) \
    .add_what_if_analysis(cleanlearn) \
    .execute()

cleanlearn_report = analysis_result.analysis_to_result_reports[cleanlearn]
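To make the idea behind a data-cleaning what-if analysis concrete, here is a small conceptual sketch using only the standard library. It does not use or reproduce mlwhatif: the toy ratings, the stand-in pipeline function, and the two cleaning variants are all made up for illustration, and the sketch simply re-runs the stand-in once per variant, whereas mlwhatif works on plans extracted from the real pipeline.

```python
from statistics import mean, median

# Made-up toy data; None marks a missing star rating.
ratings = [5, 4, None, 1, None, 5, 3]


def pipeline(values):
    """Stand-in for a pipeline: returns a single score (here, the mean rating)."""
    return mean(values)


def drop_missing(values):
    """Cleaning variant 1: remove records with a missing value."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Cleaning variant 2: replace missing values with the median of the rest."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


# Each what-if variant cleans the data, then re-runs the same stand-in pipeline.
variants = {"drop_missing": drop_missing, "impute_median": impute_median}
report = {name: pipeline(clean(ratings)) for name, clean in variants.items()}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

The resulting report maps each cleaning variant to the score it produces, which is the kind of side-by-side comparison a what-if analysis is meant to give without rewriting the pipeline by hand for every variant.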
We prepared a demo notebook to showcase mlwhatif and its features.
- For debugging in PyCharm, set the pytest flag --no-cov
- If you want to see log output in PyCharm, you can also set the pytest flags --log-cli-level=10 -s. The -s flag is needed because otherwise pytest breaks the stdout capturing.
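The flags from the notes above can also be assembled in a small script, for example to launch the tests outside PyCharm. This is only a sketch: build_pytest_command is a hypothetical helper, and it assumes pytest and the pytest-cov plugin (which provides --no-cov) are installed in the active environment.

```python
import logging
import sys


def build_pytest_command(show_logs=False):
    """Assemble a pytest invocation with the debugging flags (hypothetical helper)."""
    command = [sys.executable, "-m", "pytest", "--no-cov"]
    if show_logs:
        # logging.DEBUG has the numeric value 10, matching --log-cli-level=10.
        command += [f"--log-cli-level={logging.DEBUG}", "-s"]
    return command


if __name__ == "__main__":
    print(" ".join(build_pytest_command(show_logs=True)))
```

The returned list can be passed directly to subprocess.run to execute the tests with the same flags.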
- Stefan Grafberger, Shubha Guha, Paul Groth, Sebastian Schelter (2023). mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses Over and Over? VLDB (demo).
- Stefan Grafberger, Paul Groth, Sebastian Schelter (2023). Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines. ACM SIGMOD.
- Stefan Grafberger, Paul Groth, Sebastian Schelter (2022). Towards Data-Centric What-If Analysis for Native Machine Learning Pipelines. Data Management for End-to-End Machine Learning workshop at ACM SIGMOD.
This library is licensed under the Apache 2.0 License.