Awesome open data-centric AI

Open source tooling for data-centric AI on unstructured data

Data-centric AI (DCAI) is a development paradigm for ML-based solutions. The term was coined by Andrew Ng, who gave the following definition:

Data-centric AI is the practice of systematically engineering the data used to build AI systems.

At Renumics, we believe DCAI is an important puzzle piece for building real-world AI systems that generate value. We like the following definition:

Data-centric AI means systematically and iteratively improving training datasets by leveraging information from trained ML models.
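
To make this definition concrete, here is a minimal sketch of one such iteration: using a trained model's predicted probabilities to flag samples whose given label looks implausible, so they can be reviewed first. The function name `flag_suspect_labels`, the threshold of 0.2, and the toy data are assumptions made for illustration. This is not the API of any tool listed below.

```python
from typing import List, Sequence


def flag_suspect_labels(
    labels: Sequence[int],
    pred_probs: Sequence[Sequence[float]],
    threshold: float = 0.2,
) -> List[int]:
    """Return indices of samples whose given label the model finds unlikely.

    pred_probs[i][k] is the (ideally out-of-sample) probability that the
    model assigns to class k for sample i. The threshold is a hypothetical
    default, not a recommended value.
    """
    suspects = []
    for index, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Low model confidence in the given label hints at a labeling error.
        if probs[label] < threshold:
            suspects.append(index)
    # Put the least plausible labels first so they are reviewed first.
    suspects.sort(key=lambda i: pred_probs[i][labels[i]])
    return suspects


# Hypothetical toy data: three classes, five samples.
labels = [0, 1, 2, 1, 0]
pred_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],  # labeled 2, but the model strongly prefers class 0
    [0.15, 0.70, 0.15],
    [0.10, 0.75, 0.15],  # labeled 0, but the model prefers class 1
]
suspect_indices = flag_suspect_labels(labels, pred_probs)
print(suspect_indices)
```

In a real workflow the flagged samples would be inspected and corrected, the model retrained, and the loop repeated.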

Tools that can be efficiently used in day-to-day applications are the most important ingredient for the DCAI paradigm. This curated link collection is intended to help you discover useful open source tools for your data-centric AI workflows.

🔎 Scope

This collection includes tools that have an open-source license and are actively maintained. All tools mentioned are useful for building DCAI workflows on unstructured data (e.g. images, audio, video, time-series, text).

To keep a useful focus and avoid duplicate work, we exclude the following topics:

DCAI tools for tabular data. There is an awesome list for that maintained by the Ydata team.
Labeling tools. Although labeling is part of the DCAI workflow, we refer to the awesome list of the ZenML team on that topic.
MLOps tooling. There are many gray areas between MLOps and DCAI, and some distinctions have yet to be made. We exclude all topics that are clearly outside the DCAI scope (e.g. AutoML, serving, orchestration).

👐 Contributing

Do you think something is missing? Please help improve this list by contacting us or opening a pull request.

Data versioning

Logo	Name	Description	Popularity	License
	Data version control (DVC)	Data Version Control or DVC is a command line tool and VS Code Extension to help you develop reproducible machine learning projects.
	deeplake	Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets.
	Pachyderm	Pachyderm – Automate data transformations with data versioning and lineage.
	Git Large File Storage	Git LFS is a command line extension and specification for managing large files with Git.
	lakeFS	lakeFS is an open-source tool that transforms your object storage into a Git-like repository.

Embeddings and pre-trained models

Logo	Name	Description	Popularity	License
	towhee	Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
	TensorFlow Hub	TensorFlow Hub is a repository of reusable assets for machine learning with TensorFlow.
	Hugging Face Transformers	State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
	Lightly	Lightly is a computer vision framework for self-supervised learning.

Visualization and Interaction

Logo	Name	Description	Popularity	License
	Renumics Spotlight	Curation tool for unstructured data that connects your stack to the data-centric AI ecosystem.
	FiftyOne	The open-source tool for building high-quality datasets and computer vision models.
	refinery	The data scientist's open-source choice to scale, assess and maintain natural language data.
	Argilla	Argilla helps domain experts and data teams to build better NLP datasets in less time.
	Xtreme1	Xtreme1 is the world's first open-source platform for multisensory training data.
	Holmes Extractor	Holmes supports a number of use cases involving information extraction from English and German texts.

Outlier and noise detection

Logo	Name	Description
	Cleanlab	Cleanlab facilitates machine learning with messy, real-world data by providing clean labels for robust training and flagging errors in your data.
	PyOD	A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)
	TODS	A full-stack automated time-series outlier detection system.
	Alibi Detect	Algorithms for outlier, adversarial and drift detection.

Explainability

Logo	Name	Description
	SHAP	A game theoretic approach to explain the output of any machine learning model.
	Alibi	Alibi is an open source Python library aimed at machine learning model inspection and interpretation.
	LIME	Explaining the predictions of any machine learning classifier.
	Captum	Model interpretability and understanding for PyTorch.

Active learning

Logo	Name	Description	Popularity	License
	modAL	A modular active learning framework for Python.
	Bayesian Active Learning (Baal)	Library to enable Bayesian active learning in your research or labeling work.
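
For readers new to the idea, the following is a minimal sketch of uncertainty sampling, one common active learning strategy: the samples on which the model is least certain are sent for labeling first. The names `predictive_entropy` and `select_for_labeling` and the toy probabilities are assumptions for illustration. The code does not use the modAL or Baal APIs.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(
    pred_probs: Sequence[Sequence[float]], batch_size: int
) -> List[int]:
    """Return the indices of the `batch_size` most uncertain samples."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: predictive_entropy(pred_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabeled samples (binary task).
unlabeled_probs = [
    [0.99, 0.01],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
]
query_indices = select_for_labeling(unlabeled_probs, batch_size=2)
print(query_indices)
```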

Uncertainty quantification

Logo	Name	Description	Popularity	License
	Uncertainty Toolbox	A Python toolbox for predictive uncertainty quantification, calibration, metrics, and visualization.
	MAPIE	A scikit-learn-compatible module for estimating prediction intervals.
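
As a rough illustration of what estimating a prediction interval can look like, here is a minimal sketch of split conformal prediction for regression: absolute residuals on a held-out calibration set determine a symmetric interval around new predictions. The function name `conformal_half_width`, the miscoverage level `alpha=0.2`, and the calibration data are assumptions for illustration. This is not the API of MAPIE or Uncertainty Toolbox.

```python
import math
from typing import Sequence


def conformal_half_width(
    y_true: Sequence[float], y_pred: Sequence[float], alpha: float = 0.1
) -> float:
    """Half-width of a split-conformal interval from a calibration set."""
    scores = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[rank - 1]


# Hypothetical calibration targets and model predictions.
calib_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
calib_pred = [1.5, 2.25, 2.0, 4.75, 5.0, 4.5, 7.25, 10.0, 8.5]

half_width = conformal_half_width(calib_true, calib_pred, alpha=0.2)

# Interval around a new (hypothetical) point prediction of 10.0.
new_prediction = 10.0
interval = (new_prediction - half_width, new_prediction + half_width)
print(half_width, interval)
```

The coverage guarantee of this construction relies on the calibration data and the new sample being exchangeable, which should be checked for the data at hand.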

Bias and fairness

Logo	Name	Description	Popularity	License
	AIF360	The AI Fairness 360 toolkit helps to detect and mitigate bias in machine learning models throughout the AI application lifecycle.
	Fairlearn	A Python package to assess and improve fairness of machine learning models.

Drift detection

Logo	Name	Description	Popularity	License
	Deepchecks	Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.
	Evidently	An open-source framework to evaluate, test and monitor ML models in production.
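
To illustrate the underlying idea, the following minimal sketch compares a reference sample of one numeric feature against a current sample using the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The function name `ks_statistic`, the threshold `DRIFT_THRESHOLD`, and the sample values are assumptions for illustration. This is not the API of Deepchecks or Evidently.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The gap only changes at observed values, so checking those suffices.
    for value in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Hypothetical feature values at training time and in production.
reference_sample = [0.1, 0.2, 0.3, 0.4]
current_sample = [0.3, 0.4, 0.5, 0.6]

DRIFT_THRESHOLD = 0.3  # hypothetical cut-off, to be tuned per use case

statistic = ks_statistic(reference_sample, current_sample)
drift_detected = statistic > DRIFT_THRESHOLD
print(statistic, drift_detected)
```

A fixed threshold on the statistic is a simplification. A significance test that accounts for the sample sizes is usually preferable in practice.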

Augmentation and synthetic data

Logo	Name	Description	Popularity	License
	Albumentations	Fast image augmentation library and an easy-to-use wrapper around other libraries.
	Gretel Synthetics	Synthetic data generators for structured and unstructured text, featuring differentially private learning.
	SDV	Synthetic Data Generation for tabular, relational and time series data.

Adversarial Robustness

Logo	Name	Description	Popularity	License
	CleverHans	An adversarial example library for constructing attacks, building defenses, and benchmarking both.
	Adversarial Robustness Toolbox	Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams.
	Foolbox	Foolbox is a Python library that lets you easily run adversarial attacks against machine learning models like deep neural networks.
