Awesome Dataset Distillation

Awesome Dataset Distillation provides the most comprehensive and detailed information on the Dataset Distillation field.

Dataset distillation is the task of synthesizing a small dataset such that models trained on it achieve high performance on the original large dataset. A dataset distillation algorithm takes as input a large real dataset to be distilled (the training set) and outputs a small synthetic distilled dataset. The distilled dataset is evaluated by training models on it and testing them on a separate real dataset (the validation/test set). A good small distilled dataset is not only useful for dataset understanding, but also has various applications (e.g., continual learning, privacy, and neural architecture search). This task was first introduced in the paper Dataset Distillation [Tongzhou Wang et al., '18], along with a proposed algorithm using backpropagation through optimization steps. The task was then first extended to real-world datasets in the paper Medical Dataset Distillation [Guang Li et al., '19], which also explored the privacy-preservation possibilities of dataset distillation. The paper Dataset Condensation [Bo Zhao et al., '20] first introduced gradient matching, which greatly promoted the development of the dataset distillation field.
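The evaluation protocol described above (distill the training set, train on the distilled set, test on held-out real data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy 2-D data, the helper names, and in particular `distill_class_means`, which is a naive stand-in that keeps one mean point per class and is not any of the distillation algorithms from the papers listed here.

```python
import random


def make_blobs(n_per_class, centers, noise, rng):
    """Toy 2-D classification data (hypothetical stand-in for a real dataset)."""
    data = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            point = (cx + rng.gauss(0, noise), cy + rng.gauss(0, noise))
            data.append((point, label))
    return data


def distill_class_means(train_set):
    """Naive stand-in for a distillation algorithm: one synthetic point per class."""
    sums = {}
    for (x, y), label in train_set:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return [((sx / n, sy / n), label) for label, (sx, sy, n) in sorted(sums.items())]


def train_1nn(dataset):
    """'Train' a 1-nearest-neighbour classifier, which simply memorises its training set."""
    def predict(point):
        px, py = point
        nearest = min(dataset, key=lambda item: (item[0][0] - px) ** 2 + (item[0][1] - py) ** 2)
        return nearest[1]
    return predict


def accuracy(model, dataset):
    """Fraction of examples in `dataset` that `model` labels correctly."""
    return sum(model(point) == label for point, label in dataset) / len(dataset)


rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
train_set = make_blobs(200, centers, 1.0, rng)  # large real training set
test_set = make_blobs(100, centers, 1.0, rng)   # separate real test set

distilled = distill_class_means(train_set)      # small synthetic dataset
model = train_1nn(distilled)                    # train only on the distilled data
test_acc = accuracy(model, test_set)            # evaluate on real data

print(f"real: {len(train_set)} points, distilled: {len(distilled)} points, test accuracy: {test_acc:.3f}")
```

The point of the sketch is the protocol rather than the method: the model never sees the real training set, and the distilled set is judged only by accuracy on held-out real data.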

In recent years (2022-now), dataset distillation has gained increasing attention in the research community, across many institutes and labs, and more papers are being published each year. This research has steadily improved dataset distillation and explored its many variants and applications.

This project is curated and maintained by Guang Li, Bo Zhao, and Tongzhou Wang.

How to submit a pull request?

🌐 Project Page
Code
📖 bibtex

Latest Updates

[2024/11/17] BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation (Zheng Zhou et al., 2024) 🌐 📖
[2024/11/10] Fetch and Forge: Efficient Dataset Condensation for Object Detection (Ding Qi et al., NeurIPS 2024) 📖
[2024/11/10] Color-Oriented Redundancy Reduction in Dataset Distillation (Bowen Yuan et al., NeurIPS 2024) 📖
[2024/11/10] Provable and Efficient Dataset Distillation for Kernel Ridge Regression (Yilan Chen et al., NeurIPS 2024) 📖
[2024/11/10] Less is More: Efficient Time Series Dataset Condensation via Two-fold Modal Matching (Hao Miao et al., VLDB 2025) 📖
[2024/10/24] Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios (Kai Wang & Zekai Li et al., 2024) 📖
[2024/10/24] Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation? (Lingao Xiao et al., NeurIPS 2024) 📖
[2024/10/17] A Label is Worth a Thousand Images in Dataset Distillation (Tian Qin et al., NeurIPS 2024) 📖
[2024/09/27] Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment (Jiawei Du et al., NeurIPS 2024) 📖
[2024/09/27] Towards Model-Agnostic Dataset Condensation by Heterogeneous Models (Jun-Yeong Moon et al., ECCV 2024) 📖

Main

Dataset Distillation (Tongzhou Wang et al., 2018) 🌐 📖

Dataset Quantization

Dataset Quantization (Daquan Zhou & Kai Wang & Jianyang Gu et al., ICCV 2023) 📖
Dataset Quantization with Active Learning based Adaptive Sampling (Zhenghao Zhao et al., ECCV 2024) 📖

Decoupled Distillation

Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective (Zeyuan Yin & Zhiqiang Shen et al., NeurIPS 2023) 🌐 📖
Dataset Distillation in Large Data Era (Zeyuan Yin et al., 2023) 📖
Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching (Shitong Shao et al., CVPR 2024) 📖
On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm (Peng Sun et al., CVPR 2024) 📖
Information Compensation: A Fix for Any-scale Dataset Distillation (Peng Sun et al., ICLR 2024 Workshop) 📖
Elucidating the Design Space of Dataset Condensation (Shitong Shao et al., NeurIPS 2024) 📖
Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment (Jiawei Du et al., NeurIPS 2024) 📖
Curriculum Dataset Distillation (Zhiheng Ma & Anjia Cao et al., 2024) 📖

Multimodal Distillation

Vision-Language Dataset Distillation (Xindi Wu et al., TMLR 2024) 🌐 📖
Low-Rank Similarity Mining for Multimodal Dataset Distillation (Yue Xu et al., ICML 2024) 📖

Self-Supervised Distillation

Self-Supervised Dataset Distillation for Transfer Learning (Dong Bok Lee & Seanie Lee et al., ICLR 2024) 📖
Efficiency for Free: Ideal Data Are Transportable Representations (Peng Sun et al., NeurIPS 2024) 📖
Self-supervised Dataset Distillation: A Good Compression Is All You Need (Muxin Zhou et al., 2024) 📖

Object Detection

Fetch and Forge: Efficient Dataset Condensation for Object Detection (Ding Qi et al., NeurIPS 2024) 📖

Benchmark

DC-BENCH: Dataset Condensation Benchmark (Justin Cui et al., NeurIPS 2022) 🌐 📖
A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness (Zongxiong Chen & Jiahui Geng et al., 2023) 📖
DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation (Yifan Wu et al., 2024) 📖
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation (Zheng Zhou et al., 2024) 🌐 📖

Survey

Data Distillation: A Survey (Noveen Sachdeva et al., TMLR 2023) 📖
A Survey on Dataset Distillation: Approaches, Applications and Future Directions (Jiahui Geng & Zongxiong Chen et al., IJCAI 2023) 📖
A Comprehensive Survey to Dataset Distillation (Shiye Lei et al., TPAMI 2023) 📖
Dataset Distillation: A Comprehensive Review (Ruonan Yu & Songhua Liu et al., TPAMI 2023) 📖

Ph.D. Thesis

Data-efficient Neural Network Training with Dataset Condensation (Bo Zhao, The University of Edinburgh 2023) 📖

Workshop

1st CVPR Workshop on Dataset Distillation (Saeed Vahidian et al., CVPR 2024) 🌐

Challenge

The First Dataset Distillation Challenge (Kai Wang & Ahmad Sajedi et al., ECCV 2024) 🌐

Applications

Continual Learning

Reducing Catastrophic Forgetting with Learning on Synthetic Data (Wojciech Masarczyk et al., CVPR 2020 Workshop) 📖
Condensed Composite Memory Continual Learning (Felix Wiewel et al., IJCNN 2021) 📖
Distilled Replay: Overcoming Forgetting through Synthetic Samples (Andrea Rosasco et al., IJCAI 2021 Workshop) 📖
Sample Condensation in Online Continual Learning (Mattia Sangermano et al., IJCNN 2022) 📖
An Efficient Dataset Condensation Plugin and Its Application to Continual Learning (Enneng Yang et al., NeurIPS 2023) 📖
Summarizing Stream Data for Memory-Restricted Online Continual Learning (Jianyang Gu et al., AAAI 2024) 📖

Privacy

Privacy for Free: How does Dataset Condensation Help Privacy? (Tian Dong et al., ICML 2022) 📖
Private Set Generation with Discriminative Information (Dingfan Chen et al., NeurIPS 2022) 📖
No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy" (Nicholas Carlini et al., 2022) 📖
Backdoor Attacks Against Dataset Distillation (Yugeng Liu et al., NDSS 2023) 📖
Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation (Margarita Vinaroz et al., 2023) 📖
Understanding Reconstruction Attacks with the Neural Tangent Kernel and Dataset Distillation (Noel Loo et al., ICLR 2024) 📖
Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective (Ming-Yu Chung et al., ICLR 2024) 📖
Differentially Private Dataset Condensation (Zheng et al., NDSS 2024 Workshop) 📖
Adaptive Backdoor Attacks Against Dataset Distillation for Federated Learning (Ze Chai et al., ICC 2024) 📖

Medical

Soft-Label Anonymous Gastric X-ray Image Distillation (Guang Li et al., ICIP 2020) 📖
Compressed Gastric Image Generation Based on Soft-Label Dataset Distillation for Medical Data Sharing (Guang Li et al., CMPB 2022) 📖
Dataset Distillation for Medical Dataset Sharing (Guang Li et al., AAAI 2023 Workshop) 📖
Communication-Efficient Federated Skin Lesion Classification with Generalizable Dataset Distillation (Yuchen Tian & Jiacheng Wang et al., MICCAI 2023 Workshop) 📖
Importance-Aware Adaptive Dataset Distillation (Guang Li et al., NN 2024) 📖
Image Distillation for Safe Data Sharing in Histopathology (Zhe Li et al., MICCAI 2024) 📖
MedSynth: Leveraging Generative Model for Healthcare Data Sharing (Renuga Kanagavelu et al., MICCAI 2024) 📖
Progressive Trajectory Matching for Medical Dataset Distillation (Zhen Yu et al., 2024) 📖
Dataset Distillation in Medical Imaging: A Feasibility Study (Muyang Li et al., 2024) 📖
Dataset Distillation for Histopathology Image Classification (Cong Cong et al., 2024) 📖

Federated Learning

Federated Learning via Synthetic Data (Jack Goetz et al., 2020) 📖
Distilled One-Shot Federated Learning (Yanlin Zhou et al., 2020) 📖
DENSE: Data-Free One-Shot Federated Learning (Jie Zhang & Chen Chen et al., NeurIPS 2022) 📖
FedSynth: Gradient Compression via Synthetic Data in Federated Learning (Shengyuan Hu et al., 2022) 📖
Meta Knowledge Condensation for Federated Learning (Ping Liu et al., ICLR 2023) 📖
DYNAFED: Tackling Client Data Heterogeneity with Global Dynamics (Renjie Pi et al., CVPR 2023) 📖
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning (Yuanhao Xiong & Ruochen Wang et al., CVPR 2023) 📖
Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments (Rui Song et al., IJCNN 2023) 📖
FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations (Hui-Po Wang et al., 2023) 📖
Federated Virtual Learning on Heterogeneous Data with Local-global Distillation (Chun-Yin Huang et al., 2023) 📖
An Aggregation-Free Federated Learning for Tackling Data Heterogeneity (Yuan Wang et al., CVPR 2024) 📖
Overcoming Data and Model Heterogeneities in Decentralized Federated Learning via Synthetic Anchors (Chun-Yin Huang et al., ICML 2024) 📖
DCFL: Non-IID Awareness Dataset Condensation Aided Federated Learning (Xingwang Wang et al., IJCNN 2024) 📖
Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents (Yuqi Jia & Saeed Vahidian et al., ECCV 2024) 📖

Graph Neural Network

Graph Condensation for Graph Neural Networks (Wei Jin et al., ICLR 2022) 📖
Condensing Graphs via One-Step Gradient Matching (Wei Jin et al., KDD 2022) 📖
Graph Condensation via Receptive Field Distribution Matching (Mengyang Liu et al., 2022) 📖
Kernel Ridge Regression-Based Graph Dataset Distillation (Zhe Xu et al., KDD 2023) 📖
Structure-free Graph Condensation: From Large-scale Graphs to Condensed Graph-free Data (Xin Zheng et al., NeurIPS 2023) 📖
Does Graph Distillation See Like Vision Dataset Counterpart? (Beining Yang & Kai Wang et al., NeurIPS 2023) 📖
Fair Graph Distillation (Qizhang Feng et al., NeurIPS 2023) 📖
CaT: Balanced Continual Graph Learning with Graph Condensation (Liu Yilun et al., ICDM 2023) 📖
Mirage: Model-Agnostic Graph Distillation for Graph Classification (Mridul Gupta & Sahil Manchanda et al., ICLR 2024) 📖
Graph Distillation with Eigenbasis Matching (Yang Liu & Deyu Bo et al., ICML 2024) 📖
Navigating Complexity: Toward Lossless Graph Condensation via Expanding Window Matching (Yuchen Zhang & Tianle Zhang & Kai Wang et al., ICML 2024) 📖
Graph Data Condensation via Self-expressive Graph Structure Reconstruction (Zhanyu Liu & Chaolv Zeng et al., KDD 2024) 📖
Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching (Tianle Zhang & Yuchen Zhang & Kai Wang et al., 2024) 📖

Survey

A Comprehensive Survey on Graph Reduction: Sparsification, Coarsening, and Condensation (Mohammad Hashemi et al., IJCAI 2024) 📖
Graph Condensation: A Survey (Xinyi Gao et al., 2024) 📖
A Survey on Graph Condensation (Hongjia Xu et al., 2024) 📖

Benchmark

GC-Bench: An Open and Unified Benchmark for Graph Condensation (Qingyun Sun & Ziying Chen et al., NeurIPS 2024) 📖
GCondenser: Benchmarking Graph Condensation (Yilun Liu et al., 2024) 📖
GC-Bench: A Benchmark Framework for Graph Condensation with New Insights (Shengbo Gong & Juntong Ni et al., 2024) 📖

No further updates will be made regarding graph distillation topics, as sufficient papers and summary projects are already available on the subject.

Neural Architecture Search

Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data (Felipe Petroski Such et al., ICML 2020) 📖
Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation (Dmitry Medvedev et al., AIST 2021) 📖
Calibrated Dataset Condensation for Faster Hyperparameter Search (Mucong Ding et al., 2024) 📖

Fashion, Art, and Design

Wearable ImageNet: Synthesizing Tileable Textures via Dataset Distillation (George Cazenavette et al., CVPR 2022 Workshop) 🌐 📖
Learning from Designers: Fashion Compatibility Analysis Via Dataset Distillation (Yulan Chen et al., ICIP 2022) 📖
Galaxy Dataset Distillation with Self-Adaptive Trajectory Matching (Haowen Guan et al., NeurIPS 2023 Workshop) 📖

Recommender Systems

Infinite Recommendation Networks: A Data-Centric Approach (Noveen Sachdeva et al., NeurIPS 2022) 📖
Gradient Matching for Categorical Data Distillation in CTR Prediction (Chen Wang et al., RecSys 2023) 📖

Blackbox Optimization

Bidirectional Learning for Offline Infinite-width Model-based Optimization (Can Chen et al., NeurIPS 2022) 📖
Bidirectional Learning for Offline Model-based Biological Sequence Design (Can Chen et al., ICML 2023) 📖

Trustworthy

Can We Achieve Robustness from Data Alone? (Nikolaos Tsilivis et al., ICML 2022 Workshop) 📖
Towards Robust Dataset Learning (Yihan Wu et al., 2022) 📖
Rethinking Data Distillation: Do Not Overlook Calibration (Dongyao Zhu et al., ICCV 2023) 📖
Towards Trustworthy Dataset Distillation (Shijie Ma et al., PR 2024) 📖
Group Distributionally Robust Dataset Distillation with Risk Minimization (Saeed Vahidian & Mingyu Wang & Jianyang Gu et al., 2024) 📖
Towards Adversarially Robust Dataset Distillation by Curvature Regularization (Eric Xue et al., 2024) 📖

Text

Data Distillation for Text Classification (Yongqi Li et al., 2021) 📖
Dataset Distillation with Attention Labels for Fine-tuning BERT (Aru Maekawa et al., ACL 2023) 📖
DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation (Aru Maekawa et al., NAACL 2024) 📖

Tabular

New Properties of the Data Distillation Method When Working With Tabular Data (Dmitry Medvedev et al., AIST 2020) 📖

Retrieval

Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching (Tao Feng & Jie Zhang et al., 2023) 📖

Video

Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement (Ziyu Wang & Yue Xu et al., CVPR 2024) 📖

Domain Adaptation

Multi-Source Domain Adaptation Meets Dataset Distillation through Dataset Dictionary Learning (Eduardo Montesuma et al., ICASSP 2024) 📖

Super Resolution

GSDD: Generative Space Dataset Distillation for Image Super-resolution (Haiyu Zhang et al., AAAI 2024) 📖

Time Series

Dataset Condensation for Time Series Classification via Dual Domain Matching (Zhanyu Liu et al., KDD 2024) 📖
CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting (Jianrong Ding & Zhanyu Liu et al., NeurIPS 2024) 📖
Less is More: Efficient Time Series Dataset Condensation via Two-fold Modal Matching (Hao Miao et al., VLDB 2025) 📖

Speech

Dataset-Distillation Generative Model for Speech Emotion Recognition (Fabian Ritter-Gutierrez et al., Interspeech 2024) 📖

Machine Unlearning

Distilled Datamodel with Reverse Gradient Matching (Jingwen Ye et al., CVPR 2024) 📖
Dataset Condensation Driven Machine Unlearning (Junaid Iqbal Khan, 2024) 📖

Reinforcement Learning

Dataset Distillation for Offline Reinforcement Learning (Jonathan Light & Yuanzhe Liu et al., ICML 2024 Workshop) 🌐 📖

Long-Tail

Distilling Long-tailed Datasets (Zhenghao Zhao & Haoxuan Wang et al., 2024) 📖

Media Coverage

Citing Awesome Dataset Distillation

If you find this project useful for your research, please use the following BibTeX entry.

@misc{li2022awesome,
  author={Li, Guang and Zhao, Bo and Wang, Tongzhou},
  title={Awesome Dataset Distillation},
  howpublished={\url{https://github.com/Guang000/Awesome-Dataset-Distillation}},
  year={2022}
}

Acknowledgments

We would like to express our heartfelt thanks to Nikolaos Tsilivis, Wei Jin, Yongchao Zhou, Noveen Sachdeva, Can Chen, Guangxiang Zhao, Shiye Lei, Xinchao Wang, Dmitry Medvedev, Seungjae Shin, Jiawei Du, Yidi Jiang, Xindi Wu, Guangyi Liu, Yilun Liu, Kai Wang, Yue Xu, Anjia Cao, Jianyang Gu, Yuanzhen Feng, Peng Sun, Ahmad Sajedi, Zhihao Sui, Ziyu Wang, Haoyang Liu, Eduardo Montesuma, Shengbo Gong, Zheng Zhou, Zhenghao Zhao, Duo Su, Tianhang Zheng, Shijie Ma, Wei Wei, Yantai Yang, Shaobo Wang, Xinhao Zhong, Zhiqiang Shen, Cong Cong, Chun-Yin Huang, Dai Liu, and Ruonan Yu for their valuable suggestions and contributions.

The Homepage of Awesome Dataset Distillation was designed and maintained by Longzhen Li.
