diff --git a/README.md b/README.md
index 4461c147b..13f7223be 100644
--- a/README.md
+++ b/README.md
@@ -47,21 +47,31 @@ Table of Contents
 ## Features
-- **Broad Range of Operators**: Equipped with 50+ core [operators (OPs)](docs/Operators.md), including Formatters, Mappers, Filters, Deduplicators, and beyond.
+![Overview](docs/imgs/overview.png)
-- **Specialized Toolkits**: Feature-rich specialized toolkits such as [Text Quality Classifier](tools/quality_classifier/README.md), [Dataset Splitter](tools/preprocess/README.md), [Analysers](#data-analysis), [Evaluators](tools/evaluator/README.md), and more that elevate your dataset handling capabilities.
+- **Systematic & Reusable**:
+ Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
+ dedicated [toolkits](#documentation), designed to
+ function independently of specific LLM datasets and processing pipelines.
-- **Systematic & Reusable**: Empowering users with a systematic library of reusable [config recipes](configs) and [OPs](docs/Operators.md), designed to function independently of specific datasets, models, or tasks.
+- **Data-in-the-loop**: Allowing detailed data analyses with an automated
+ report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
+ ![Data-in-the-loop](docs/imgs/feedback_loop.png)
-- **Data-in-the-loop**: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with timely multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
+- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
+ processing recipes](configs/data_juicer_recipes/README.md) for
+ pre-training, post-tuning, en, zh, and more scenarios. Validated on
+ reference LLaMA models.
+ ![exp_llama](docs/imgs/exp_on_llama.png)
-- **Comprehensive Processing Recipes**: Offering tens of [pre-built data processing recipes](configs/data_juicer_recipes/README.md) for pre-training, SFT, en, zh, and more scenarios.
+- **Enhanced Efficiency**: Providing a speedy data processing pipeline
+ requiring less memory and CPU usage, optimized for maximum productivity.
+ ![sys-perf](docs/imgs/sys_perf.png)
-- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
 - **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.
-- **Enhanced Efficiency**: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.
+- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
 ## Prerequisites
@@ -193,8 +203,8 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 ## Data Recipes
 - [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
 - [Recipes for data process in RedPajama](configs/redpajama/README.md)
-- [Refined recipes for pretraining data](configs/data_juicer_recipes/README.md)
-- [Refined recipes for SFT data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
+- [Refined recipes for pre-training data](configs/data_juicer_recipes/README.md)
+- [Refined recipes for post-tuning data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
 ## Demos
 - Introduction to Data-Juicer [[ModelScope](https://modelscope.cn/studios/Data-Juicer/overview_scan/summary)]
@@ -211,8 +221,8 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 - Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]
 - Auto Evaluation on [HELM](https://github.com/stanford-crfm/helm) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/auto_evaluation_helm/summary)]
 - Data Sampling and Mixture [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_mixture/summary)]
-- Data Process Loop [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_loop/summary)]
-- Data Process HPO [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_hpo/summary)]
+- Data Processing Loop [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_loop/summary)]
+- Data Processing HPO [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_hpo/summary)]
 ## License
 Data-Juicer is released under Apache License 2.0.
diff --git a/README_ZH.md b/README_ZH.md
index 96fe4f0c1..5d76d925f 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -46,21 +46,20 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM
 ## 特点
-* **丰富的算子**:内置了 50 多个核心 [算子(OPs)](docs/Operators_ZH.md),包括 Formatters,Mappers,Filters,Deduplicators 等。
+![Overview](docs/imgs/overview.png)
-* **专业的工具库**:提供功能丰富的专业工具库,例如 [文本质量打分器](tools/quality_classifier/README_ZH.md),[数据分割器](tools/preprocess/README_ZH.md),[分析器](#数据分析),[评估器](tools/evaluator/README_ZH.md) 等,提升您的数据处理能力。
+* **系统化 & 可复用**:为用户提供系统化且可复用的20+[配置菜谱](configs/README_ZH.md),50+核心[算子](docs/Operators_ZH.md)和专用[工具池](#documentation),旨在让数据处理独立于特定的大语言模型数据集和处理流水线。
-* **系统化 & 可复用**:为用户提供系统化且可复用的[配置菜谱](configs/README_ZH.md)和[算子库](docs/Operators_ZH.md),旨在让数据处理独立于特定的数据集、模型或任务运行。
+* **数据反馈回路**:支持详细的数据分析,并提供自动报告生成功能,使您深入了解您的数据集。结合多维度自动评估功能,支持在 LLM 开发过程的多个阶段进行及时反馈循环。 ![Data-in-the-loop](docs/imgs/feedback_loop.png)
-* **数据反馈回路**:支持详细的数据分析,并提供自动报告生成功能,使您深入了解您的数据集。结合及时多维度自动评估功能,支持在 LLM 开发过程的多个阶段进行反馈循环。
+* **全面的数据处理菜谱**:为pre-training、post-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 ![exp_llama](docs/imgs/exp_on_llama.png)
-* **全面的处理菜谱**:为预训练、SFT、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。
+* **效率增强**:提供高效的数据处理流水线,减少内存占用和CPU开销,提高生产力。 ![sys-perf](docs/imgs/sys_perf.png)
 * **用户友好**:设计简单易用,提供全面的[文档](#documentation)、简易[入门指南](#快速上手)和[演示配置](configs/README_ZH.md),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。
-
+
 * **灵活 & 易扩展**:支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。
-* **效率增强**:提供高效的数据处理流水线,减少内存占用,提高生产力。
 ## 前置条件
@@ -189,7 +188,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 * [BLOOM 数据处理菜谱](configs/reproduced_bloom/README_ZH.md)
 * [RedPajama 数据处理菜谱](configs/reproduced_redpajama/README_ZH.md)
 * [预训练数据增强菜谱](configs/data_juicer_recipes/README_ZH.md)
-* [SFT数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
+* [Post-tuning数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
 ## 演示样例
diff --git a/docs/imgs/exp_on_llama.png b/docs/imgs/exp_on_llama.png
new file mode 100644
index 000000000..d8fdc4c9c
Binary files /dev/null and b/docs/imgs/exp_on_llama.png differ
diff --git a/docs/imgs/feedback_loop.png b/docs/imgs/feedback_loop.png
new file mode 100644
index 000000000..bac4b9b01
Binary files /dev/null and b/docs/imgs/feedback_loop.png differ
diff --git a/docs/imgs/overview.png b/docs/imgs/overview.png
new file mode 100644
index 000000000..820454bcf
Binary files /dev/null and b/docs/imgs/overview.png differ
diff --git a/docs/imgs/sys_perf.png b/docs/imgs/sys_perf.png
new file mode 100644
index 000000000..32d8ca3b2
Binary files /dev/null and b/docs/imgs/sys_perf.png differ