improve docs with highlighted features and new figures #16

Merged · 6 commits · Sep 7, 2023
32 changes: 21 additions & 11 deletions README.md
@@ -47,21 +47,31 @@ Table of Contents

## Features

- **Broad Range of Operators**: Equipped with 50+ core [operators (OPs)](docs/Operators.md), including Formatters, Mappers, Filters, Deduplicators, and beyond.
![Overview](docs/imgs/overview.png)

- **Specialized Toolkits**: Feature-rich specialized toolkits such as [Text Quality Classifier](tools/quality_classifier/README.md), [Dataset Splitter](tools/preprocess/README.md), [Analysers](#data-analysis), [Evaluators](tools/evaluator/README.md), and more that elevate your dataset handling capabilities.
- **Systematic & Reusable**:
Empowering users with a systematic library of 20+ reusable [config recipes](configs), 50+ core [OPs](docs/Operators.md), and feature-rich
dedicated [toolkits](#documentation), designed to
function independently of specific LLM datasets and processing pipelines.

- **Systematic & Reusable**: Empowering users with a systematic library of reusable [config recipes](configs) and [OPs](docs/Operators.md), designed to function independently of specific datasets, models, or tasks.
- **Data-in-the-loop**: Allowing detailed data analyses with an automated
report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
![Data-in-the-loop](docs/imgs/feedback_loop.png)

- **Data-in-the-loop**: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with timely multi-dimension automatic evaluation capabilities, it supports a feedback loop at multiple stages in the LLM development process.
- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for
pre-training, post-tuning, en, zh, and more scenarios. Validated on
reference LLaMA models.
![exp_llama](docs/imgs/exp_on_llama.png)

- **Comprehensive Processing Recipes**: Offering tens of [pre-built data processing recipes](configs/data_juicer_recipes/README.md) for pre-training, SFT, en, zh, and more scenarios.
- **Enhanced Efficiency**: Providing a speedy data processing pipeline
requiring less memory and CPU usage, optimized for maximum productivity.
![sys-perf](docs/imgs/sys_perf.png)

- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).

- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.

- **Enhanced Efficiency**: Providing a speedy data processing pipeline requiring less memory, optimized for maximum productivity.
- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
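
The operator categories named in the feature list (Mappers, Filters, Deduplicators) and the idea of "flexible combinations of OPs" can be pictured as small composable functions applied to jsonl samples. The following is only a minimal sketch under stated assumptions: every name in it (`whitespace_mapper`, `min_length_filter`, `exact_dedup`, `run_pipeline`, `load_jsonl`) is hypothetical and chosen for illustration, and none of it is Data-Juicer's actual API. See [docs/Operators.md](docs/Operators.md) for the real operators.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, str]


def load_jsonl(lines: Iterable[str]) -> List[Sample]:
    """Parse jsonl input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def whitespace_mapper(sample: Sample) -> Sample:
    """Hypothetical Mapper: edits a sample in place of the original text."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_length_filter(min_chars: int) -> Callable[[Sample], bool]:
    """Hypothetical Filter: returns a predicate that keeps long-enough samples."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"]) >= min_chars
    return keep


def exact_dedup(samples: Iterable[Sample]) -> List[Sample]:
    """Hypothetical Deduplicator: drops samples whose text hash was already seen."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def run_pipeline(
    samples: List[Sample],
    mappers: List[Callable[[Sample], Sample]],
    filters: List[Callable[[Sample], bool]],
) -> List[Sample]:
    """Apply all mappers, then all filters, then exact deduplication."""
    for mapper in mappers:
        samples = [mapper(s) for s in samples]
    for keep in filters:
        samples = [s for s in samples if keep(s)]
    return exact_dedup(samples)


if __name__ == "__main__":
    raw = [
        '{"text": "hello   world"}',
        '{"text": "hello world"}',
        '{"text": "hi"}',
    ]
    cleaned = run_pipeline(load_jsonl(raw), [whitespace_mapper], [min_length_filter(5)])
    print(cleaned)
```

Adding or removing a step in this sketch means editing the `mappers` or `filters` lists, which loosely mirrors the "adding/removing OPs from existing configs" workflow described above; the actual project drives this through its config files rather than Python lists.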

## Prerequisites

@@ -193,8 +203,8 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
## Data Recipes
- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
- [Recipes for data process in RedPajama](configs/redpajama/README.md)
- [Refined recipes for pretraining data](configs/data_juicer_recipes/README.md)
- [Refined recipes for SFT data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
- [Refined recipes for pre-training data](configs/data_juicer_recipes/README.md)
- [Refined recipes for post-tuning data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)

## Demos
- Introduction to Data-Juicer [[ModelScope](https://modelscope.cn/studios/Data-Juicer/overview_scan/summary)]
@@ -211,8 +221,8 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
- Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]
- Auto Evaluation on [HELM](https://github.com/stanford-crfm/helm) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/auto_evaluation_helm/summary)]
- Data Sampling and Mixture [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_mixture/summary)]
- Data Process Loop [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_loop/summary)]
- Data Process HPO [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_hpo/summary)]
- Data Processing Loop [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_loop/summary)]
- Data Processing HPO [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_hpo/summary)]

## License
Data-Juicer is released under Apache License 2.0.
15 changes: 7 additions & 8 deletions README_ZH.md
@@ -46,21 +46,20 @@ Data-Juicer is a one-stop data processing system intended for large language models (LLM

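## Features

* **Rich Set of Operators**: Ships with 50+ built-in core [operators (OPs)](docs/Operators_ZH.md), including Formatters, Mappers, Filters, Deduplicators, and more.
![Overview](docs/imgs/overview.png)

* **Specialized Toolkits**: Provides feature-rich specialized toolkits such as the [Text Quality Classifier](tools/quality_classifier/README_ZH.md), [Dataset Splitter](tools/preprocess/README_ZH.md), [Analysers](#数据分析), [Evaluators](tools/evaluator/README_ZH.md), and more, to strengthen your data processing capabilities
* **Systematic & Reusable**: Provides users with 20+ systematic and reusable [config recipes](configs/README_ZH.md), 50+ core [OPs](docs/Operators_ZH.md), and dedicated [toolkits](#documentation), designed to make data processing independent of specific LLM datasets and processing pipelines

* **Systematic & Reusable**: Provides users with systematic and reusable [config recipes](configs/README_ZH.md) and an [OP library](docs/Operators_ZH.md), designed to make data processing run independently of specific datasets, models, or tasks.
* **Data-in-the-loop**: Supports detailed data analysis with automated report generation, giving you a deeper understanding of your dataset. Combined with multi-dimension automatic evaluation, it supports a timely feedback loop at multiple stages of the LLM development process. ![Data-in-the-loop](docs/imgs/feedback_loop.png)

* **Data-in-the-loop**: Supports detailed data analysis with automated report generation, giving you a deeper understanding of your dataset. Combined with timely multi-dimension automatic evaluation, it supports a feedback loop at multiple stages of the LLM development process.
* **Comprehensive Data Processing Recipes**: Offers dozens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, post-tuning, Chinese, English, and other scenarios. ![exp_llama](docs/imgs/exp_on_llama.png)

* **Comprehensive Processing Recipes**: Offers dozens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, SFT, Chinese, English, and other scenarios.
* **Enhanced Efficiency**: Provides an efficient data processing pipeline that reduces memory footprint and CPU overhead and improves productivity. ![sys-perf](docs/imgs/sys_perf.png)

* **User-Friendly**: Designed to be simple and easy to use, with comprehensive [documentation](#documentation), easy [getting-started guides](#快速上手), and [demo configs](configs/README_ZH.md); OPs can be easily added to or removed from [existing configs](configs/config_all.yaml).

* **Flexible & Extensible**: Supports most data formats (e.g., jsonl, parquet, csv) and allows flexible combinations of OPs. Supports [custom OPs](docs/DeveloperGuide_ZH.md#构建自己的算子) for customized data processing.

* **Enhanced Efficiency**: Provides an efficient data processing pipeline that reduces memory footprint and improves productivity.

## Prerequisites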

@@ -189,7 +188,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
* [BLOOM data processing recipes](configs/reproduced_bloom/README_ZH.md)
* [RedPajama data processing recipes](configs/reproduced_redpajama/README_ZH.md)
* [Refined recipes for pre-training data](configs/data_juicer_recipes/README_ZH.md)
* [Refined recipes for SFT data](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
* [Refined recipes for post-tuning data](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)

## Demos

Binary file added docs/imgs/exp_on_llama.png
Binary file added docs/imgs/feedback_loop.png
Binary file added docs/imgs/overview.png
Binary file added docs/imgs/sys_perf.png