Authors: Yiding Sun*, Feng Wang*, Yutao Zhu, Wayne Xin Zhao, Jiaxin Mao
[Paper] [Github Repo]
🚀 [2024/07/18] We have an offline poster presentation at the SIGIR 2024 conference.
🚀 [2024/03/26] Our paper has been accepted to the demo paper track of the SIGIR 2024 conference!
📖 [2024/03/25] Our project has been renamed from ZHEM to Yulan-GARDEN. The name signifies that the Yulan series of LLMs (Yulan-LLM and Yulan-Chat) were brought up on data produced by Yulan-GARDEN.
The ability of foundation models relies heavily on large-scale, diverse, and high-quality pretraining data.
To improve data quality, researchers and practitioners often have to manually curate datasets from different sources and develop a dedicated data cleansing pipeline for each data repository.
Lacking a unified data processing framework, this process is repetitive and cumbersome.
To mitigate this issue, we propose Yulan-GARDEN (an inteGrated dAta pRocessing framework for founDation modEls pretraiNing), a data processing framework that integrates a Processing Module, which consists of a series of operators at different granularity levels, and an Analyzing Module, which supports probing and evaluation of the refined data. We first introduce how to use this framework with some example use cases and then demonstrate its effectiveness in improving data quality through an automated evaluation with ChatGPT and an end-to-end evaluation in pretraining a GPT-2 model.
More details can be found in our paper.
First, install the required packages listed in requirements.txt.
Modify settings/example.json to describe your desired processing procedure. The configuration file contains many options. For example, if you want to use regular expressions to match dirty tokens, add your own regular expressions to re_list of rm_re_rules.
Copy the raw data to the input_path defined in settings/example.json, and make sure that output_path does not already exist; otherwise, the output path will be overwritten.
pip install -r requirements.txt
cp <raw_data> /path/to/input_path
Then run the processing pipeline, which generates a debug_report.json in the /path/to/report_path defined in settings/example.json. The report shows the filter ratio of each filter rule, as well as the match ratio, time cost, and match cases of each regular expression cleaner rule, among other statistics.
python main.py --conf settings/example.json
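To make the report fields more concrete, the following is a minimal sketch of how per-rule match ratio, time cost, and match cases could be gathered while applying a list of regular expressions. It is not the framework's own implementation: the function name `apply_regex_rules`, the layout of the statistics dictionary, and the two example patterns are all assumptions made for illustration.

```python
import re
import time


def apply_regex_rules(texts, patterns, max_cases=3):
    """Remove every match of each pattern and collect per-rule statistics.

    Hypothetical helper for illustration; not part of Yulan-GARDEN.
    """
    compiled = [re.compile(p) for p in patterns]
    stats = {p: {"matched_docs": 0, "seconds": 0.0, "cases": []} for p in patterns}
    cleaned = []
    for text in texts:
        for pattern, regex in zip(patterns, compiled):
            start = time.perf_counter()
            matches = [m.group(0) for m in regex.finditer(text)]
            text = regex.sub("", text)
            stats[pattern]["seconds"] += time.perf_counter() - start
            if matches:
                stats[pattern]["matched_docs"] += 1
                # Keep only a few example matches per rule so the report stays small.
                room = max(max_cases - len(stats[pattern]["cases"]), 0)
                stats[pattern]["cases"].extend(matches[:room])
        cleaned.append(text)
    total = max(len(texts), 1)
    for rule in stats.values():
        # Fraction of documents in which the rule matched at least once.
        rule["match_ratio"] = rule["matched_docs"] / total
    return cleaned, stats


docs = ["Visit http://example.com now", "plain text", "<b>bold</b> claim"]
rules = [r"https?://\S+", r"</?\w+>"]  # illustrative patterns only
cleaned, report = apply_regex_rules(docs, rules)
```

Inspecting `report` for a small sample like this is a quick way to check whether a new pattern matches too much or too little before adding it to the configuration.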
After the processing pipeline finishes, the cleaned data is merged into a single .jsonl file with one record per line. Each record is a JSON object whose text field holds the refined text.
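The output format described above can be read back with a few lines of Python. This is a minimal sketch: the helper name `iter_texts` and the file name `cleaned.jsonl` are assumptions, and the demo writes its own small file so that it runs without any pipeline output.

```python
import json
import os
import tempfile


def iter_texts(path):
    """Yield the "text" field of every record in a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)["text"]


# Demo on a temporary file that mimics the one-record-per-line layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cleaned.jsonl")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "first refined document"}) + "\n")
        f.write(json.dumps({"text": "second refined document"}) + "\n")
    texts = list(iter_texts(demo_path))
```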
An example configuration is provided in settings/example.json, which includes comments on the parameters. Further details are in utils/workers/README.md.
The Yulan-GARDEN framework consists of two main modules: the Processing Module and the Analyzing Module. Usage details can be found in the README.md of the corresponding directories.
The Processing Module refines raw datasets by unifying different formats, filtering and denoising irrelevant information, and deduplicating. It consists of four components: Reformatter, Filter, Cleaner, and Deduplicator. The details of each component can be found here.
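The four stages can be pictured as a simple chain. The sketch below is a toy illustration of that flow, not the framework's code: every function name, the field names `content` and `body`, the minimum-word threshold, and the exact-hash deduplication are assumptions chosen to keep the example short.

```python
import hashlib
import re


def reformat(record):
    # Reformatter: map differently named fields onto a single "text" field.
    for key in ("text", "content", "body"):
        if key in record:
            return {"text": str(record[key])}
    return {"text": ""}


def keep(record, min_words=3):
    # Filter: drop documents that are too short to be useful.
    return len(record["text"].split()) >= min_words


def clean(record):
    # Cleaner: strip HTML-like tags and collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", record["text"])
    return {"text": re.sub(r"\s+", " ", text).strip()}


def deduplicate(records):
    # Deduplicator: exact-match deduplication on a hash of the lowercased text.
    seen = set()
    for record in records:
        digest = hashlib.sha256(record["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


def process(raw_records):
    reformatted = (reformat(r) for r in raw_records)
    filtered = (r for r in reformatted if keep(r))
    cleaned = (clean(r) for r in filtered)
    return list(deduplicate(cleaned))


raw = [
    {"content": "<p>Foundation models need clean data.</p>"},
    {"body": "Foundation models need clean data."},
    {"text": "too short"},
]
result = process(raw)
```

In this toy run, the first two records collapse into one after cleaning, and the third is removed by the length filter.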
The Analyzing Module helps users gain a deeper understanding of datasets through statistical analysis, domain-specific knowledge retrieval, parameter analysis of the Filter, and match cases of the Cleaner. It has three components: Evaluator, Retriever, and Debugger.
We conduct two experiments to validate the effectiveness of Yulan-GARDEN in improving data quality. Through automated evaluation with ChatGPT, we observe a significant improvement in the quality of the refined datasets when Yulan-GARDEN is applied to OpenWebText2, Wikipedia, and HackerNews. In the end-to-end evaluation, we train two GPT-2 models on CommonCrawl before and after processing, respectively. The model trained on the refined data performs markedly better than the baseline across downstream language modeling tasks.
We adopt ChatGPT as an automated evaluation tool for data quality, leveraging its strong instruction-following ability. For each pair of texts (i.e., before and after processing), ChatGPT judges which one is more suitable for training LLMs. Specifically, it is prompted to consider dimensions such as text formatting, fluency, coherence, and informativeness. Data pairs that are identical before and after processing, or that exceed ChatGPT's context length limit, are excluded.
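The exclusion rule for data pairs can be sketched as a small pre-selection step. This is only an illustration under stated assumptions: the function name `select_pairs` is hypothetical, and a whitespace word count with an arbitrary `max_words` threshold stands in for the model's real token-based context limit.

```python
def select_pairs(pairs, max_words=2000):
    """Keep (raw, refined) pairs that are worth sending to the judge.

    max_words is a rough, hypothetical proxy for the context length limit.
    """
    selected = []
    for raw, refined in pairs:
        if raw == refined:
            continue  # processing changed nothing, so there is nothing to compare
        if len(raw.split()) + len(refined.split()) > max_words:
            continue  # the pair would not fit into a single prompt
        selected.append((raw, refined))
    return selected


candidates = [
    ("same text", "same text"),
    ("<div>noisy text</div>", "noisy text"),
    ("word " * 50, "word " * 40),
]
kept = select_pairs(candidates, max_words=20)
```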
The experimental results are shown in the table below. On all datasets, the processed data wins far more comparisons than the raw data.
Datasets | #Win | #Lose | #Tied |
---|---|---|---|
OpenWebText2 | 338 | 162 | 0 |
Wikipedia(en) | 333 | 161 | 6 |
HackerNews | 382 | 112 | 6 |
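As a quick reading aid, the counts in the table can be turned into win rates. The snippet below only restates the table's numbers; the helper name `win_rate` is an assumption, and ties are counted in the denominator.

```python
# (win, lose, tied) counts copied from the table above.
results = {
    "OpenWebText2": (338, 162, 0),
    "Wikipedia(en)": (333, 161, 6),
    "HackerNews": (382, 112, 6),
}


def win_rate(win, lose, tied):
    """Share of all compared pairs in which the refined text was preferred."""
    return win / (win + lose + tied)


rates = {name: round(100 * win_rate(*counts), 1) for name, counts in results.items()}
for name, rate in rates.items():
    print(f"{name}: refined data preferred in {rate}% of pairs")
```

Each dataset has 500 compared pairs, so the refined text is preferred in roughly two thirds to three quarters of them.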
To evaluate data quality more intuitively, we train two GPT-2 models, one on the raw data and one on the refined data, denoted GPT-2-raw and GPT-2-ref, respectively. We then evaluate their language modeling capabilities on other corpora in an end-to-end manner.
The evaluation results are shown below. Notably, GPT-2-ref reaches after only 0.25M steps the loss that GPT-2-raw reaches after 2M steps. Across all datasets, GPT-2-ref outperforms GPT-2-raw. Additionally, the PPL of GPT-2-ref on the LAMBADA and WikiText103 datasets decreases noticeably faster than that of GPT-2-raw in the initial stage. This indicates that our data processing framework improves both the efficiency and the effectiveness of training foundation models.
Models | LAMBADA(PPL) | WikiText103(PPL) | 1BW(PPL) | CBT-CN(ACC) | CBT-NE(ACC) |
---|---|---|---|---|---|
GPT-2-raw | 134.04 | 97.32 | 220.50 | 61.05 | 44.48 |
GPT-2-ref | 122.43 | 81.98 | 175.59 | 72.60 | 50.98 |
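The gaps in the table are easier to compare as relative perplexity reductions and absolute accuracy gains. The snippet below only does arithmetic on the table's values; the helper name `relative_reduction` is an assumption.

```python
# (GPT-2-raw, GPT-2-ref) values copied from the table above.
perplexity = {
    "LAMBADA": (134.04, 122.43),
    "WikiText103": (97.32, 81.98),
    "1BW": (220.50, 175.59),
}
accuracy = {
    "CBT-CN": (61.05, 72.60),
    "CBT-NE": (44.48, 50.98),
}


def relative_reduction(raw, ref):
    """Fraction by which the refined model lowers perplexity (lower is better)."""
    return (raw - ref) / raw


ppl_drop = {name: 100 * relative_reduction(*v) for name, v in perplexity.items()}
acc_gain = {name: ref - raw for name, (raw, ref) in accuracy.items()}

for name, drop in ppl_drop.items():
    print(f"{name}: perplexity reduced by {drop:.2f}%")
for name, gain in acc_gain.items():
    print(f"{name}: accuracy up by {gain:.2f} points")
```

This works out to perplexity reductions of roughly 8.7%, 15.8%, and 20.4%, and accuracy gains of about 11.6 and 6.5 points.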
If you find our work useful, please cite this repository or our paper:
@inproceedings{10.1145/3626772.3657671,
author = {Sun, Yiding and Wang, Feng and Zhu, Yutao and Zhao, Wayne Xin and Mao, Jiaxin},
title = {An Integrated Data Processing Framework for Pretraining Foundation Models},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657671},
doi = {10.1145/3626772.3657671},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2713–2718},
numpages = {6},
keywords = {data processing, data quality, large language models},
location = {Washington DC, USA},
series = {SIGIR '24}
}