diff --git a/LICENSE b/LICENSE index bc0945c47..5033caeee 100644 --- a/LICENSE +++ b/LICENSE @@ -251,7 +251,7 @@ Code in data_juicer/ops/mapper/clean_copyright_mapper.py, data_juicer/ops/mapper data_juicer/ops/mapper/expand_macro_mapper.py, data_juicer/ops/mapper/remove_bibliography_mapper.py, data_juicer/ops/mapper/remove_comments_mapper.py, data_juicer/ops/mapper/remove_header_mapper.py, is adapted from -https://github.com/togethercomputer/RedPajama-Data +https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/ Copyright 2023 RedPajama authors. diff --git a/README.md b/README.md index ae131073a..43105b5b6 100644 --- a/README.md +++ b/README.md @@ -350,7 +350,7 @@ Cloud's platform for AI (PAI). We look forward to more of your experience, suggestions and discussions for collaboration! Data-Juicer thanks and refers to several community projects, such as -[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... 
+[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... diff --git a/README_ZH.md b/README_ZH.md index 1b5b9f50e..e22e506b2 100644 --- a/README_ZH.md +++ b/README_ZH.md @@ -328,7 +328,7 @@ Data-Juicer 被各种 LLM产品和研究工作使用,包括来自阿里云-通 Data-Juicer 感谢并参考了社区开源项目: -[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... 
+[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... diff --git a/configs/reproduced_redpajama/README.md b/configs/reproduced_redpajama/README.md index e17703425..b6a0b12b1 100644 --- a/configs/reproduced_redpajama/README.md +++ b/configs/reproduced_redpajama/README.md @@ -1,9 +1,9 @@ # Redpajama Config Files -This folder contains example configuration files to easily and quickly reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep). +This folder contains example configuration files to easily and quickly reproduce the processing flow of [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep). ## arXiv -The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv). +The raw data files can be downloaded from the same AWS link as in [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv). 
Once downloaded, use [raw_arxiv_to_jsonl.py](../../tools/preprocess/raw_arxiv_to_jsonl.py) to convert from the original format to `jsonl` that Data-Juicer can handle easily: @@ -30,7 +30,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arx ## Books -The raw data files can be downloaded from the same HuggingFace datasets as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/book). +The raw data files can be downloaded from the same HuggingFace datasets as in [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/book). Once downloaded, modify the path configurations in [redpajama-books.yaml](redpajama-books.yaml) and execute the following command to reproduce the processing flow of RedPajama. @@ -47,7 +47,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-boo ## Code -The raw data files can be downloaded from Google BigQuery as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/github). +The raw data files can be downloaded from Google BigQuery as in [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/github). Once downloaded, unzip and delete files whose extensions are not in the following whitelist: @@ -70,7 +70,7 @@ python tools/process_data.py --config configs/redpajama/redpajama-code.yaml ## StackExchange -The raw data files can be downloaded from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange). +The raw data files can be downloaded from the same Archive link as in [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange). 
Once downloaded, use [raw_stackexchange_to_jsonl.py](../../tools/preprocess/raw_stackexchange_to_jsonl.py) to convert from the original format to `jsonl` that Data-Juicer can handle easily: diff --git a/configs/reproduced_redpajama/README_ZH.md b/configs/reproduced_redpajama/README_ZH.md index 9a527c093..41c487f61 100644 --- a/configs/reproduced_redpajama/README_ZH.md +++ b/configs/reproduced_redpajama/README_ZH.md @@ -1,10 +1,10 @@ # Redpajama 配置文件 -此文件夹包含的配置文件用于轻松复现 [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep) 的处理流程。 +此文件夹包含的配置文件用于轻松复现 [Redpajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep) 的处理流程。 ## arXiv -原始数据文件从 [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv) 中相同的 AWS 链接下载。 +原始数据文件从 [Redpajama/arXiv](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv) 中相同的 AWS 链接下载。 下载完成后,使用 [raw_arxiv_to_jsonl.py](../../tools/preprocess/raw_arxiv_to_jsonl.py) 将原始格式转换为 Data-Juicer 易于处理的格式: @@ -31,7 +31,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-arx ## Books -原始数据文件从 [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/book) 中相同的 HuggingFace 链接下载。 +原始数据文件从 [Redpajama/Books](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/book) 中相同的 HuggingFace 链接下载。 下载完成后,修改 [redpajama-books.yaml](redpajama-books.yaml) 中的数据路径,执行以下命令复现 RedPajama 的处理流程: @@ -48,7 +48,7 @@ python tools/process_data.py --config configs/reproduced_redpajama/redpajama-boo ## Code -原始数据文件从 [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/github) 中相同的 Google BigQuery 获取。 +原始数据文件从 [Redpajama/Code](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/github) 中相同的 Google BigQuery 获取。 下载完成后,解压缩并删除扩展名不在以下白名单中的其他文件: @@ -71,7 +71,7 @@ python tools/process_data.py --config configs/redpajama/redpajama-code.yaml ## StackExchange 
-原始数据文件从 [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange) 中相同的 Archive 链接获取。 +原始数据文件从 [Redpajama/Stack_exchange](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange) 中相同的 Archive 链接获取。 下载完成后,使用 [raw_stackexchange_to_jsonl.py](../../tools/preprocess/raw_stackexchange_to_jsonl.py) 将原始格式转换为 Data-Juicer 易于处理的格式: diff --git a/data_juicer/ops/mapper/clean_copyright_mapper.py b/data_juicer/ops/mapper/clean_copyright_mapper.py index c5b046d0e..dabb0cd40 100644 --- a/data_juicer/ops/mapper/clean_copyright_mapper.py +++ b/data_juicer/ops/mapper/clean_copyright_mapper.py @@ -1,5 +1,5 @@ # Some code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/ +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/ # -------------------------------------------------------- import regex as re diff --git a/data_juicer/ops/mapper/clean_html_mapper.py b/data_juicer/ops/mapper/clean_html_mapper.py index dc45754fa..5c2c30c57 100644 --- a/data_juicer/ops/mapper/clean_html_mapper.py +++ b/data_juicer/ops/mapper/clean_html_mapper.py @@ -1,5 +1,5 @@ # Some code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/ +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/ # -------------------------------------------------------- from data_juicer.utils.availability_utils import AvailabilityChecking diff --git a/data_juicer/ops/mapper/expand_macro_mapper.py b/data_juicer/ops/mapper/expand_macro_mapper.py index 1792796ca..2f5d7fe83 100644 --- a/data_juicer/ops/mapper/expand_macro_mapper.py +++ b/data_juicer/ops/mapper/expand_macro_mapper.py @@ -1,5 +1,5 @@ # Some code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py +# https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/arxiv_cleaner.py # 
-------------------------------------------------------- import regex as re diff --git a/data_juicer/ops/mapper/remove_bibliography_mapper.py b/data_juicer/ops/mapper/remove_bibliography_mapper.py index 7a5c815ca..2ce852d66 100644 --- a/data_juicer/ops/mapper/remove_bibliography_mapper.py +++ b/data_juicer/ops/mapper/remove_bibliography_mapper.py @@ -1,5 +1,5 @@ # Some code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/ +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/ # -------------------------------------------------------- import regex as re diff --git a/data_juicer/ops/mapper/remove_comments_mapper.py b/data_juicer/ops/mapper/remove_comments_mapper.py index b3533dd2b..c5f083c14 100644 --- a/data_juicer/ops/mapper/remove_comments_mapper.py +++ b/data_juicer/ops/mapper/remove_comments_mapper.py @@ -1,5 +1,5 @@ # Some code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/ +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/ # -------------------------------------------------------- from typing import List, Union diff --git a/data_juicer/ops/mapper/remove_header_mapper.py b/data_juicer/ops/mapper/remove_header_mapper.py index 4c36bde64..45af546e5 100644 --- a/data_juicer/ops/mapper/remove_header_mapper.py +++ b/data_juicer/ops/mapper/remove_header_mapper.py @@ -1,5 +1,5 @@ # Some code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/ +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/ # -------------------------------------------------------- import regex as re diff --git a/tools/preprocess/README.md b/tools/preprocess/README.md index 6a33910ed..b0bf5c3ae 100644 --- a/tools/preprocess/README.md +++ b/tools/preprocess/README.md @@ -49,7 +49,7 @@ python tools/preprocess/raw_arxiv_to_jsonl.py --help **Note:** -* For downloading process, please refer to 
[here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv). +* For the downloading process, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv). * Before you downloading, converting or processing, you might make sure that your drive space is large enough to store the raw data (over 3TB), converted data (over 3TB), at least processed data (about 500-600GB), and even more cache data during processing. @@ -71,7 +71,7 @@ python tools/preprocess/raw_arxiv_stackexchange_to_jsonl.py \ # get help python tools/preprocess/raw_stackexchange_to_jsonl.py --help ``` -- `src_dir`: if you download raw Stack Exchange data as Redpajama did, you will get a directory src which includes hundreds of 7z files whose filenames are like `*.*.com.7z `. You need to unzip these files and rename the POSTs.xml to the corresponding compressed package name and place it in that dir. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange). +- `src_dir`: if you download the raw Stack Exchange data as RedPajama did, you will get a directory `src` containing hundreds of 7z files with names like `*.*.com.7z`. You need to unzip these files, rename each POSTs.xml to the name of its corresponding archive, and place it in that directory. For more details, please refer to [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange). - `target_dir`: result directory to store the converted jsonl files. - `topk` (optional): select the topk sites with the most content. Default it's 28. - `num_proc` (optional): number of process workers. Default it's 1.
diff --git a/tools/preprocess/README_ZH.md b/tools/preprocess/README_ZH.md index f715a50df..8f2799ed2 100644 --- a/tools/preprocess/README_ZH.md +++ b/tools/preprocess/README_ZH.md @@ -48,7 +48,7 @@ python tools/preprocess/raw_arxiv_to_jsonl.py --help **注意事项:** -* 下载过程请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv)。 +* 下载过程请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv)。 * 在下载、转换或处理之前,您需要确保您的硬盘空间足够大,可以存储原始数据(超过 3TB)、转换后的数据(超过 3TB)、最小处理后的数据(大约 500-600GB),以及处理期间的缓存数据。 @@ -69,7 +69,7 @@ python tools/preprocess/raw_arxiv_stackexchange_to_jsonl.py \ python tools/preprocess/raw_stackexchange_to_jsonl.py --help ``` -- `src_dir`: 如果像 Redpajama 一样下载原始 Stack Exchange 数据,你将得到一个目录 src,其中包含数百个 7z 文件,其文件名类似于 `*.*.com.7z`。 您需要解压这些文件并将 POSTs.xml 重命名为相应的压缩包名称并将其放在该目录中。更多详情请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange)。 +- `src_dir`: 如果像 Redpajama 一样下载原始 Stack Exchange 数据,你将得到一个目录 src,其中包含数百个 7z 文件,其文件名类似于 `*.*.com.7z`。 您需要解压这些文件并将 POSTs.xml 重命名为相应的压缩包名称并将其放在该目录中。更多详情请参考[这里](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange)。 - `target_dir`: 用于存储转换后的 jsonl 文件的结果目录。 - `topk` (可选): 选择内容最多的 k 个站点,默认为 28. - `num_proc` (可选): worker 进程数量,默认为 1。 diff --git a/tools/preprocess/raw_arxiv_to_jsonl.py b/tools/preprocess/raw_arxiv_to_jsonl.py index 1b1637cae..d92efd235 100644 --- a/tools/preprocess/raw_arxiv_to_jsonl.py +++ b/tools/preprocess/raw_arxiv_to_jsonl.py @@ -1,12 +1,12 @@ # Part of the code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py +# https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/arxiv_cleaner.py # -------------------------------------------------------- # # This tool is used for converting the raw arxiv data downloaded from S3 # (ref: https://info.arxiv.org/help/bulk_data_s3.html) to several jsonl files. 
# # For downloading process, please refer to: -# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/arxiv # # Notice: before you downloading, converting or processing, you might make sure # that your drive space is large enough to store the raw data (over 3TB), diff --git a/tools/preprocess/raw_stackexchange_to_jsonl.py b/tools/preprocess/raw_stackexchange_to_jsonl.py index a9f267211..ad1a0bfe4 100644 --- a/tools/preprocess/raw_stackexchange_to_jsonl.py +++ b/tools/preprocess/raw_stackexchange_to_jsonl.py @@ -1,5 +1,5 @@ # Part of the code here has been modified from: -# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange # -------------------------------------------------------- # # This tool is used for converting the raw Stack Exchange data downloaded from @@ -7,7 +7,7 @@ # jsonl files. # # For downloading process, please refer to: -# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/stack_exchange +# https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/data_prep/stack_exchange # # Notice: before you downloading, converting or processing, you might make sure # that your drive space is large enough to store the raw data (over 100GB),