BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
Paper | Github | HuggingFace(TBU)
📖 [2024/08/28] We released our Technical Report on arXiv. Base and SFT model checkpoints of BaichuanSEED, as well as part of the pretraining data, will be released soon.
- We propose a universally applicable data processing pipeline that combines broad collection to scale up the data and reweighting (deduplication and mixture) to improve data quality.
- We train a competitive 7B LLM baseline, BaichuanSEED, from scratch on 3T tokens processed by this pipeline, followed by simple yet effective supervised fine-tuning. Our model is consistent and predictable, and achieves performance comparable to cutting-edge commercial LLMs on comprehensive benchmarks without any deliberate optimization.
The details can be found in our technical report. The pipeline mainly consists of:
- Broad Collection: collection from trusted sources at scale, mainly including web pages, high-knowledge-density data, and code.
- Reweighting: deduplication and mixture.
  - Deduplication (see the sketch below)
    - Global document-level deduplication
    - Sentence-level deduplication across documents
    - PII and harmful content filtering
  - Mixture
    - Heuristic mixture experiments
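The sketch below is only a toy illustration of the two deduplication granularities above, using exact hashing of normalized text; the actual pipeline described in the technical report may rely on different fingerprinting and thresholds, and `max_repeats` is a hypothetical parameter introduced here for illustration.

```python
import hashlib
import re
from collections import Counter

def _fingerprint(text: str) -> str:
    """Stable hash of whitespace- and case-normalized text."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup_documents(docs):
    """Global document-level deduplication: keep only the first copy of each fingerprint."""
    seen, kept = set(), []
    for doc in docs:
        h = _fingerprint(doc)
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def dedup_sentences(docs, max_repeats=2):
    """Sentence-level deduplication across documents: drop a sentence once its
    fingerprint has already appeared `max_repeats` times anywhere in the corpus."""
    counts = Counter()
    cleaned = []
    for doc in docs:
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        kept = []
        for s in sentences:
            h = _fingerprint(s)
            counts[h] += 1
            if counts[h] <= max_repeats:
                kept.append(s)
        cleaned.append(" ".join(kept))
    return cleaned

if __name__ == "__main__":
    corpus = [
        "Visit our site. LLMs learn from data.",
        "Visit our site. Transformers use attention.",
        "Visit our site. LLMs learn from data.",  # exact duplicate document
    ]
    docs = dedup_documents(corpus)          # removes the third document
    print(dedup_sentences(docs, max_repeats=1))  # drops the repeated boilerplate sentence
```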
We highlight two important attributes of LLM training, consistency and predictability:
- Consistency: the model gains uniform improvements across all evaluation benchmarks before and after SFT (upper figure).
- Predictability: the performance of later checkpoints can be forecast from the performance of earlier checkpoints (lower figure).
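As a rough illustration of predictability, one can fit a simple curve to early-checkpoint scores and extrapolate it to later checkpoints. The sketch below uses a log-linear fit in training tokens; the token counts and scores are made-up placeholder numbers, not results from the report, and the report's own forecasting analysis may use a different functional form.

```python
import numpy as np

# Hypothetical early checkpoints: tokens seen (in billions) and a benchmark score (%).
tokens = np.array([250, 500, 1000, 1500, 2000])
scores = np.array([35.1, 42.0, 48.3, 52.1, 54.8])

# Fit score ~ a * log(tokens) + b on the early checkpoints.
a, b = np.polyfit(np.log(tokens), scores, deg=1)

# Extrapolate to the full 3T-token run (3000B tokens).
predicted_final = a * np.log(3000) + b
print(f"forecast score at 3T tokens: {predicted_final:.1f}")
```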
Model | Training Tokens | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (0-shot) | C-Eval (5-shot) | MMLU-Pro (5-shot) | LiveBench (0-shot) |
---|---|---|---|---|---|---|---|
Baichuan2-7B | 2.6T | 54.65 | 56.95 | 28.95 | 56.19 | 21.65 | - |
Baichuan2-13B | 2.6T | 59.83 | 61.32 | 24.07 | 58.10 | 26.59 | - |
Qwen1.5-7B | 3T | 62.19 | 71.84 | 39.46 | 73.64 | 30.30 | - |
Llama3-8B | 15T | 66.57 | 50.68 | 26.74 | 49.89 | 35.30 | - |
OLMo-7B | 2.5T | 28.40 | 25.55 | 19.89 | 27.27 | 13.05 | - |
MAP-Neo-7B | 4.5T | 58.18 | 55.06 | 33.87 | 57.50 | 26.89 | - |
BaichuanSEED | 3T | 60.25 | 62.09 | 31.07 | 61.58 | 26.57 | - |
Baichuan2-7B-Chat | 2.6T | 54.35 | 55.36 | 35.29 | 55.09 | 25.11 | 12.89 |
Baichuan2-13B-Chat | 2.6T | 57.28 | 61.32 | 30.15 | 58.04 | 28.03 | 13.04 |
Qwen1.5-7B-Chat | 3T | 61.49 | 68.02 | 39.29 | 68.96 | 16.29 | 16.78 |
Llama3-8B-Instruct | 15T | 67.10 | 51.66 | 38.37 | 50.71 | 41.88 | 25.91 |
OLMo-7B-SFT | 2.5T | 47.49 | 35.49 | 29.12 | 35.43 | 17.99 | 8.80 |
MAP-Neo-7B-SFT | 4.5T | 58.31 | 55.24 | 37.98 | 55.58 | 30.24 | 14.35 |
BaichuanSEED-SFT | 3T | 60.15 | 60.84 | 32.62 | 59.41 | 29.63 | 18.32 |
Model | Training Tokens | MBPP (3-shot) | HumanEval (0-shot) | MATH (4-shot) | GSM8K (4-shot) | TriviaQA (0-shot) | HellaSwag (0-shot) |
---|---|---|---|---|---|---|---|
Baichuan2-7B | 2.6T | 25.40 | 17.68 | 5.94 | 25.02 | 53.73 | 67.56 |
Baichuan2-13B | 2.6T | 30.88 | 17.07 | 10.68 | 52.08 | 58.73 | 71.09 |
Qwen1.5-7B | 3T | 36.60 | 53.05 | 21.08 | 54.74 | 50.92 | 72.64 |
Llama3-8B | 15T | 44.60 | 26.22 | 13.44 | 50.11 | 65.23 | 74.54 |
OLMo-7B | 2.5T | 21.00 | 11.59 | 1.72 | 2.00 | 49.81 | 70.31 |
MAP-Neo-7B | 4.5T | 25.90 | 7.93 | 15.14 | 53.90 | 54.80 | 67.85 |
BaichuanSEED | 3T | 34.12 | 21.34 | 9.84 | 38.81 | 45.92 | 70.20 |
Baichuan2-7B-Chat | 2.6T | 22.40 | 15.24 | 8.70 | 32.37 | 44.65 | 69.18 |
Baichuan2-13B-Chat | 2.6T | 26.30 | 18.90 | 8.62 | 56.79 | 53.47 | 72.32 |
Qwen1.5-7B-Chat | 3T | 12.58 | 29.27 | 13.12 | 56.10 | 10.22 | 72.81 |
Llama3-8B-Instruct | 15T | 52.17 | 21.34 | 25.62 | 78.17 | 63.37 | 71.45 |
OLMo-7B-SFT | 2.5T | 25.16 | 19.51 | 2.52 | 17.66 | 42.87 | 72.62 |
MAP-Neo-7B-SFT | 4.5T | 33.66 | 29.27 | 30.86 | 70.28 | 53.82 | 68.48 |
BaichuanSEED-SFT | 3T | 37.60 | 23.17 | 14.06 | 53.98 | 43.92 | 73.03 |
@misc{dong2024baichuanseedsharingpotentialextensive,
title={BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline},
author={Guosheng Dong and Da Pan and Yiding Sun and Shusen Zhang and Zheng Liang and Xin Wu and Yanjun Shen and Fan Yang and Haoze Sun and Tianpeng Li and Mingan Lin and Jianhua Xu and Yufan Zhang and Xiaonan Nie and Lei Su and Bingning Wang and Wentao Zhang and Jiaxin Mao and Zenan Zhou and Weipeng Chen},
year={2024},
eprint={2408.15079},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.15079},
}