
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

Paper | GitHub | HuggingFace (TBU)

News

📖 [2024/08/28] We released our Technical Report on arXiv. The base and SFT model checkpoints of BaichuanSEED, along with part of the pretraining data, are coming soon.

Main Contribution

  1. We propose a universally applicable data processing pipeline, consisting of broad collection to scale up the data and reweighting to deduplicate it and improve its quality.
  2. We train a competitive 7B LLM baseline, BaichuanSEED, from scratch on 3T tokens of data processed by this pipeline, followed by simple yet effective supervised fine-tuning. Our model is consistent and predictable, and achieves performance comparable to cutting-edge commercial LLMs on comprehensive benchmarks without any deliberate optimization.

Universal Data Processing Pipeline

The details can be found in our technical report. The pipeline mainly consists of:

  • Broad Collection: broad collection from trusted sources, mainly including web pages, high-knowledge-density data, and code.
  • Reweighting: deduplication and mixture
    • Deduplication (a simplified sketch follows this list)
      • Global document-level deduplication
      • Sentence-level deduplication across documents
      • PII and harmful content filtering
    • Mixture
      • Heuristic mixture experiments
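
The exact deduplication implementation is described in the technical report rather than in this README. Purely as an illustration of the two deduplication granularities listed above, the sketch below uses exact content hashing; the function names, the normalization, and the exact-match criterion are assumptions made for this example (production pipelines typically rely on near-duplicate methods such as MinHash/LSH instead).

```python
import hashlib
from typing import Iterable, List

def _fingerprint(text: str) -> str:
    # Exact-match fingerprint after light whitespace normalization;
    # real pipelines often use MinHash/LSH to also catch near-duplicates.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def dedup_documents(docs: Iterable[str]) -> List[str]:
    """Global document-level deduplication: keep the first occurrence of each document."""
    seen, kept = set(), []
    for doc in docs:
        fp = _fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept

def dedup_sentences(docs: List[str]) -> List[str]:
    """Sentence-level deduplication across documents: drop lines that already
    appeared in an earlier document (e.g. boilerplate shared by many web pages)."""
    seen, cleaned = set(), []
    for doc in docs:
        unique_lines = []
        for line in doc.split("\n"):
            fp = _fingerprint(line)
            if line.strip() and fp not in seen:
                seen.add(fp)
                unique_lines.append(line)
        cleaned.append("\n".join(unique_lines))
    return cleaned

# Toy corpus: one exact duplicate document and one repeated footer line.
corpus = ["page one\nshared footer", "page two\nshared footer", "page one\nshared footer"]
print(dedup_sentences(dedup_documents(corpus)))
```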

Evaluation

Scaling Curves

We introduce two important attributes of LLM training, consistency and predictability:

  • Consistency: the ability to gain uniform improvements across all evaluation benchmarks before and after SFT (upper figure).
  • Predictability: the ability to forecast the capabilities of later checkpoints from the performance of earlier checkpoints (lower figure); a toy extrapolation sketch follows this list.
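
The report's exact extrapolation procedure is not reproduced in this README. As a toy illustration of predictability only, the sketch below fits a simple saturating curve to hypothetical scores of early checkpoints and extrapolates to 3T tokens; the data points and the functional form are assumptions made for this example, not BaichuanSEED results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical benchmark scores of early checkpoints (training tokens in trillions).
tokens = np.array([0.5, 1.0, 1.5, 2.0])
scores = np.array([35.0, 44.0, 49.0, 52.5])  # illustrative numbers only

def saturating(t, a, b, c):
    # Simple saturating curve: the score approaches `a` as training tokens grow.
    return a - b * np.exp(-c * t)

params, _ = curve_fit(saturating, tokens, scores, p0=(60.0, 30.0, 1.0), maxfev=10000)
predicted_at_3t = saturating(3.0, *params)
print(f"Extrapolated score at 3T tokens: {predicted_at_3t:.1f}")
```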

Comprehensive Benchmarks

| Model | Training Tokens | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (0-shot) | C-Eval (5-shot) | MMLU-Pro (5-shot) | LiveBench (0-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-7B | 2.6T | 54.65 | 56.95 | 28.95 | 56.19 | 21.65 | - |
| Baichuan2-13B | 2.6T | 59.83 | 61.32 | 24.07 | 58.10 | 26.59 | - |
| Qwen1.5-7B | 3T | 62.19 | 71.84 | 39.46 | 73.64 | 30.30 | - |
| Llama3-8B | 15T | 66.57 | 50.68 | 26.74 | 49.89 | 35.30 | - |
| OLMo-7B | 2.5T | 28.40 | 25.55 | 19.89 | 27.27 | 13.05 | - |
| MAP-Neo-7B | 4.5T | 58.18 | 55.06 | 33.87 | 57.50 | 26.89 | - |
| BaichuanSEED | 3T | 60.25 | 62.09 | 31.07 | 61.58 | 26.57 | - |
| Baichuan2-7B-Chat | 2.6T | 54.35 | 55.36 | 35.29 | 55.09 | 25.11 | 12.89 |
| Baichuan2-13B-Chat | 2.6T | 57.28 | 61.32 | 30.15 | 58.04 | 28.03 | 13.04 |
| Qwen1.5-7B-Chat | 3T | 61.49 | 68.02 | 39.29 | 68.96 | 16.29 | 16.78 |
| Llama3-8B-Instruct | 15T | 67.10 | 51.66 | 38.37 | 50.71 | 41.88 | 25.91 |
| OLMo-7B-SFT | 2.5T | 47.49 | 35.49 | 29.12 | 35.43 | 17.99 | 8.80 |
| MAP-Neo-7B-SFT | 4.5T | 58.31 | 55.24 | 37.98 | 55.58 | 30.24 | 14.35 |
| BaichuanSEED-SFT | 3T | 60.15 | 60.84 | 32.62 | 59.41 | 29.63 | 18.32 |
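
The scores above are few-shot or zero-shot accuracies. This README does not specify the evaluation harness, so the snippet below is only a generic sketch of scoring a single multiple-choice item with HuggingFace Transformers by comparing answer-letter logits; the model id is a placeholder (the official HuggingFace release is still marked TBU), and only one in-context example is shown where a 5-shot prompt would prepend five.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: the official HuggingFace release is still marked TBU.
MODEL_ID = "your-org/BaichuanSEED"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# A 5-shot prompt would prepend five solved examples; one is shown here for brevity.
prompt = (
    "The following are multiple choice questions (with answers).\n\n"
    "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer: B\n\n"
    "Question: The capital of France is?\nA. Berlin\nB. Madrid\nC. Paris\nD. Rome\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token distribution after "Answer:"

# Score each answer letter and pick the most likely one.
choice_scores = {
    c: logits[tokenizer(f" {c}", add_special_tokens=False).input_ids[-1]].item()
    for c in "ABCD"
}
print(max(choice_scores, key=choice_scores.get))
```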

Downstream Tasks

| Model | Training Tokens | MBPP (3-shot) | HumanEval (0-shot) | MATH (4-shot) | GSM8K (4-shot) | TriviaQA (0-shot) | HellaSwag (0-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-7B | 2.6T | 25.40 | 17.68 | 5.94 | 25.02 | 53.73 | 67.56 |
| Baichuan2-13B | 2.6T | 30.88 | 17.07 | 10.68 | 52.08 | 58.73 | 71.09 |
| Qwen1.5-7B | 3T | 36.60 | 53.05 | 21.08 | 54.74 | 50.92 | 72.64 |
| Llama3-8B | 15T | 44.60 | 26.22 | 13.44 | 50.11 | 65.23 | 74.54 |
| OLMo-7B | 2.5T | 21.00 | 11.59 | 1.72 | 2.00 | 49.81 | 70.31 |
| MAP-Neo-7B | 4.5T | 25.90 | 7.93 | 15.14 | 53.90 | 54.80 | 67.85 |
| BaichuanSEED | 3T | 34.12 | 21.34 | 9.84 | 38.81 | 45.92 | 70.20 |
| Baichuan2-7B-Chat | 2.6T | 22.40 | 15.24 | 8.70 | 32.37 | 44.65 | 69.18 |
| Baichuan2-13B-Chat | 2.6T | 26.30 | 18.90 | 8.62 | 56.79 | 53.47 | 72.32 |
| Qwen1.5-7B-Chat | 3T | 12.58 | 29.27 | 13.12 | 56.10 | 10.22 | 72.81 |
| Llama3-8B-Instruct | 15T | 52.17 | 21.34 | 25.62 | 78.17 | 63.37 | 71.45 |
| OLMo-7B-SFT | 2.5T | 25.16 | 19.51 | 2.52 | 17.66 | 42.87 | 72.62 |
| MAP-Neo-7B-SFT | 4.5T | 33.66 | 29.27 | 30.86 | 70.28 | 53.82 | 68.48 |
| BaichuanSEED-SFT | 3T | 37.60 | 23.17 | 14.06 | 53.98 | 43.92 | 73.03 |
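
Generation tasks such as MATH and GSM8K are typically scored by producing a solution and checking the extracted final answer against the reference. The helper below is a minimal sketch of that extraction step under a common GSM8K convention (take the last number in the completion); it is not necessarily the exact rule used in the report.

```python
import re

def extract_final_number(completion: str) -> str | None:
    """Pull the last number from a chain-of-thought completion (common GSM8K convention)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

completion = "Each box has 12 pens, so 4 boxes have 4 * 12 = 48 pens. The answer is 48."
assert extract_final_number(completion) == "48"
print(extract_final_number(completion))
```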

Citation

@misc{dong2024baichuanseedsharingpotentialextensive,
      title={BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline}, 
      author={Guosheng Dong and Da Pan and Yiding Sun and Shusen Zhang and Zheng Liang and Xin Wu and Yanjun Shen and Fan Yang and Haoze Sun and Tianpeng Li and Mingan Lin and Jianhua Xu and Yufan Zhang and Xiaonan Nie and Lei Su and Bingning Wang and Wentao Zhang and Jiaxin Mao and Zenan Zhou and Weipeng Chen},
      year={2024},
      eprint={2408.15079},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.15079}, 
}
