
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline

Paper | GitHub | HuggingFace (TBU)

News

📖 [2024/08/28] We released our Technical Report on arXiv. The base and SFT model checkpoints of BaichuanSEED, along with part of the pretraining data, are coming soon.

Main Contribution

  1. We propose a universally applicable data processing pipeline, consisting of broad collection to scale up the data and reweighting to deduplicate it and improve its quality.
  2. We train a competitive 7B LLM baseline, BaichuanSEED, from scratch on 3T tokens of data processed by this pipeline, followed by simple yet effective supervised fine-tuning. Our model is consistent and predictable, and achieves performance comparable to cutting-edge commercial LLMs on comprehensive benchmarks without any deliberate optimization.

Universal Data Processing Pipeline

The details can be found in our technical report. The pipeline mainly consists of:

  • Broad Collection: broad collection from trusted sources, mainly including web pages, high-knowledge-density data, and code.
  • Reweighting: deduplication and mixture
    • Deduplication (a simplified sketch follows this list)
      • Global document-level deduplication
      • Sentence-level deduplication across documents
      • PII and harmful content filtering
    • Mixture
      • Heuristic mixture experiments
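
The exact deduplication implementation is described in the technical report rather than in this README. Purely as an illustration of the two deduplication granularities listed above, the sketch below uses exact content hashing; the function names, the normalization, and the exact-match criterion are assumptions made for this example (production pipelines typically rely on near-duplicate methods such as MinHash/LSH instead).

```python
import hashlib
from typing import Iterable, List

def _fingerprint(text: str) -> str:
    # Exact-match fingerprint after light whitespace normalization;
    # real pipelines often use MinHash/LSH to also catch near-duplicates.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def dedup_documents(docs: Iterable[str]) -> List[str]:
    """Global document-level deduplication: keep the first occurrence of each document."""
    seen, kept = set(), []
    for doc in docs:
        fp = _fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept

def dedup_sentences(docs: List[str]) -> List[str]:
    """Sentence-level deduplication across documents: drop lines that already
    appeared in an earlier document (e.g. boilerplate shared by many web pages)."""
    seen, cleaned = set(), []
    for doc in docs:
        unique_lines = []
        for line in doc.split("\n"):
            fp = _fingerprint(line)
            if line.strip() and fp not in seen:
                seen.add(fp)
                unique_lines.append(line)
        cleaned.append("\n".join(unique_lines))
    return cleaned

# Toy corpus: one exact duplicate document and one repeated footer line.
corpus = ["page one\nshared footer", "page two\nshared footer", "page one\nshared footer"]
print(dedup_sentences(dedup_documents(corpus)))
```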

Evaluation

Scaling Curves

We introduce two important attributes of LLM training, consistency and predictability:

  • Consistency: the ability to gain uniform improvements across all evaluation benchmarks before and after SFT (upper figure).
  • Predictability: the ability to forecast the capabilities of later checkpoints from the performance of earlier checkpoints (lower figure); a toy extrapolation sketch follows this list.
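
The report's exact extrapolation procedure is not reproduced in this README. As a toy illustration of predictability only, the sketch below fits a simple saturating curve to hypothetical scores of early checkpoints and extrapolates to 3T tokens; the data points and the functional form are assumptions made for this example, not BaichuanSEED results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical benchmark scores of early checkpoints (training tokens in trillions).
tokens = np.array([0.5, 1.0, 1.5, 2.0])
scores = np.array([35.0, 44.0, 49.0, 52.5])  # illustrative numbers only

def saturating(t, a, b, c):
    # Simple saturating curve: the score approaches `a` as training tokens grow.
    return a - b * np.exp(-c * t)

params, _ = curve_fit(saturating, tokens, scores, p0=(60.0, 30.0, 1.0), maxfev=10000)
predicted_at_3t = saturating(3.0, *params)
print(f"Extrapolated score at 3T tokens: {predicted_at_3t:.1f}")
```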

Comprehensive Benchmarks

| Model | Training Tokens | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (0-shot) | C-Eval (5-shot) | MMLU-Pro (5-shot) | LiveBench (0-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-7B | 2.6T | 54.65 | 56.95 | 28.95 | 56.19 | 21.65 | - |
| Baichuan2-13B | 2.6T | 59.83 | 61.32 | 24.07 | 58.10 | 26.59 | - |
| Qwen1.5-7B | 3T | 62.19 | 71.84 | 39.46 | 73.64 | 30.30 | - |
| Llama3-8B | 15T | 66.57 | 50.68 | 26.74 | 49.89 | 35.30 | - |
| OLMo-7B | 2.5T | 28.40 | 25.55 | 19.89 | 27.27 | 13.05 | - |
| MAP-Neo-7B | 4.5T | 58.18 | 55.06 | 33.87 | 57.50 | 26.89 | - |
| BaichuanSEED | 3T | 60.25 | 62.09 | 31.07 | 61.58 | 26.57 | - |
| Baichuan2-7B-Chat | 2.6T | 54.35 | 55.36 | 35.29 | 55.09 | 25.11 | 12.89 |
| Baichuan2-13B-Chat | 2.6T | 57.28 | 61.32 | 30.15 | 58.04 | 28.03 | 13.04 |
| Qwen1.5-7B-Chat | 3T | 61.49 | 68.02 | 39.29 | 68.96 | 16.29 | 16.78 |
| Llama3-8B-Instruct | 15T | 67.10 | 51.66 | 38.37 | 50.71 | 41.88 | 25.91 |
| OLMo-7B-SFT | 2.5T | 47.49 | 35.49 | 29.12 | 35.43 | 17.99 | 8.80 |
| MAP-Neo-7B-SFT | 4.5T | 58.31 | 55.24 | 37.98 | 55.58 | 30.24 | 14.35 |
| BaichuanSEED-SFT | 3T | 60.15 | 60.84 | 32.62 | 59.41 | 29.63 | 18.32 |
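
The scores above are few-shot or zero-shot accuracies. This README does not specify the evaluation harness, so the snippet below is only a generic sketch of scoring a single multiple-choice item with HuggingFace Transformers by comparing answer-letter logits; the model id is a placeholder (the official HuggingFace release is still marked TBU), and only one in-context example is shown where a 5-shot prompt would prepend five.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: the official HuggingFace release is still marked TBU.
MODEL_ID = "your-org/BaichuanSEED"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# A 5-shot prompt would prepend five solved examples; one is shown here for brevity.
prompt = (
    "The following are multiple choice questions (with answers).\n\n"
    "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer: B\n\n"
    "Question: The capital of France is?\nA. Berlin\nB. Madrid\nC. Paris\nD. Rome\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token distribution after "Answer:"

# Score each answer letter and pick the most likely one.
choice_scores = {
    c: logits[tokenizer(f" {c}", add_special_tokens=False).input_ids[-1]].item()
    for c in "ABCD"
}
print(max(choice_scores, key=choice_scores.get))
```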

Downstream Tasks

| Model | Training Tokens | MBPP (3-shot) | HumanEval (0-shot) | MATH (4-shot) | GSM8K (4-shot) | TriviaQA (0-shot) | HellaSwag (0-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-7B | 2.6T | 25.40 | 17.68 | 5.94 | 25.02 | 53.73 | 67.56 |
| Baichuan2-13B | 2.6T | 30.88 | 17.07 | 10.68 | 52.08 | 58.73 | 71.09 |
| Qwen1.5-7B | 3T | 36.60 | 53.05 | 21.08 | 54.74 | 50.92 | 72.64 |
| Llama3-8B | 15T | 44.60 | 26.22 | 13.44 | 50.11 | 65.23 | 74.54 |
| OLMo-7B | 2.5T | 21.00 | 11.59 | 1.72 | 2.00 | 49.81 | 70.31 |
| MAP-Neo-7B | 4.5T | 25.90 | 7.93 | 15.14 | 53.90 | 54.80 | 67.85 |
| BaichuanSEED | 3T | 34.12 | 21.34 | 9.84 | 38.81 | 45.92 | 70.20 |
| Baichuan2-7B-Chat | 2.6T | 22.40 | 15.24 | 8.70 | 32.37 | 44.65 | 69.18 |
| Baichuan2-13B-Chat | 2.6T | 26.30 | 18.90 | 8.62 | 56.79 | 53.47 | 72.32 |
| Qwen1.5-7B-Chat | 3T | 12.58 | 29.27 | 13.12 | 56.10 | 10.22 | 72.81 |
| Llama3-8B-Instruct | 15T | 52.17 | 21.34 | 25.62 | 78.17 | 63.37 | 71.45 |
| OLMo-7B-SFT | 2.5T | 25.16 | 19.51 | 2.52 | 17.66 | 42.87 | 72.62 |
| MAP-Neo-7B-SFT | 4.5T | 33.66 | 29.27 | 30.86 | 70.28 | 53.82 | 68.48 |
| BaichuanSEED-SFT | 3T | 37.60 | 23.17 | 14.06 | 53.98 | 43.92 | 73.03 |
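
Generation tasks such as MATH and GSM8K are typically scored by producing a solution and checking the extracted final answer against the reference. The helper below is a minimal sketch of that extraction step under a common GSM8K convention (take the last number in the completion); it is not necessarily the exact rule used in the report.

```python
import re

def extract_final_number(completion: str) -> str | None:
    """Pull the last number from a chain-of-thought completion (common GSM8K convention)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

completion = "Each box has 12 pens, so 4 boxes have 4 * 12 = 48 pens. The answer is 48."
assert extract_final_number(completion) == "48"
print(extract_final_number(completion))
```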

Citation

@misc{dong2024baichuanseedsharingpotentialextensive,
      title={BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline}, 
      author={Guosheng Dong and Da Pan and Yiding Sun and Shusen Zhang and Zheng Liang and Xin Wu and Yanjun Shen and Fan Yang and Haoze Sun and Tianpeng Li and Mingan Lin and Jianhua Xu and Yufan Zhang and Xiaonan Nie and Lei Su and Bingning Wang and Wentao Zhang and Jiaxin Mao and Zenan Zhou and Weipeng Chen},
      year={2024},
      eprint={2408.15079},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.15079}, 
}
