Data Preparation Code for LLM360 K2-65B

This repository contains the data preparation code for K2-65B, a 65-billion-parameter large language model from LLM360.

Note

This repository is under active development. If you have suggestions or find bugs, please open a GitHub issue or reach out.

Prepare Raw Data Chunks

python prepare_jsonl_chunks.py
python prepare_pile_of_law_chunks.py
python prepare_redpajama_chunks.py
python prepare_starcoder_chunks.py
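
For readers unfamiliar with this kind of step, here is a minimal sketch of splitting a stream of JSON records into fixed-size JSONL chunk files. It is an illustration only, not the code in the scripts above: the function names, the `chunk_size` parameter, and the `chunk_00000.jsonl` naming scheme are all assumptions.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_chunks(records: Iterable[dict], chunk_size: int) -> Iterator[list]:
    """Yield lists of at most `chunk_size` records, preserving input order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


def write_jsonl_chunks(records: Iterable[dict], out_dir: str, chunk_size: int) -> list:
    """Write each chunk to its own JSONL file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(iter_chunks(records, chunk_size)):
        path = out / f"chunk_{i:05d}.jsonl"  # hypothetical naming scheme
        with path.open("w", encoding="utf-8") as f:
            for record in chunk:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths
```

Because `iter_chunks` is a generator, the input never has to fit in memory at once; only one chunk is held at a time.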

Tokenize

python tokenize_datasets.py

Create FIM Data for StarCoder

python starcoder_fim_main.py --spm_rate 0.
python starcoder_fim_main.py --spm_rate 1.
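
The two commands differ only in `--spm_rate`. In fill-in-the-middle (FIM) training, a document is cut into a prefix, middle, and suffix and then rearranged; the SPM rate is usually the probability of emitting the suffix-prefix-middle ordering rather than prefix-suffix-middle, so `0.` would give all PSM and `1.` all SPM. The sketch below shows that idea at the character level. It is an assumption-laden illustration: the sentinel strings follow the StarCoder tokenizer, but whether `starcoder_fim_main.py` uses this exact SPM layout, or splits on characters rather than tokens, is not confirmed by this README.

```python
import random

# Sentinel strings as found in the StarCoder tokenizer; their use here is an assumption.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def fim_transform(text: str, spm_rate: float, rng: random.Random) -> str:
    """Rearrange `text` into PSM or SPM order; SPM is chosen with probability `spm_rate`."""
    # Pick two cut points uniformly at random and split into three spans.
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    if rng.random() < spm_rate:
        # SPM: suffix first, then prefix and middle back to back.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}{middle}"
    # PSM: prefix, then suffix, then the middle to be predicted.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(fim_transform("def add(a, b):\n    return a + b\n", 0.0, rng))  # always PSM
    print(fim_transform("def add(a, b):\n    return a + b\n", 1.0, rng))  # always SPM
```

Since `rng.random()` returns values in `[0, 1)`, a rate of `0.0` never selects SPM and a rate of `1.0` always does, which matches the two invocations above under this reading of the flag.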

Gather into 360 chunks

python gather.py
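
One simple way to spread many items across a fixed number of chunks is round-robin assignment, sketched below. This only illustrates the general idea of gathering into 360 chunks; the actual policy in `gather.py` may differ, and `assign_round_robin` is a hypothetical helper.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def assign_round_robin(items: Iterable[T], num_chunks: int = 360) -> List[List[T]]:
    """Distribute items over `num_chunks` buckets so bucket sizes differ by at most one."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    buckets: List[List[T]] = [[] for _ in range(num_chunks)]
    for i, item in enumerate(items):
        buckets[i % num_chunks].append(item)
    return buckets
```

Round-robin keeps the chunks balanced by item count; balancing by token count instead would need a size-aware assignment.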

Shuffle

python shuffle.py
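
For reproducible data pipelines, shuffling is typically done with a fixed seed so the same order can be regenerated later. A minimal sketch, assuming samples fit in memory; `shuffled` and its `seed` argument are hypothetical and not taken from `shuffle.py`:

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def shuffled(samples: Iterable[T], seed: int) -> List[T]:
    """Return a new list with the samples in a seed-determined random order."""
    out = list(samples)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(out)  # private RNG, does not affect global state
    return out
```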

Print Data Mix

python analyze.py
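
A data mix is usually reported as each source's share of the total token count. The sketch below shows that calculation; the function names are hypothetical, the source names and counts in the usage example are invented placeholders, and the output format of `analyze.py` is not documented here.

```python
from typing import Dict


def data_mix(token_counts: Dict[str, int]) -> Dict[str, float]:
    """Return each source's fraction of the total token count."""
    total = sum(token_counts.values())
    if total == 0:
        raise ValueError("token counts sum to zero")
    return {name: count / total for name, count in token_counts.items()}


def format_mix(token_counts: Dict[str, int]) -> str:
    """Render one line per source, largest first, with count and percentage."""
    mix = data_mix(token_counts)
    rows = sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(f"{name:<12}{count:>12,}{mix[name]:>8.1%}" for name, count in rows)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not the real K2 mix.
    print(format_mix({"source_a": 3_000, "source_b": 1_000}))
```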
