This repository contains the data preparation code for K2-65B, a 65-billion-parameter large language model from LLM360.
Note: This repository is under active development. If you have suggestions or find bugs, please open a GitHub issue or reach out.
python prepare_jsonl_chunks.py
python prepare_pile_of_law_chunks.py
python prepare_redpajama_chunks.py
python prepare_starcoder_chunks.py
python tokenize_datasets.py
python starcoder_fim_main.py --spm_rate 0.
python starcoder_fim_main.py --spm_rate 1.
python gather.py
python shuffle.py
python analyze.py
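The commands above can also be chained from one small driver. This is a minimal sketch, not part of the repository. It assumes that the scripts are meant to run in the order listed, that they are invoked from the repository root, and that they need no arguments beyond those shown. The names `PIPELINE_STEPS` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Script invocations copied from the command list in this README.
# ASSUMPTION: running them in this order is correct; the README does not say so.
PIPELINE_STEPS = [
    ["prepare_jsonl_chunks.py"],
    ["prepare_pile_of_law_chunks.py"],
    ["prepare_redpajama_chunks.py"],
    ["prepare_starcoder_chunks.py"],
    ["tokenize_datasets.py"],
    ["starcoder_fim_main.py", "--spm_rate", "0."],
    ["starcoder_fim_main.py", "--spm_rate", "1."],
    ["gather.py"],
    ["shuffle.py"],
    ["analyze.py"],
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run, python=sys.executable):
    """Run each step in order and stop at the first non-zero exit code.

    `runner` is injectable so the sequencing can be checked without
    launching the real scripts. Returns the argv lists that completed.
    """
    completed = []
    for step in steps:
        argv = [python, *step]
        print("running:", " ".join(argv))
        result = runner(argv)
        if result.returncode != 0:
            # Abort so later steps never consume partial output.
            raise RuntimeError(
                f"step failed with exit code {result.returncode}: {' '.join(step)}"
            )
        completed.append(argv)
    return completed
```

Calling `run_pipeline()` from the repository root would launch every step with the current Python interpreter. Passing a shorter `steps` list resumes from a later stage after a failure.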