This repository contains the code for preparing the training dataset for CrystalCoder, a 7B-parameter language model pre-trained on code and natural language.
The processed dataset for each phase is available at CrystalCoderDatasets. This repository contains the code for processing the dataset from scratch. We largely follow the procedure outlined in the Cerebras Model Zoo. Specifically, the data is prepared in the following steps:
- Download the untokenized SlimPajama and StarCoder data from the sources.
- Tokenize the data and concatenate documents up to the maximum sequence length. For the SlimPajama dataset, we split the tokenized files into two halves by file number: even-numbered files are used in Stage 1 and odd-numbered files in Stage 2.
- Apply Fill-In-the-Middle (FIM) augmentation on the tokenized StarCoder data.
- Shuffle data within each domain and across epochs if there are multiple epochs.
mkdir data
cd data
# SlimPajama data
# StarCoder data
git lfs install
git clone https://huggingface.co/datasets/bigcode/starcoderdata
cd ../
# Code
git clone https://github.com/Cerebras/modelzoo.git
We tokenize the SlimPajama dataset (in jsonl format) and StarCoder dataset (in parquet format) to hdf5 format. This is done using the create_hdf5_dataset.py script.
The script below is employed to divide the SlimPajama data into two equal portions based on the even or odd nature of their file numbers. The subset with even-numbered files is utilized for Stage 1, while the odd-numbered subset is designated for Stage 2.
for i in $(ls | grep train_packed | grep -v "_part[01]of2")
do
    echo "$i"
    for part in {0..1}
    do
        echo " Part $part"
        dirname="${i}_part${part}of2"
        mkdir -p "$dirname"
        pushd . >&/dev/null
        cd "$dirname"
        for h5chunk in $(ls ../"$i"/data-*.h5 | sort)
        do
            # Extract the numeric chunk id: drop the "data-" prefix with its leading zeros and the ".h5" suffix
            chunkid=$(echo "$h5chunk" | sed 's/.*data-[0]*//' | sed 's/\.h5//' | sed 's/^$/0/')
            # Link even-numbered chunks into part 0 and odd-numbered chunks into part 1
            if [ $(($chunkid % 2)) == $part ]
            then
                ln -s "$h5chunk"
            fi
        done
        popd >&/dev/null
    done
done
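For readers who prefer Python, the even/odd split performed by the shell script above can be sketched as follows. This is an illustrative equivalent, not a script shipped in this repository; the function names `split_by_parity` and `link_parts` are hypothetical, and the sketch assumes chunk files are named `data-<number>.h5` as in the script above.

```python
import re
from pathlib import Path

# Matches names such as "data-00012.h5" and captures the numeric chunk id.
CHUNK_RE = re.compile(r"data-(\d+)\.h5$")


def split_by_parity(chunk_paths):
    """Group chunk paths into even-numbered (key 0) and odd-numbered (key 1) lists."""
    parts = {0: [], 1: []}
    for path in sorted(chunk_paths):
        match = CHUNK_RE.search(Path(path).name)
        if match is None:
            continue  # skip files that do not follow the data-<id>.h5 pattern
        parts[int(match.group(1)) % 2].append(path)
    return parts


def link_parts(src_dir, dst_prefix):
    """Create <dst_prefix>_part{0,1}of2 directories holding symlinks to the chunks."""
    src = Path(src_dir).resolve()
    parts = split_by_parity(str(p) for p in src.glob("data-*.h5"))
    for part, paths in parts.items():
        out_dir = Path(f"{dst_prefix}_part{part}of2")
        out_dir.mkdir(parents=True, exist_ok=True)
        for path in paths:
            link = out_dir / Path(path).name
            # Do not overwrite links left by an earlier run.
            if not link.is_symlink() and not link.exists():
                link.symlink_to(path)
```

Under these assumptions, calling `link_parts("train_packed", "train_packed")` would produce the same `_part0of2` and `_part1of2` directories as the shell loop, with absolute rather than relative link targets.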
First, we convert the original parquet format to jsonl format.
python parquet2jsonl.py
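The conversion step can be pictured roughly as below. This is a minimal sketch, not the contents of parquet2jsonl.py: the helper names `to_jsonl_lines` and `parquet_to_jsonl` are hypothetical, and the sketch assumes pandas with a parquet engine (such as pyarrow) is installed.

```python
import json


def to_jsonl_lines(records):
    """Yield one JSON-encoded line per record (a dict of column name to value)."""
    for record in records:
        # default=str is a fallback for values json cannot encode natively.
        yield json.dumps(record, ensure_ascii=False, default=str)


def parquet_to_jsonl(parquet_path, out_path):
    """Convert a single parquet file to jsonl; returns the number of rows written."""
    import pandas as pd  # imported lazily so to_jsonl_lines works without pandas

    df = pd.read_parquet(parquet_path)
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for line in to_jsonl_lines(df.to_dict(orient="records")):
            f.write(line + "\n")
            count += 1
    return count
```

Each row becomes one JSON object per line, which is the jsonl layout the tokenization step reads.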
Next, we tokenize the StarCoder data separately for Stage 2 and Stage 3.
For Stage 2, we tokenize the jsonl files from all programming languages together:
python -B modelzoo/transformers/data_processing/scripts/hdf5_preprocessing/create_hdf5_dataset.py LMData \
--params configs/star_tokenizer_config.yaml \
--input_dir ./data/starcoderdata_jsonl --eos_id 2 --pad_id 2 \
--max_seq_length 2048 --output_dir ./data/starcoderdata_tokenized \
--seed 45 --processes 4 --split_text_to_tokenize True \
--ignore_bos_in_split_text True \
--encoder_file ./tokenizer.json
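To illustrate what the --max_seq_length, --eos_id, and --pad_id options in the command above control, here is a simplified sketch of document packing. It is not the logic of create_hdf5_dataset.py, which handles more details (for example, building input and label arrays); the function name `pack_documents` is hypothetical.

```python
def pack_documents(token_docs, max_seq_length=2048, eos_id=2, pad_id=2):
    """Concatenate tokenized documents, separated by eos_id, into fixed-length chunks."""
    stream = []
    for tokens in token_docs:
        stream.extend(tokens)
        stream.append(eos_id)  # mark the document boundary
    # Cut the token stream into consecutive windows of max_seq_length tokens.
    chunks = [
        stream[i:i + max_seq_length]
        for i in range(0, len(stream), max_seq_length)
    ]
    # Pad the final window so that every chunk has the same length.
    if chunks and len(chunks[-1]) < max_seq_length:
        chunks[-1] = chunks[-1] + [pad_id] * (max_seq_length - len(chunks[-1]))
    return chunks
```

In this sketch a document can straddle two chunks, so few positions are spent on padding; only the last chunk is padded.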
For Stage 3, we tokenize the Python, HTML, JavaScript, and CSS subfolders independently using similar scripts.
bash scripts/stage3_tokenization_script.sh
On the tokenized StarCoder dataset, we apply token-level FIM augmentation with a constant SPM rate of 0.5, using the fim_hdf5.py script from this repository. The FIM rate is 0.9 for Stage 2 and 0.3 for Stage 3. In both stages, we train on the corresponding StarCoder data for several epochs, and FIM is applied independently to each epoch. Consequently, we prepare and store the data for every epoch on disk before training begins.
python fim_hdf5_stage2.py
python fim_hdf5_stage3.py
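The token-level FIM transformation described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in fim_hdf5.py: the sentinel token ids are hypothetical placeholders (the real ids depend on the tokenizer), the SPM layout follows the convention commonly used in open-source FIM implementations, and the sketch lets the sequence grow by three tokens instead of truncating back to a fixed length.

```python
import random

# Hypothetical sentinel token ids; the real values depend on the tokenizer.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = 32000, 32001, 32002


def apply_fim(tokens, rng, fim_rate=0.9, spm_rate=0.5):
    """Rearrange a token list into FIM order with probability fim_rate."""
    tokens = list(tokens)
    if len(tokens) < 2 or rng.random() >= fim_rate:
        return tokens  # leave the document in ordinary left-to-right order
    # Pick two cut points at the token level and split into three spans.
    lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    if rng.random() < spm_rate:
        # SPM: the suffix comes first, then the prefix and the middle.
        return [FIM_PREFIX, FIM_SUFFIX] + suffix + [FIM_MIDDLE] + prefix + middle
    # PSM: prefix, then suffix, then the middle to be predicted.
    return [FIM_PREFIX] + prefix + [FIM_SUFFIX] + suffix + [FIM_MIDDLE] + middle
```

Because FIM is applied independently to each epoch, a sketch like this would be run with a different random seed per epoch, for example `apply_fim(doc, random.Random(epoch))`, so that the same document is split at different points each time.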
We shuffle and mix data from different sources and epochs for each stage, following h5_dataset_shuffle.py.
bash scripts/shuffle.sh
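Conceptually, the mixing step pools the sequences from every source and every epoch of a stage and then shuffles them. The sketch below works on in-memory lists and is only an illustration; it is not the logic of h5_dataset_shuffle.py, which operates on hdf5 files on disk, and the function name `mix_and_shuffle` is hypothetical.

```python
import random


def mix_and_shuffle(epoch_data, seed=0):
    """Pool sequences from all sources and epochs, then shuffle them.

    epoch_data maps a source name to a list of epochs, where each epoch
    is a list of sequences.
    """
    rng = random.Random(seed)
    mixed = []
    for source in sorted(epoch_data):  # fixed iteration order for reproducibility
        for sequences in epoch_data[source]:
            mixed.extend(sequences)
    rng.shuffle(mixed)
    return mixed
```

Using a fixed seed keeps the resulting order reproducible across runs.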