bigcode-data-mix

This repository contains scripts and files related to the data-mix for training.

1 (optional) - Generate the templates

Run the following script to generate the data templates data/train_data_paths.txt, data/valid_data_paths.txt, and data/test_data_paths.txt:

python scripts/generate_data_args.py

2 - Substitute the data path

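To obtain the final files that can be used by the training script, run the following commands. envsubst replaces environment variable references in each template, such as DATA_PATH, with their current values: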

export DATA_PATH=/path/to/tokenized/datasets
envsubst < data/train_data_paths.txt > data/train_data_paths.txt.tmp
envsubst < data/valid_data_paths.txt > data/valid_data_paths.txt.tmp
envsubst < data/test_data_paths.txt > data/test_data_paths.txt.tmp
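
If envsubst is not available on a machine, the same substitution can be approximated in Python. The following is a minimal sketch, not part of this repository: it assumes the templates reference the dataset root as `$DATA_PATH` or `${DATA_PATH}`, and the helper names `substitute_data_path` and `render_templates` are hypothetical.

```python
from pathlib import Path
from string import Template
from typing import List


def substitute_data_path(template_text: str, data_path: str) -> str:
    """Replace $DATA_PATH / ${DATA_PATH} in template_text with data_path."""
    # Assumption: DATA_PATH is the only placeholder that needs replacing.
    # safe_substitute leaves any other $VARIABLE untouched, whereas
    # envsubst would replace an unset variable with an empty string.
    return Template(template_text).safe_substitute(DATA_PATH=data_path)


def render_templates(data_dir: str, data_path: str) -> List[Path]:
    """Write <split>_data_paths.txt.tmp next to each template in data_dir."""
    written = []
    for split in ("train", "valid", "test"):
        template = Path(data_dir) / f"{split}_data_paths.txt"
        if not template.exists():
            # Skip splits whose template has not been generated.
            continue
        output = template.with_name(template.name + ".tmp")
        output.write_text(substitute_data_path(template.read_text(), data_path))
        written.append(output)
    return written
```

Calling `render_templates("data", os.environ["DATA_PATH"])` (after `import os`) from the repository root would produce the same three `.tmp` files as the commands above, provided the assumption about the placeholder holds.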

In Megatron, pass the following arguments:

--train-weighted-split-paths-path /path/to/train_data_paths.txt.tmp \
--valid-weighted-split-paths-path /path/to/valid_data_paths.txt.tmp \