bigcode-data-mix

This repository contains scripts and files related to the data-mix for training.

1 (optional) - Generate the templates

Run the following script to generate the data templates data/train_data_paths.txt, data/valid_data_paths.txt, and data/test_data_paths.txt:

python scripts/generate_data_args.py

2 - Substitute the data path

To obtain the final files that the training script can use, set DATA_PATH and run envsubst on each template:

export DATA_PATH=/path/to/tokenized/datasets
envsubst < data/train_data_paths.txt > data/train_data_paths.txt.tmp
envsubst < data/valid_data_paths.txt > data/valid_data_paths.txt.tmp
envsubst < data/test_data_paths.txt > data/test_data_paths.txt.tmp
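
Where envsubst is not installed, the same substitution can be approximated in Python. This is a minimal sketch, not part of this repository: it assumes the templates refer to the dataset root as $DATA_PATH or ${DATA_PATH} (as the envsubst commands above imply), and the helper names substitute_data_path and render_splits are hypothetical.

```python
from pathlib import Path
from string import Template


def substitute_data_path(template_text: str, data_path: str) -> str:
    """Replace $DATA_PATH / ${DATA_PATH} in the text.

    safe_substitute leaves any other $placeholder untouched instead of raising.
    """
    return Template(template_text).safe_substitute(DATA_PATH=data_path)


def render_splits(data_dir: str, data_path: str, splits=("train", "valid", "test")) -> list:
    """Write <split>_data_paths.txt.tmp next to each template; return the output paths."""
    outputs = []
    for split in splits:
        template = Path(data_dir) / f"{split}_data_paths.txt"
        output = template.with_name(template.name + ".tmp")
        output.write_text(substitute_data_path(template.read_text(), data_path))
        outputs.append(output)
    return outputs
```

Calling render_splits("data", "/path/to/tokenized/datasets") would mirror the three envsubst commands. Unlike envsubst, which expands every exported environment variable it finds, this sketch substitutes only DATA_PATH.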

In Megatron, pass the generated files with the following arguments:

--train-weighted-split-paths-path /path/to/train_data_paths.txt.tmp \
--valid-weighted-split-paths-path /path/to/valid_data_paths.txt.tmp \
