
Metadata Conditioning Accelerates Language Model Pre-training (MeCo)

[Paper] [HF Page]

This is the homepage for the paper Metadata Conditioning Accelerates Language Model Pre-training.

We propose a new pre-training method named metadata conditioning then cooldown (MeCo): it conditions pre-training texts on their metadata (such as source URLs) by prepending the metadata to the corresponding documents; at the end of training, it switches to a cooldown phase with only the standard texts, so that the model can run inference without metadata.
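Concretely, during the main phase each document is prefixed with its metadata string before tokenization. Below is a minimal illustrative sketch of this formatting (the helper function is ours, not part of the repo), assuming the "URL: " prefix, "\n\n" separator, and domain-only URLs used by the data-packing commands later in this README:

def prepend_metadata(document: str, url: str) -> str:
    # Illustrative sketch: condition the document on its source URL by
    # prepending "URL: <url>" and a blank line before the document text.
    return f"URL: {url}\n\n{document}"

print(prepend_metadata("Princeton is a town in ...", "en.wikipedia.org"))
# URL: en.wikipedia.org
#
# Princeton is a town in ...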

MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data.


Authors: Tianyu Gao (tianyug@princeton.edu), Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen


Release Progress

  • Training code
  • Checkpoints
  • Data and data preparation
  • Training readme
  • Evaluation readme

Requirements

Please install all the requirements by running pip install -r requirements.txt.

Data

Download data. All our experiments are conducted on publicly available datasets: DCLM-Baseline (our main experiment), DCLM-reproduced-RefinedWeb, and SlimPajama-C4. To make our results easy to reproduce, we have uploaded the tokenized subsets used in our experiments to AWS S3. To download the data, you need an AWS account (with an access key and a secret key). Note that downloading the data will incur a charge on your AWS account: according to this S3 document, data transfer costs $0.09 per GB, with the first 100GB free (downloading both subsets listed below, roughly 1.3TB + 443GB, therefore comes to about $150). You can download the data using the following commands:

# Install AWS CLI if you haven't already
pip install awscli

# Configure AWS CLI with your credentials (you will need an access key and a secret key from your AWS account)
aws configure

# Download the tokenized MeCo data (requester pays: the transfer is billed to your AWS account)
aws s3 sync s3://princetonpli-data/MeCo/ data/ --request-payer requester

Below is the available unpacked tokenized data (tokenized with the Llama-3 tokenizer). All data is in the mosaicml-streaming (MDS) format, with the following fields: input_ids (int32 numpy array, the Llama-3-tokenized document without BOS/EOS), url (str, the full source URL), and length (int32, the number of tokens).

Data                       Size                  S3 path
DCLM                       1.3TB (270B tokens)   s3://princetonpli-data/MeCo/DCLM-unpacked/
DCLM (for cooldown only)   443GB (89B tokens)    s3://princetonpli-data/MeCo/DCLM-cooldown-unpacked/
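As a quick sanity check, here is a minimal sketch (not part of the repo) of reading one record from the unpacked data with the mosaicml-streaming package, assuming the shards (and their index.json) were downloaded to data/DCLM-unpacked:

from streaming import StreamingDataset

# Load the local MDS shards; no remote is needed once the data is downloaded.
ds = StreamingDataset(local="data/DCLM-unpacked", shuffle=False)

sample = ds[0]
print(sample["url"])             # full source URL (str)
print(sample["length"])          # number of tokens (int32)
print(sample["input_ids"][:16])  # Llama-3 token ids, no BOS/EOS (int32 numpy array)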

Pack data. Use the tools provided in datatools/ to pack the data into fixed-length training sequences (the following examples use 40 worker processes):

# Baseline data (~1TB)
python datatools/mds_pack_data.py --source data/DCLM-unpacked  --target data/DCLM --source_type mds --domain dclm --target_lengths 8192  --num_workers 40  --strategy pack_complete 
python datatools/mds_merge.py data/DCLM

# Metadata conditioning data (~1TB)
python datatools/mds_pack_data.py --source data/DCLM-unpacked  --target data/DCLM-w-URLs --source_type mds --domain dclm-w-urls --target_lengths 8192  --num_workers 40  --strategy pack_complete --add_url --add_metadata_prefix "URL: "  --add_metadata_suffix "\n\n" --use_short_url --add_metadata_mask  
python datatools/mds_merge.py data/DCLM-w-URLs

# Cooldown data (~330GB)
python datatools/mds_pack_data.py --source data/DCLM-cooldown-unpacked  --target data/DCLM-cooldown --source_type mds --domain dclm-cooldown --target_lengths 8192  --num_workers 40  --strategy pack_complete 
python datatools/mds_merge.py data/DCLM-cooldown
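Optionally, you can spot-check the packed output. The sketch below assumes the merged result in data/DCLM is again an MDS dataset whose records carry an input_ids field of the target length (8192); adjust the field name if the packing tools use a different schema:

from streaming import StreamingDataset

packed = StreamingDataset(local="data/DCLM", shuffle=False)
print(len(packed), "packed sequences")

sample = packed[0]
assert len(sample["input_ids"]) == 8192, "expected 8192-token packed sequences"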

Training

Distributed training. We provide two distributed training launcher scripts: slurm_launcher.sh.example (for SLURM; supports multi-node) and torchrun_launcher.sh.example (single-node). First, read the scripts, add your environment setup commands, and rename them to slurm_launcher.sh (which also requires srun_launcher.sh) and torchrun_launcher.sh.

# For torchrun
NUM_GPU=8 bash torchrun_launcher.sh --run_config run_configs/dclm_1.6b_160b_baseline.yaml

# For slurm (change the script for different runtime/log path/etc.)
jobname=baseline bash slurm_launcher.sh --run_config run_configs/dclm_1.6b_160b_baseline.yaml

Run configs. We provide the following run configs in run_configs/ for easy reproduction of our results. Any config option can be overridden via command-line arguments.

  • Reproducing the baseline: dclm_1.6b_160b_baseline.yaml
  • Reproducing MeCo
    • First run dclm_1.6b_160b_meco_stage1.yaml.
    • Prepare for cooldown
    cd result/dclm_1.6b_160b_meco_stage1
    # We will use checkpoint-36000 (90% of training). Remove the data loader state here since we are switching to the cooldown data; keep everything else (scheduler, optimizer state, etc.)
    mkdir checkpoint-36000-nodatastate 
    cd checkpoint-36000-nodatastate 
    ln -s ../checkpoint-36000/* .
    rm streaming_dataset_state.json
    • Then run dclm_1.6b_160b_meco_stage2.yaml (see the launch example below).
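The cooldown run is then launched with the same pattern as above; this assumes the stage-2 config points at the checkpoint-36000-nodatastate directory prepared in the previous step:

# For torchrun (use slurm_launcher.sh analogously on SLURM)
NUM_GPU=8 bash torchrun_launcher.sh --run_config run_configs/dclm_1.6b_160b_meco_stage2.yaml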

Evaluation

Coming soon!

Downloading models

You can download the checkpoints used in our experiments from our Hugging Face collection.
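Once downloaded (or streamed directly from the Hub), the checkpoints should load with the standard transformers API. The repository ID below is a placeholder; substitute the actual model name from the collection:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "princeton-pli/MeCo-1.6B"  # placeholder name; see the HF collection for the real IDs
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))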

Citation

@article{gao2025meco,
  title={Metadata Conditioning Accelerates Language Model Pre-training},
  author={Tianyu Gao and Alexander Wettig and Luxi He and Yihe Dong and Sadhika Malladi and Danqi Chen},
  journal={arXiv preprint arXiv:2501.01956},
  year={2025}
}
