
How data is sampled? #55

Closed
macabdul9 opened this issue Nov 14, 2024 · 5 comments

Comments

@macabdul9

macabdul9 commented Nov 14, 2024

I want to train a model for one pass (1 epoch) on fineweb_edu_10bt_shuffled. Should I set nSteps in the config, computed as:
nSteps = nExamples // (batch_size * nDevice * nAccumulation)
If not, how else can I ensure my model is trained on all of fineweb_edu_10bt_shuffled for exactly 1 epoch?

@mathuvu
Contributor

mathuvu commented Nov 18, 2024

When preparing the data using our setup script, download_prepare_hf_data.py, there is an nchunks parameter that splits the dataset into nchunks .jsonl files (default: 32). Each GPU reads one of these files; for example, GPU 0 reads file 0, GPU 1 reads file 1, and so on. If the number of GPUs exceeds the number of files, the file index is determined using modulo. To utilize the full dataset with the default configuration, you’ll need at least 32 GPUs. Additionally, to avoid oversampling certain files, the number of GPUs should ideally be a multiple of nchunks.
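
To make the assignment concrete, here is a minimal sketch of that rank-to-chunk mapping; the file layout and helper name are illustrative only, not the repo's actual code:

import glob

def chunk_file_for_rank(rank: int, chunk_files: list[str]) -> str:
    # Each GPU (rank) reads chunk index rank % nchunks. With nchunks=32 and,
    # say, 40 GPUs, ranks 32..39 wrap around to chunks 0..7, so those chunks
    # are oversampled relative to the others.
    return chunk_files[rank % len(chunk_files)]

# Hypothetical path pattern for the .jsonl chunks produced by download_prepare_hf_data.py
chunk_files = sorted(glob.glob("data/fineweb_edu_10bt_shuffled/*.jsonl"))
for rank in range(8):  # e.g. an 8-GPU job
    print(rank, chunk_file_for_rank(rank, chunk_files))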

Addressing your question about adding an epochs parameter: your calculation is close, but you need to account for the modulo when using more than 32 GPUs. The adjusted formula would be:
nSteps = total_tokens // (batch_size * seqlen * nDevice * nAccumulation * (nDevice / nchunks))
where total_tokens is the total number of tokens in your dataset.
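
As a quick sanity check of this formula with illustrative numbers (a ~10B-token dataset and nDevice equal to nchunks, so the modulo factor is 1):

total_tokens = 10_000_000_000  # illustrative: ~10BT dataset
batch_size = 4                 # per-GPU batch size (illustrative)
seqlen = 4096                  # sequence length (illustrative)
nDevice = 32
nAccumulation = 1
nchunks = 32                   # nDevice / nchunks == 1, no oversampling

tokens_per_step = batch_size * seqlen * nDevice * nAccumulation * (nDevice / nchunks)
nSteps = int(total_tokens // tokens_per_step)
print(nSteps)  # 19073 -> roughly one pass over the data with these settings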

This calculation assumes a single data source (as in the default configuration). Keep in mind that some chunks may contain more tokens than others, as each line in a .jsonl file corresponds to a document rather than a fixed number of tokens.

Lastly, for each data source, you can track how many times the data loader has looped over each file by examining the current_iter variable. This state is saved in the training checkpoint files (train_state_*.json).
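
If you want to check this quickly, a small script along the following lines can dig current_iter out of the checkpoint state files. The path pattern and JSON layout below are assumptions, so inspect one train_state_*.json by hand and adapt the glob/keys as needed:

import glob
import json

def find_current_iter(obj, prefix=""):
    # Recursively report any "current_iter" entries, wherever they live in the state.
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "current_iter":
                print(f"  {prefix}{key} = {value}")
            else:
                find_current_iter(value, prefix + key + ".")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            find_current_iter(value, f"{prefix}[{i}].")

for path in sorted(glob.glob("checkpoints/**/train_state_*.json", recursive=True)):
    print(path)
    with open(path) as f:
        find_current_iter(json.load(f))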

@macabdul9
Author

Thanks @mathuvu for your comment.

I have a follow-up.

So if I have nDevices << nchunks, then the number of chunks (.jsonl files) actually seen would be just nDevices, and training will loop over that subset of the chunks (== nDevices) of the data defined in the config until it completes nSteps?

data:
  root_dir: data/
  sources:
    fineweb_edu_10bt_shuffled: 1.0

If so, is the best way to ensure that the model sees all of the fineweb_edu_10bt data (or any other data) to process it into nChunks equal to nDevices in the first place? Can you please confirm? Thanks in advance.

@mathuvu
Contributor

mathuvu commented Nov 18, 2024

So if I have nDevices << nchunks, then the number of chunks (.jsonl files) actually seen would be just nDevices, and training will loop over that subset of the chunks (== nDevices) of the data defined in the config until it completes nSteps?

Yes, it will loop over a subset of the data.

If so, is the best way to ensure that the model sees all of the fineweb_edu_10bt data (or any other data) to process it into nChunks equal to nDevices in the first place?

You can set nchunks to 1 if you want to keep things simple, or set nchunks = nDevices.
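
As a quick illustration of why nchunks = 1 or nchunks = nDevices gives full, even coverage under the rank % nchunks assignment described earlier in this thread (purely illustrative):

from collections import Counter

def chunk_coverage(n_devices: int, n_chunks: int) -> Counter:
    # How many GPUs end up reading each chunk index.
    return Counter(rank % n_chunks for rank in range(n_devices))

print(chunk_coverage(n_devices=8, n_chunks=8))   # every chunk read exactly once
print(chunk_coverage(n_devices=1, n_chunks=1))   # single GPU, single chunk: whole dataset
print(chunk_coverage(n_devices=8, n_chunks=32))  # only chunks 0-7 are ever read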

@mathuvu mathuvu closed this as completed Nov 19, 2024
@akhauriyash

akhauriyash commented Nov 19, 2024

Thank you for clarifying. If I may suggest, it would be very helpful to include this explicitly in the README or setup documentation. Many researchers with fewer GPUs (like myself, on a single GPU) may miss this detail and could unintentionally perform multiple iterations over the same data. This could explain issues like the one I encountered in #52, where only one chunk was being used, leading to repeated data. Switching from fineweb to dclm resolved the problem, likely due to larger individual chunks (I estimate that the subset of dclm I was using had 13B tokens, and my tests are between 1B and 8B tokens). Explicit documentation could prevent similar oversights for others (assuming I did not miss it on my end).

Thanks!

@mathuvu
Contributor

mathuvu commented Nov 25, 2024

A paragraph has been added to the README. Thank you for your feedback!
