
How data is sampled? #55

Closed
macabdul9 opened this issue Nov 14, 2024 · 5 comments

Comments

@macabdul9

macabdul9 commented Nov 14, 2024

I want to train a model for one pass (1 epoch) on fineweb_edu_10bt_shuffled. Should I set nSteps in the config, computed as:
nSteps = nExamples // (batch_size * nDevice * nAccumulation)
If not, how else can I ensure my model is trained on all of fineweb_edu_10bt_shuffled for exactly 1 epoch?

@mathuvu
Contributor

mathuvu commented Nov 18, 2024

When preparing the data using our setup script, download_prepare_hf_data.py, there is an nchunks parameter that splits the dataset into nchunks .jsonl files (default: 32). Each GPU reads one of these files; for example, GPU 0 reads file 0, GPU 1 reads file 1, and so on. If the number of GPUs exceeds the number of files, the file index is determined using modulo. To utilize the full dataset with the default configuration, you’ll need at least 32 GPUs. Additionally, to avoid oversampling certain files, the number of GPUs should ideally be a multiple of nchunks.
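
To make the assignment concrete, here is a minimal sketch of that rank-to-chunk mapping; the file layout and helper name are illustrative only, not the repo's actual code:

import glob

def chunk_file_for_rank(rank: int, chunk_files: list[str]) -> str:
    # Each GPU (rank) reads chunk index rank % nchunks. With nchunks=32 and,
    # say, 40 GPUs, ranks 32..39 wrap around to chunks 0..7, so those chunks
    # are oversampled relative to the others.
    return chunk_files[rank % len(chunk_files)]

# Hypothetical path pattern for the .jsonl chunks produced by download_prepare_hf_data.py
chunk_files = sorted(glob.glob("data/fineweb_edu_10bt_shuffled/*.jsonl"))
for rank in range(8):  # e.g. an 8-GPU job
    print(rank, chunk_file_for_rank(rank, chunk_files))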

Addressing your question about adding an epochs parameter: your calculation is close, but you need to account for the modulo when using more than 32 GPUs. The adjusted formula would be:
nSteps = total_tokens // (batch_size * seqlen * nDevice * nAccumulation * (nDevice / nchunks))
where total_tokens is the total number of tokens in your dataset.
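
As a quick sanity check of this formula with illustrative numbers (a ~10B-token dataset and nDevice equal to nchunks, so the modulo factor is 1):

total_tokens = 10_000_000_000  # illustrative: ~10BT dataset
batch_size = 4                 # per-GPU batch size (illustrative)
seqlen = 4096                  # sequence length (illustrative)
nDevice = 32
nAccumulation = 1
nchunks = 32                   # nDevice / nchunks == 1, no oversampling

tokens_per_step = batch_size * seqlen * nDevice * nAccumulation * (nDevice / nchunks)
nSteps = int(total_tokens // tokens_per_step)
print(nSteps)  # 19073 -> roughly one pass over the data with these settings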

This calculation assumes a single data source (as in the default configuration). Keep in mind that some chunks may contain more tokens than others, as each line in a .jsonl file corresponds to a document rather than a fixed number of tokens.

Lastly, for each data source, you can track how many times the data loader has looped over each file by examining the current_iter variable. This state is saved in the training checkpoint files (train_state_*.json).
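
If you want to check this quickly, a small script along the following lines can dig current_iter out of the checkpoint state files. The path pattern and JSON layout below are assumptions, so inspect one train_state_*.json by hand and adapt the glob/keys as needed:

import glob
import json

def find_current_iter(obj, prefix=""):
    # Recursively report any "current_iter" entries, wherever they live in the state.
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "current_iter":
                print(f"  {prefix}{key} = {value}")
            else:
                find_current_iter(value, prefix + key + ".")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            find_current_iter(value, f"{prefix}[{i}].")

for path in sorted(glob.glob("checkpoints/**/train_state_*.json", recursive=True)):
    print(path)
    with open(path) as f:
        find_current_iter(json.load(f))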

@macabdul9
Author

Thanks @mathuvu for your comment.

I have a follow-up.

So if I have nDevices << nchunks, then the number of chunks (.jsonl files) actually seen would be just nDevices, and training will loop over that subset of the chunks (== nDevices) of the data defined in the config until it completes nSteps?

data:
  root_dir: data/
  sources:
    fineweb_edu_10bt_shuffled: 1.0

If so, is the best way to ensure that the model sees all of the fineweb_edu_10bt data (or any other data) to process it into nChunks equal to nDevices in the first place? Can you please confirm? Thanks in advance.

@mathuvu
Contributor

mathuvu commented Nov 18, 2024

So if I have nDevices << nchunks, then the number of chunks (.jsonl files) actually seen would be just nDevices, and training will loop over that subset of the chunks (== nDevices) of the data defined in the config until it completes nSteps?

Yes, it will loop over a subset of the data.

If so, is the best way to ensure that the model sees all of the fineweb_edu_10bt data (or any other data) to process it into nChunks equal to nDevices in the first place?

You can set nchunks to 1 if you want to keep things simple, or set nchunks = nDevices.
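
As a quick illustration of why nchunks = 1 or nchunks = nDevices gives full, even coverage under the rank % nchunks assignment described earlier in this thread (purely illustrative):

from collections import Counter

def chunk_coverage(n_devices: int, n_chunks: int) -> Counter:
    # How many GPUs end up reading each chunk index.
    return Counter(rank % n_chunks for rank in range(n_devices))

print(chunk_coverage(n_devices=8, n_chunks=8))   # every chunk read exactly once
print(chunk_coverage(n_devices=1, n_chunks=1))   # single GPU, single chunk: whole dataset
print(chunk_coverage(n_devices=8, n_chunks=32))  # only chunks 0-7 are ever read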

@mathuvu mathuvu closed this as completed Nov 19, 2024
@akhauriyash

akhauriyash commented Nov 19, 2024

Thank you for clarifying. If I may suggest, it would be very helpful to include this explicitly in the README or setup documentation. Many researchers with fewer GPUs (like myself, on a single GPU) may miss this detail and could unintentionally perform multiple iterations over the same data. This could explain issues like the one I encountered in #52, where only one chunk was being used, leading to repeated data. Switching from fineweb to dclm resolved the problem, likely due to larger individual chunks (I estimate that the subset of dclm I was using had 13B tokens, and my tests are between 1B and 8B tokens). Explicit documentation could prevent similar oversights for others (assuming I did not miss it on my end).

Thanks!

@mathuvu
Contributor

mathuvu commented Nov 25, 2024

A paragraph has been added to the README. Thank you for your feedback!
