How is the data sampled? #55
Comments
When preparing the data with our setup script, download_prepare_hf_data.py, there is an nchunks parameter that splits the dataset into nchunks .jsonl files (default: 32). Each GPU reads one of these files: GPU 0 reads file 0, GPU 1 reads file 1, and so on. If the number of GPUs exceeds the number of files, the file index is determined by modulo. To use the full dataset with the default configuration, you therefore need at least 32 GPUs, and to avoid oversampling certain files the number of GPUs should ideally be a multiple of nchunks.

Regarding your question about adding an epochs parameter: your calculation is close, but the formula needs to be adjusted to account for the modulo when using more than 32 GPUs. Either way, the calculation assumes a single data source (as in the default configuration). Keep in mind that some chunks may contain more tokens than others, since each line in a .jsonl file corresponds to a document rather than a fixed number of tokens.

Lastly, for each data source, you can track how many times the data loader has looped over each file by examining the current_iter variable. This state is saved in the training checkpoint files (train_state_*.json).
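To make the rank-to-chunk mapping concrete, here is a small illustrative sketch of the modulo assignment described above (the helper names and the 40-GPU / 8-GPU examples are made up for illustration, not code from the repository):

```python
from collections import Counter

NCHUNKS = 32  # default number of .jsonl chunks produced by download_prepare_hf_data.py

def chunk_for_rank(rank: int, nchunks: int = NCHUNKS) -> int:
    # Each GPU (rank) reads exactly one chunk; ranks beyond nchunks wrap around via modulo.
    return rank % nchunks

def readers_per_chunk(n_gpus: int, nchunks: int = NCHUNKS) -> Counter:
    # Count how many GPUs end up reading each chunk.
    return Counter(chunk_for_rank(r, nchunks) for r in range(n_gpus))

# 40 GPUs with 32 chunks: chunks 0-7 are read by two GPUs each (oversampled),
# chunks 8-31 by one GPU each.
print(readers_per_chunk(40))

# 8 GPUs with 32 chunks: only chunks 0-7 are ever read; chunks 8-31 never appear.
print(sorted(readers_per_chunk(8).keys()))
```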
Thanks @mathuvu for your comment. I have a follow-up: if I have fewer GPUs than nchunks (for example a single GPU), will the data loader only ever read a subset of the chunks and loop over it?
If so, what is the best way to ensure that the model sees all of the dataset?
Yes, it will loop on a subset of the data.
You can set nchunks to match your number of GPUs (for a single GPU, nchunks=1) when preparing the data, so that every chunk is actually read.
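And to double-check this after the fact, the current_iter state mentioned above can be inspected in the train_state_*.json checkpoint files. A minimal sketch, assuming the checkpoints live under checkpoints/ and expose a current_iter field somewhere in the JSON (the exact layout may differ):

```python
import glob
import json

def find_current_iter(obj):
    """Yield every value stored under a "current_iter" key, however deeply nested."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "current_iter":
                yield value
            else:
                yield from find_current_iter(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_current_iter(item)

# Assumed checkpoint location; adjust the glob to wherever train_state_*.json is written.
for path in sorted(glob.glob("checkpoints/**/train_state_*.json", recursive=True)):
    with open(path) as f:
        state = json.load(f)
    # Values above their starting point indicate the loader has looped over its chunk.
    print(path, list(find_current_iter(state)))
```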
Thank you for clarifying. If I may suggest, it would be very helpful to state this explicitly in the README or setup documentation. Many researchers with fewer GPUs (like myself, on a single GPU) may miss this detail and could unintentionally iterate multiple times over the same subset of the dataset. This could explain issues like the one I encountered in #52, where only one chunk was being used, leading to repeated data. Switching from fineweb to dclm resolved the problem, likely because of the larger individual chunks (I estimate the dclm subset I was using had about 13B tokens, while my tests range between 1B and 8B tokens). Explicit documentation could prevent similar oversights for others (assuming I did not simply miss it on my end). Thanks!
A paragraph has been added to the README. Thank you for your feedback!
I want to train a model for one pass (1 epoch) over fineweb_edu_10bt_shuffled. Should I set nSteps in the config, computed as:
nSteps = nExamples // (batch_size * nDevice * nAccumulation)
If not, how else can I ensure my model is trained on all of fineweb_edu_10bt_shuffled for exactly one epoch?
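For a concrete sense of the magnitudes involved, here is a quick back-of-the-envelope version of that formula (all numbers below are made up for illustration, not actual statistics of fineweb_edu_10bt_shuffled):

```python
# Illustrative only: invented values, not real statistics of fineweb_edu_10bt_shuffled.
n_examples = 4_800_000       # total training sequences after tokenization (assumed)
batch_size = 4               # sequences per GPU per micro-batch
n_device = 8                 # number of GPUs
n_accumulation = 4           # gradient accumulation steps

n_steps = n_examples // (batch_size * n_device * n_accumulation)
print(n_steps)  # 37500 optimizer steps for roughly one pass, before the chunking caveats above
```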