Pretraining Cuda Out of Memory Issue #1932
Hey @muniefht! Great to see you checking out torchtune. How long are the samples in your dataset? If you haven't set a maximum sequence length in your tokenizer config, you might be filling up GPU memory with quite large sequences - particularly since the model you're using has a maximum sequence length of ~131k, which we'll use if the tokenizer doesn't have a maximum sequence length set.
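For illustration, here is a minimal sketch of a tokenizer section with the sequence length capped (assuming the llama3 tokenizer component used by the llama3_1 configs; the path is a placeholder):

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/to/original/tokenizer.model   # placeholder path to your downloaded tokenizer
  max_seq_len: 2048   # truncate long samples instead of sending very long sequences to the GPU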
For reference, you can see some of our benchmarks with the model you're using on different hardware setups here, which all use a maximum sequence length of 2048. Another thing to try would be enabling sample packing through dataset.packed; if there's significant variability in the length of samples in your dataset, this can boost performance a fair bit. I'm not 100% sure on your questions about your data setup - these are some things that come immediately to mind. Let me know how you get on with these and we can dig a bit deeper if they don't help.
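As a rough sketch, packing can be turned on directly in the dataset config from this issue (note that packing generally needs tokenizer.max_seq_len to be set, since samples are packed up to that length):

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: "text"
  column: "text"
  packed: True   # pack several samples into each fixed-length sequence
  split: "train"
  data_files: "pretrain-data-batch1-quartered/*.txt"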
@muniefht, if you are using nightlies and gradient accumulation, you will have OOM issues. This PR fixed it and will land today: #1917. I would also suggest the following: if you can fit a high enough batch size without using gradient accumulation, you can also set optimizer_in_bwd=True, which saves a lot of memory. You can read more about these techniques here: https://pytorch.org/torchtune/main/tutorials/memory_optimizations.html
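A sketch of the relevant recipe-level settings (field names as they appear in the torchtune full-finetune configs; the batch size is illustrative and depends on what fits in memory):

batch_size: 2                    # raise this as far as memory allows instead of accumulating gradients
gradient_accumulation_steps: 1   # no accumulation, which also allows optimizer_in_bwd to be enabled
optimizer_in_bwd: True           # apply the optimizer step during backward so gradients are freed immediately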
@SalmanMohammadi My dataset is a collection of txt files. Some of them are quite long. I have computed the statistics of the files:
Update: |
This should have no effect on model training, since in text completion everything is just predicting the next token! I'd recommend looking at the parameter changes suggested by @felipemello1, because those will speed up training for you.
The dataset shouldn't impact GPU memory, because we don't load the whole dataset onto the GPU; we only send the batch to the GPU right before the training step. So what I think is happening is that you have some sequence that is very long, and when you try to put it on the GPU, you get out of memory (OOM). If that's the case, you don't have to manually change the dataset. You can just set tokenizer.max_seq_len=2048 (or some other number). Let me know if that makes sense. But FYI, just by using the parameters I mentioned, you can go from 70+ GiB to ~20 GiB, depending on the model size and batch size.
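For example, the cap can be passed as a command-line override without editing the YAML (a sketch assuming the standard tune CLI override syntax; the GPU count here is illustrative):

tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full tokenizer.max_seq_len=2048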
I have a device with 4 Nvidia L40 GPUs. I am trying to use the full_finetune_distributed recipe with the llama3_1/8B_full config. My dataset configuration in the config file is given below:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: "text"
  column: "text"
  packed: false
  split: "train"
  data_files: "pretrain-data-batch1-quartered/*.txt"
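For reference, the recipe is launched across the 4 GPUs roughly as follows (a sketch of the standard tune CLI invocation; other required fields such as model and checkpoint paths come from the same config):

tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full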
The data is all txt files. Initially I had planned to use 256M tokens to start the pretraining job, but I got a CUDA out-of-memory error. I have now reduced my files to a quarter of that and am still getting the same error with both full_finetune_distributed and lora_finetune_distributed.
I have also reduced my batch size to 1, still with no success.
I have the following questions in mind: