
Pretraining Cuda Out of Memory Issue #1932

Open
muniefht opened this issue Oct 31, 2024 · 6 comments

Comments

@muniefht

I have a device containing 4 Nvidia L40 GPUs. I am trying to use the full_finetune_distributed llama3_1/8B_full recipe. My dataset configuration in the config file is given below:

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: "text"
  column: "text"
  packed: false
  split: "train"
  data_files: "pretrain-data-batch1-quartered/*.txt"

The data is all .txt files. Initially I had planned to use 256M tokens to start the pretraining job, but I got a CUDA out-of-memory error. I have now reduced my files to a quarter of that and I still get the same error, both with full_finetune_distributed and with lora_finetune_distributed.
I have also reduced my batch size to 1, still with no success.
I have the following questions in mind:

  • Is this error caused by some issue with how I have set up the data? My data is now about 5k .txt files in a single folder, with the config above in the YAML file.
  • If the files are set up properly, is it because my resources are not sufficient? How many resources would I need to pretrain using either the full or the LoRA-based recipe?
@SalmanMohammadi
Collaborator

SalmanMohammadi commented Oct 31, 2024

Hey @muniefht! Great to see you checking out torchtune.

How long are the samples in your dataset? If you haven't set a maximum sequence length in your tokenizer config, you might be filling up GPU memory with quite large sequences - particularly since the model you're using has a maximum sequence length of ~131k, which we'll use if the tokenizer doesn't have a maximum sequence length set.

How many resources would I need to pretrain using either the full or the LoRA-based recipe?

For reference, you can see some of our benchmarks with the model you're using on different hardware setups here, which all use a maximum sequence length of 2048. Another thing to try would be enabling sample packing through dataset.packed. If there's significant variability in the length of samples in your dataset, this can boost performance a fair bit.
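For illustration, a minimal sketch of those two changes in the YAML config (this assumes the stock llama3_1/8B_full layout; the tokenizer component and path below are assumptions, so adjust them to wherever you downloaded the model):

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B/original/tokenizer.model  # assumed download location
  max_seq_len: 2048   # caps per-sample length instead of the model's ~131k default

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: "text"
  column: "text"
  data_files: "pretrain-data-batch1-quartered/*.txt"
  split: "train"
  packed: True        # sample packing; requires tokenizer.max_seq_len to be set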

I'm not 100% sure on your questions about your data setup - these are some things that come immediately to mind. Let me know how you get on with these and we can dig a bit deeper if they don't help.

@felipemello1
Contributor

felipemello1 commented Oct 31, 2024

@muniefht, if you are using nightlies and gradient accumulation, you will have OOM issues. This PR fixed it and will land today: #1917

I would also suggest the following (a combined sketch of these settings is shown after the link below):
dataset.packed=True # improves speed greatly, and you won't have memory spikes, because the max_seq_len will be fixed. Requires setting tokenizer.max_seq_len=X
compile=True # speed and memory
activation_checkpointing=True # saves a lot of memory, but it's slower
activation_offloading=True # saves a lot of memory, but can be a bit slower. Sometimes it isn't.

If you can fit a large enough batch size without using gradient accumulation, you can also set optimizer_in_bwd=True, which saves a lot of memory.

you can read more about these techniques here: https://pytorch.org/torchtune/main/tutorials/memory_optimizations.html
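As a rough sketch, those flags in the recipe YAML might look like this (the enable_ prefixes and optimizer_in_bwd are the field names recent full_finetune_distributed configs use; check your installed config, since names can vary slightly across torchtune versions):

compile: True                           # torch.compile the model for speed and memory
enable_activation_checkpointing: True   # saves a lot of memory, but slower
enable_activation_offloading: True      # saves memory; sometimes only slightly slower
gradient_accumulation_steps: 1          # keep at 1 if you enable optimizer_in_bwd
optimizer_in_bwd: True                  # runs the optimizer step in the backward pass; saves memory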

@muniefht
Author

muniefht commented Nov 1, 2024

@SalmanMohammadi My dataset is a collection of .txt files, and some of them are quite long. I have computed statistics over the files:

  • Maximum file length: 768221
  • Minimum file length: 6
  • Average file length: 12169.60

I think the larger files are causing the trouble. Maybe I need to split the content of the bigger files into multiple sub-files, since each file is treated as a single sample/row/sequence and its length is what causes the problem?

@muniefht
Author

muniefht commented Nov 1, 2024

Update:
I split the content of the text files longer than 2048 into multiple sub-files, and training has started. No other parameters were changed. My question is: will this impact model performance? The data, as you know, is just raw unstructured .txt files.

@joecummings
Contributor

This should have no effect on model training since in text completion everything is just predicting the next token!

I'd recommend looking at the parameter changes suggested by @felipemello1 b/c those will speed up training for you.

@felipemello1
Contributor

The dataset shouldn't impact GPU memory, because we don't load the whole dataset onto the GPU; we only send each batch to the GPU right before the training step. So what I think is happening is that you have some sequences that are very long, and when you try to put one of those on the GPU, you get out of memory (OOM).

If that's the case, then you don't have to manually change the dataset. What you can do is just set tokenizer.max_seq_len=2048 (or some other number).

Let me know if that makes sense. But FYI, just by using the parameters I mentioned, you can go from 70+ GiB to ~20 GiB, depending on the model size and batch size.
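For example, applying that override from the command line (a sketch assuming the stock llama3_1/8B_full config name and your 4 GPUs; the other flags can be appended the same way):

tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full \
  tokenizer.max_seq_len=2048 dataset.packed=True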
