RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #274
Comments
Hi @oroojlooy! Run `accelerate config` and make sure to select the multi-node setup!
Hi @younesbelkada! Do you think the issue is because I have two GPU nodes available on the machine? If so, does setting …
I don't think it is related to `accelerate launch`; I ran into the same issue while using GPT-2 and GPT-2-medium models.
@bingjie3216 @lvwerra @younesbelkada
I am also running into this error with reward_summarization.py using the following command: …

python 3.10.0, torch 2.0.0, transformers 4.28.1, CUDA 12. I enabled anomaly detection and it complained about line 201 of modeling_gpt2.py. Any suggestions?
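For anyone who wants to reproduce that diagnosis, below is a minimal sketch of how anomaly detection can be enabled before the backward pass; the tiny model and data are placeholders, not the actual reward_summarization.py setup.

```python
import torch

# Placeholder model/optimizer, standing in for the GPT-2 reward model used
# by reward_summarization.py.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# With anomaly detection on, the RuntimeError's traceback also points at the
# forward-pass line (e.g. in modeling_gpt2.py) that created the tensor which
# was later modified in place.
torch.autograd.set_detect_anomaly(True)

inputs = torch.randn(4, 8)
loss = model(inputs).mean()
loss.backward()  # a real in-place bug would be reported here with its origin
optimizer.step()
```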
@seirasto
It looks like we are facing the exact same issue. Are you using all the same versions of the packages? It would be great if someone could share a set that works.
I'm running it with Python 3.8.16 and CUDA 11.7. My package versions are: …
I was able to get around the bug by modifying the problematic line in modeling_gpt2.py to use clone(), so no in-place operations occur there.
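For context on why that kind of change helps, here is a self-contained toy illustration of this error class and of the clone() idea. It is not the actual modeling_gpt2.py code, just a sketch of the general pattern: autograd saves a tensor for backward, an in-place write to that saved tensor triggers the RuntimeError, and writing to a clone instead does not.

```python
import torch

x = torch.ones(3, requires_grad=True)

# torch.exp saves its output for the backward pass.
y = torch.exp(x)

# Buggy pattern: an in-place edit of the saved tensor makes backward fail with
# "one of the variables needed for gradient computation has been modified by
# an inplace operation".
#   y += 1
#   y.sum().backward()  # RuntimeError

# clone()-style workaround: do the in-place work on a copy, so the tensor that
# autograd saved stays untouched.
z = y.clone()
z += 1

(y * z).sum().backward()
print(x.grad)  # gradients flow through the original, unmodified y
```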
@seirasto Thanks for letting me know!
No, but I tried with …
I tried it on everything except …
Same error.
I don't have a clear understanding of the cause of this issue per se, but the problem stems from the fact that we run two forward passes (for …). This should work for GPT-2 and GPT-NeoX models!
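To make the two-forward-pass remark concrete, here is a hedged sketch of a pairwise reward-model objective; the "gpt2" checkpoint and the chosen/rejected variable names are assumptions for illustration, not the exact reward_summarization.py code. Both passes feed a single loss, so both forward graphs stay alive until backward, and any in-place modification of a tensor saved during the first pass (e.g. inside an attention layer) raises exactly this RuntimeError.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative pairwise reward-model loss; names and checkpoint are assumptions.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

chosen = tokenizer(["a good summary"], return_tensors="pt")
rejected = tokenizer(["a worse summary"], return_tensors="pt")

# Two forward passes through the same model...
rewards_chosen = model(**chosen).logits
rewards_rejected = model(**rejected).logits

# ...combined into one loss, so both graphs are needed by the same backward().
# If anything mutates a tensor that the first graph saved for backward,
# backward() fails with the in-place-operation RuntimeError from this issue.
loss = -torch.nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
loss.backward()
```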
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Just noting that I am also hitting the same in-place issues with the same models, and (very oddly) it only happens when using DDP. Single-GPU, single-node raises no error.
Planning to do a deep dive in the coming weeks into issues around distributed training; assigning this to myself.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi there, I believe this is now fixed on transformers, trl, and peft main. Please have a look at this comment: #835 (comment) for how to fix the issue.
I am getting the following error traceback when I run

python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16

on a machine with two A10 (24GB) GPUs. I have torch==2.0.0 installed. I would appreciate any comments or ideas on how to fix this.