RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #285
Comments
The same issue happens if I try to run the Stack-LLaMA reward model Python code.
To reproduce:
Details:
I got the same problem here.
Same here.
Hi everyone: run accelerate config and follow the instructions, then: accelerate launch reward_summarization.py
Same here.
You can check out my workaround here: #274 (comment)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
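The hint at the end of the RuntimeError suggests enabling autograd anomaly detection to locate the offending in-place operation. A minimal sketch of where that flag goes (the model and loss below are placeholders for illustration only, not the actual reward_summarization.py code):

```python
import torch

# The RuntimeError's own hint: with anomaly detection on, backward() reports
# the forward-pass operation whose result was later modified in place.
torch.autograd.set_detect_anomaly(True)

# Placeholder model and loss just to show where the flag is set; in the real
# script this call would precede trainer.train().
model = torch.nn.Linear(8, 1)
loss = model(torch.randn(4, 8)).sum()
loss.backward()
```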
Reproduce:
torchrun reward_summarization.py
Details:
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/jbing-gpu4/code/Users/jbing/code/trl/examples/summarization/scripts/reward_summarization.py", line 202, in
trainer.train(script_args.resume_from_checkpoint)
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/transformers/trainer.py", line 2663, in training_step
loss.backward()
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 448, 448]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
0%| | 0/7255 [00:04<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20477) of binary: /anaconda/envs/rlhf/bin/python
Traceback (most recent call last):
File "/anaconda/envs/rlhf/bin/torchrun", line 8, in
sys.exit(main())
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/anaconda/envs/rlhf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
reward_summarization.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-04-09_04:00:55
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 20477)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
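Separately, the reducer.cpp warning in the log above comes from DDP being constructed with find_unused_parameters=True even though no parameters end up unused in the forward pass. If you only want to follow that warning's advice, transformers exposes the flag on TrainingArguments. A hedged sketch, assuming the script uses the standard Trainer API (this removes the extra autograd-graph traversal the warning mentions; it is not necessarily a fix for the in-place RuntimeError):

```python
from transformers import TrainingArguments

# Turn off DDP's unused-parameter search, as the reducer warning suggests.
# output_dir and batch size here are illustrative placeholders.
training_args = TrainingArguments(
    output_dir="reward_model_output",
    per_device_train_batch_size=1,
    ddp_find_unused_parameters=False,  # silences the reducer.cpp warning
)
```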