Reward modelling example throws RuntimeError: Expected to mark a variable ready only once. when gradient_checkpointing=True #831

Closed
lewtun opened this issue Oct 4, 2023 · 5 comments

Comments

@lewtun
Member

lewtun commented Oct 4, 2023

Running the reward_trainer.py example on multiple GPUs with gradient checkpointing throws the following error:

Traceback (most recent call last):
  File "/fsx/lewis/git/trl/examples/scripts/reward_trainer.py", line 169, in <module>
    trainer.train()
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2787, in training_step
    self.accelerator.backward(loss)
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 1985, in backward
    loss.backward(**kwargs)
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/fsx/lewis/miniconda/envs/trl/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 386 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

To reproduce run:

ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml examples/scripts/reward_trainer.py 

github-actions bot commented Nov 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@younesbelkada
Contributor

It is now fixed on transformers + peft + trl main; you just need to pass gradient_checkpointing_kwargs={"use_reentrant": False}
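For reference, a minimal sketch of where that argument goes (this assumes a transformers release recent enough to support gradient_checkpointing_kwargs; the output_dir and batch size are just placeholders):

```python
# Minimal sketch: enable non-reentrant gradient checkpointing via TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="reward_model",      # placeholder
    per_device_train_batch_size=4,  # placeholder
    gradient_checkpointing=True,
    # The non-reentrant checkpointing implementation works with DDP, so each
    # parameter is only marked ready once per backward pass.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```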

@wei-ann-Github

It is now fixed on transformers + peft + trl main; you just need to pass gradient_checkpointing_kwargs={"use_reentrant": False}

Hi, where do we pass this argument? I am facing this issue when using SFTTrainer.

@younesbelkada
Contributor

Hi @wei-ann-Github
Pass that argument to TrainingArguments. Note, however, that you need the latest transformers: pip install -U transformers
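For example, a rough sketch with SFTTrainer (the model name and dataset are placeholders, and the SFTTrainer keyword arguments follow the trl API that was current around the time of this thread; in newer trl releases some of them moved to SFTConfig):

```python
# Rough sketch: the kwargs live on TrainingArguments, which is then passed
# to SFTTrainer via args=. The model name and dataset are placeholders.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset

training_args = TrainingArguments(
    output_dir="sft_model",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

trainer = SFTTrainer(
    model="facebook/opt-350m",   # placeholder model
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",   # column holding the raw text
)
trainer.train()
```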

@cxjtju

cxjtju commented Apr 28, 2024

Hi @wei-ann-Github Pass that argument to TrainingArguments. Note, however, that you need the latest transformers: pip install -U transformers

What's the required transformers version? With transformers == 4.32.0, I encountered dataclasses.FrozenInstanceError: cannot assign to field gradient_checkpointing_kwargs
