RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #274

Closed
oroojlooy opened this issue Apr 4, 2023 · 19 comments

@oroojlooy
Contributor

I am getting the following error traceback when I run python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16 on a machine with two A10 GPUs (24 GB each). I have torch==2.0.0 installed.

I would appreciate any comments or ideas on how to fix this.

Traceback (most recent call last):
  File "/home/opc/trl/examples/summarization/scripts/reward_summarization.py", line 202, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2663, in training_step
    loss.backward()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 377, 377]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/opc/trl/examples/summarization/scripts/wandb/offline-run-20230404_175237-0r3498mc
wandb: Find logs at: ./wandb/offline-run-20230404_175237-0r3498mc/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1902146) of binary: /home/opc/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
reward_summarization.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-04_17:52:47
  host      : instance-20230329-1307.subnet03291319.vcn03291319.oraclevcn.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1902146)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@younesbelkada
Contributor

Hi @oroojlooy !
Thanks for the issue! I think you should run the script with accelerate launch instead. First run:

accelerate config

And make sure to select the multi-node setup!
cc @lvwerra, who has some experience with multi-node training using trl.
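
For reference, a minimal sketch of that flow (the script name and --bf16 flag are carried over from the original command; everything else is just the standard accelerate CLI):

accelerate config
accelerate launch reward_summarization.py --bf16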

@oroojlooy
Contributor Author

Hi @younesbelkada!
I am not using both GPUs, so I was not sure whether I needed accelerate launch. I am getting the error with --nproc_per_node=1.
Also, I got the run command from the README of the corresponding example in the TRL package.

Do you think the issue is because I have two GPUs available on the machine? If so, would setting CUDA_VISIBLE_DEVICES help?
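
For example (just a sketch; the GPU index 0 is an assumption), restricting the run to a single device would look like:

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16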

@bingjie3216

I don't think it is related to accelerate launch; I ran into the same issue while using the GPT2 and GPT2-medium models.

@oroojlooy
Contributor Author

@bingjie3216 @lvwerra @younesbelkada
Would you mind sharing the versions of the Python packages (torch, accelerate, deepspeed, transformers, etc.) with which the TRL examples work for you?

@seirasto

I am also running into this error with reward_summarization.py using the following command:

python -m torch.distributed.run --nproc_per_node=1 /dccstor/srosent2/trl/trl/examples/summarization/scripts/reward_summarization.py --bf16

python=3.10.0, torch=2.0.0, transformers=4.28.1, cuda 12

I enabled anomaly detection and it complained about this line in modeling_gpt2.py

line 201: attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)

Any suggestions?
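
In case it helps others reproduce the report, this is roughly how anomaly detection can be switched on (placing it near the top of reward_summarization.py, before trainer.train() is called, is an assumption on my part):

import torch

# Autograd will then name the forward op whose saved tensor was modified in place,
# instead of only raising the generic error during backward().
torch.autograd.set_detect_anomaly(True)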

@oroojlooy
Contributor Author

@seirasto torch.autograd.set_detect_anomaly(True) points to the same line for me as well.

@seirasto

It looks like we are facing the exact same issue. Are you using the same package versions? It would be great if someone could share a set that works.

@oroojlooy
Contributor Author

I am running it with Python 3.8.16 and CUDA 11.7. My package versions are:

  • transformers 4.28.1
  • torch 2.0.0
  • trl 0.4.2.dev0

@seirasto

I was able to get around the bug by modifying the problematic line in modeling_gpt2.py to use clone(), so that no in-place operations occur there:

attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
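
In case anyone wants to apply the same edit, here is a small sketch (not part of the original workaround) for locating the installed copy of modeling_gpt2.py to patch:

import transformers.models.gpt2.modeling_gpt2 as gpt2_modeling

# Prints the path of the file containing the torch.where line (around line 201 in transformers 4.28.1).
print(gpt2_modeling.__file__)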

@oroojlooy
Contributor Author

@seirasto Thanks for letting me know!
Do you have any intuition as to why the clone() on causal_mask is required? It has no direct relationship with attn_weights, so it seems it should not affect the gradient there.

@seirasto

No, but I tried clone() on just attn_weights and it didn't work. I haven't tried clone() on just causal_mask.

@oroojlooy
Contributor Author

I tried it on everything except causal_mask and it did not work. That is why I asked about the intuition behind it.

@dayL-W

dayL-W commented May 16, 2023

Same error.

@oliu-io

oliu-io commented May 25, 2023

I don't have a clear understanding of the root cause of this issue per se, but the problem stems from the fact that we run two forward passes (for rewards_j and rewards_k respectively) to compute the loss function, and somehow the GPT models don't like that. Here's a minimal workaround that doesn't involve making changes to transformers.models:

  • Replace the current RewardDataCollatorWithPadding with the following. We merge the two batches into one.
@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: AutoTokenizer
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        merged_features = []
        for feature in features:
            merged_features.append(
                {
                    "input_ids": feature["input_ids_j"],
                    "attention_mask": feature["attention_mask_j"],
                }
            )
            merged_features.append(
                {
                    "input_ids": feature["input_ids_k"],
                    "attention_mask": feature["attention_mask_k"],
                }
            )
        batch = self.tokenizer.pad(
            merged_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch = {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
            "return_loss": True,
        }
        return batch
  • Replace the current compute_loss with the following. We split the model predictions back into rewards_j and rewards_k after a single forward pass and compute the loss.
class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        rewards = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )[0]
        bsz = rewards.size(0)
        jidx = torch.arange(0, bsz, 2)
        kidx = jidx + 1
        rewards_j = rewards[jidx]
        rewards_k = rewards[kidx]
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss

This should work for GPT-2 and GPT-NeoX models!
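
For completeness, wiring the two pieces together follows the usual Trainer pattern; the argument names below (train_dataset, eval_dataset, max_length=512) are assumptions for illustration, not a verbatim copy of reward_summarization.py:

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer, max_length=512),
)
trainer.train()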

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@garrett361

Just noting that I am also hitting the same in-place issues with the same models, and (very oddly) it only happens when using DDP. Single-GPU, single-node runs raise no error.

@younesbelkada younesbelkada reopened this Oct 10, 2023
@younesbelkada younesbelkada self-assigned this Oct 10, 2023
@younesbelkada
Contributor

Planning to do a deep dive into distributed-training issues in the next few weeks; assigning this to myself.


github-actions bot commented Nov 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@younesbelkada
Contributor

Hi there, I believe this is now fixed on the main branches of transformers, trl, and peft. Please have a look at this comment for how to fix the issue: #835 (comment)

@lvwerra lvwerra closed this as completed Nov 24, 2023