RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation #274

Closed
oroojlooy opened this issue Apr 4, 2023 · 19 comments

@oroojlooy
Contributor

I am getting the following error traceback when I run python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16 on a machine with two A10 GPUs (24 GB each). I have torch==2.0.0 installed.

I would appreciate any comments or ideas on how to fix this.

Traceback (most recent call last):
  File "/home/opc/trl/examples/summarization/scripts/reward_summarization.py", line 202, in <module>
    trainer.train(script_args.resume_from_checkpoint)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2663, in training_step
    loss.backward()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABoolType [1, 1, 377, 377]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/opc/trl/examples/summarization/scripts/wandb/offline-run-20230404_175237-0r3498mc
wandb: Find logs at: ./wandb/offline-run-20230404_175237-0r3498mc/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1902146) of binary: /home/opc/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opc/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/opc/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
reward_summarization.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-04_17:52:47
  host      : instance-20230329-1307.subnet03291319.vcn03291319.oraclevcn.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1902146)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@younesbelkada
Contributor

Hi @oroojlooy !
Thanks for the issue! I think you should run the script with accelerate launch instead. First run:

accelerate config

And make sure to select the multi-node setup!
cc @lvwerra, who has some experience with multi-node training using trl.
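
For reference, a minimal sketch of that flow (the script name and --bf16 flag are carried over from the original command; everything else is just the standard accelerate CLI):

accelerate config
accelerate launch reward_summarization.py --bf16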

@oroojlooy
Contributor Author

Hi @younesbelkada!
I am not using both GPUs, so I was not sure whether I needed accelerate launch. I am getting the error with --nproc_per_node=1.
Also, I got the run command from the README of the corresponding example in the TRL package.

Do you think the issue is because I have two GPUs available on the machine? If so, would setting CUDA_VISIBLE_DEVICES help?
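
For example (just a sketch; the GPU index 0 is an assumption), restricting the run to a single device would look like:

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16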

@bingjie3216

I don't think it is related to accelerate launch; I ran into the same issue while using the GPT2 and GPT2-medium models.

@oroojlooy
Contributor Author

@bingjie3216 @lvwerra @younesbelkada
Would you mind sharing the versions of the Python packages (torch, accelerate, deepspeed, transformers, etc.) with which the TRL examples work for you?

@seirasto

I am also running into this error with reward_summarization.py using the following command:

python -m torch.distributed.run --nproc_per_node=1 /dccstor/srosent2/trl/trl/examples/summarization/scripts/reward_summarization.py --bf16

python=3.10.0, torch=2.0.0, transformers=4.28.1, cuda 12

I enabled anomaly detection and it complained about this line in modeling_gpt2.py

line 201: attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)

Any suggestions?
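
In case it helps others reproduce the report, this is roughly how anomaly detection can be switched on (placing it near the top of reward_summarization.py, before trainer.train() is called, is an assumption on my part):

import torch

# Autograd will then name the forward op whose saved tensor was modified in place,
# instead of only raising the generic error during backward().
torch.autograd.set_detect_anomaly(True)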

@oroojlooy
Contributor Author

@seirasto torch.autograd.set_detect_anomaly(True) points to the same line for me as well.

@seirasto

It looks like we are facing the exact same issue. Are you using the same package versions? It would be great if someone could share a set that works.

@oroojlooy
Contributor Author

I am running it with Python 3.8.16 and CUDA 11.7. My package versions are:

  • transformers 4.28.1
  • torch 2.0.0
  • trl 0.4.2.dev0

@seirasto

I was able to get around the bug by modifying the problematic line in modeling_gpt2.py to use clone(), so that no in-place operations occur there:

attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
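
In case anyone wants to apply the same edit, here is a small sketch (not part of the original workaround) for locating the installed copy of modeling_gpt2.py to patch:

import transformers.models.gpt2.modeling_gpt2 as gpt2_modeling

# Prints the path of the file containing the torch.where line (around line 201 in transformers 4.28.1).
print(gpt2_modeling.__file__)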

@oroojlooy
Contributor Author

@seirasto Thanks for letting me know!
Do you have any intuition as to why the clone() on causal_mask is required? It has no direct relationship with attn_weights, so it seems it should not affect the gradient there.

@seirasto

No, but I tried clone() on just attn_weights and it didn't work. I haven't tried clone() on just causal_mask.

@oroojlooy
Contributor Author

I tried it on everything except causal_mask and it did not work. That is why I asked about the intuition behind it.

@dayL-W

dayL-W commented May 16, 2023

Same error.

@oliu-io

oliu-io commented May 25, 2023

I don't have a clear understanding of the root cause of this issue per se, but the problem stems from the fact that we run two forward passes (for rewards_j and rewards_k respectively) to compute the loss function, and somehow the GPT models don't like that. Here's a minimal workaround that doesn't involve making changes to transformers.models:

  • Replace the current RewardDataCollatorWithPadding with the following. We merge the two batches into one.
@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: AutoTokenizer
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        merged_features = []
        for feature in features:
            merged_features.append(
                {
                    "input_ids": feature["input_ids_j"],
                    "attention_mask": feature["attention_mask_j"],
                }
            )
            merged_features.append(
                {
                    "input_ids": feature["input_ids_k"],
                    "attention_mask": feature["attention_mask_k"],
                }
            )
        batch = self.tokenizer.pad(
            merged_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch = {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
            "return_loss": True,
        }
        return batch
  • Replace the current compute_loss with the following. We split the model predictions back into rewards_j and rewards_k after a single forward pass and compute the loss.
class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        rewards = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )[0]
        bsz = rewards.size(0)
        jidx = torch.arange(0, bsz, 2)
        kidx = jidx + 1
        rewards_j = rewards[jidx]
        rewards_k = rewards[kidx]
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss

This should work for GPT-2 and GPT-NeoX models!
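
For completeness, wiring the two pieces together follows the usual Trainer pattern; the argument names below (train_dataset, eval_dataset, max_length=512) are assumptions for illustration, not a verbatim copy of reward_summarization.py:

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=RewardDataCollatorWithPadding(tokenizer=tokenizer, max_length=512),
)
trainer.train()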

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@garrett361

Just noting that I am also hitting the same in-place issues with the same models, and (very oddly) it only happens when using DDP. Single-GPU, single-node runs raise no error.

@younesbelkada younesbelkada reopened this Oct 10, 2023
@younesbelkada younesbelkada self-assigned this Oct 10, 2023
@younesbelkada
Contributor

Planning to do a deep dive into distributed-training issues in the next few weeks; assigning this to myself.


github-actions bot commented Nov 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@younesbelkada
Contributor

Hi there, I believe this is now fixed on the main branches of transformers, trl, and peft. Please have a look at this comment for how to fix the issue: #835 (comment)

@lvwerra lvwerra closed this as completed Nov 24, 2023