[torchbench] Training benchmarks failing with: tensor does not require grad #6084

ysiraichi (Collaborator) opened this issue Dec 9, 2023 · 0 comments

This issue contains two lists of training benchmarks failing with the error below on an NVIDIA A100 40GB GPU:

  • Eager-mode
  • Dynamo+openxla

These lists were put together by running the upstreamed benchmarking scripts; more specifically, the following command:

python xla/benchmarks/experiment_runner.py \
       --suite-name torchbench \
       --accelerator cuda \
       --xla PJRT --xla None \
       --dynamo openxla --dynamo None \
       --test train \
       --repeat 30 --iterations-per-run 5 \
       --no-resume
The failing benchmarks raise the following error:

Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 601, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 597, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 65, in run
    self.run_single_experiment(experiment_config, model_config)
  File "xla/benchmarks/experiment_runner.py", line 161, in run_single_experiment
    run_metrics, output = self.timed_run(benchmark_experiment,
  File "xla/benchmarks/experiment_runner.py", line 328, in timed_run
    output = loop()
  File "xla/benchmarks/experiment_runner.py", line 310, in loop
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 488, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 274, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 142, in train
    self._optimizer_zero_grad()
  File "xla/benchmarks/benchmark_model.py", line 145, in resume_in_train
    loss.backward()
  File "torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
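
For context, this is the generic autograd error raised whenever backward() is called on a tensor that was never attached to the autograd graph. The snippet below is not from the report; it is a minimal sketch that reproduces the same RuntimeError:

import torch

x = torch.randn(3)  # requires_grad defaults to False
loss = x.sum()      # no grad_fn is recorded for loss
loss.backward()     # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn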

Eager-mode

  • maml

Dynamo+openxla

  • maml
  • nvidia_deeprecommender
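
As a first diagnostic step (a suggestion, not part of the original report), one could check whether the loss computed by the harness still carries a grad_fn right before loss.backward() is called under each backend; check_loss below is a hypothetical helper:

import torch

def check_loss(loss: torch.Tensor) -> None:
    # If requires_grad is False or grad_fn is None, loss.backward()
    # will fail with the RuntimeError shown above.
    print("requires_grad:", loss.requires_grad)
    print("grad_fn:", loss.grad_fn)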