[torchbench] Training benchmarks failing with: tensor does not require grad #6084

ysiraichi (Collaborator) opened this issue Dec 9, 2023 · 0 comments

This issue contains two lists of training benchmarks failing with the error below on an NVIDIA A100 40GB GPU:

  • Eager-mode
  • Dynamo+openxla

These lists were put together by running the upstreamed benchmarking scripts; more specifically, the following command:

python xla/benchmarks/experiment_runner.py \
       --suite-name torchbench \
       --accelerator cuda \
       --xla PJRT --xla None \
       --dynamo openxla --dynamo None \
       --test train \
       --repeat 30 --iterations-per-run 5 \
       --no-resume
The failing benchmarks raise the following error:

Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 601, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 597, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 65, in run
    self.run_single_experiment(experiment_config, model_config)
  File "xla/benchmarks/experiment_runner.py", line 161, in run_single_experiment
    run_metrics, output = self.timed_run(benchmark_experiment,
  File "xla/benchmarks/experiment_runner.py", line 328, in timed_run
    output = loop()
  File "xla/benchmarks/experiment_runner.py", line 310, in loop
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 488, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 274, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 142, in train
    self._optimizer_zero_grad()
  File "xla/benchmarks/benchmark_model.py", line 145, in resume_in_train
    loss.backward()
  File "torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
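
For context, this is the generic autograd error raised whenever backward() is called on a tensor that was never attached to the autograd graph. The snippet below is not from the report; it is a minimal sketch that reproduces the same RuntimeError:

import torch

x = torch.randn(3)  # requires_grad defaults to False
loss = x.sum()      # no grad_fn is recorded for loss
loss.backward()     # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn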

Eager-mode

  • maml

Dynamo+openxla

  • maml
  • nvidia_deeprecommender
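
As a first diagnostic step (a suggestion, not part of the original report), one could check whether the loss computed by the harness still carries a grad_fn right before loss.backward() is called under each backend; check_loss below is a hypothetical helper:

import torch

def check_loss(loss: torch.Tensor) -> None:
    # If requires_grad is False or grad_fn is None, loss.backward()
    # will fail with the RuntimeError shown above.
    print("requires_grad:", loss.requires_grad)
    print("grad_fn:", loss.grad_fn)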