
[torchbench] Inductor failing on training #6988

Closed
ysiraichi opened this issue Apr 28, 2024 · 5 comments

@ysiraichi (Collaborator)

🐛 Bug

Using the upstream benchmarking script, inductor training (all models) has been failing for me for a while. I tried creating a fresh Docker environment, but the error did not go away. Is anyone else seeing this?

Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 945, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 941, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 61, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 256, in run_single_config
    metrics, last_output = self.run_once_and_gather_metrics(
  File "xla/benchmarks/experiment_runner.py", line 345, in run_once_and_gather_metrics
    output, _ = loop(iter_fn=self._default_iter_fn)
  File "xla/benchmarks/experiment_runner.py", line 302, in loop
    output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
  File "xla/benchmarks/experiment_runner.py", line 218, in _default_iter_fn
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 400, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 156, in train
    self._optimizer_zero_grad()
  File "torch/_dynamo/convert_frame.py", line 978, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "torch/_dynamo/convert_frame.py", line 818, in _convert_frame
    result = inner_convert(
  File "torch/_dynamo/convert_frame.py", line 411, in _convert_frame_assert
    return _compile(
  File "torch/_utils_internal.py", line 70, in wrapper_function
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/convert_frame.py", line 700, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 568, in compile_inner
    out_code = transform_code_object(code, transform)
  File "torch/_dynamo/bytecode_transformation.py", line 1116, in transform_code_object
    transformations(instructions, code_options)
  File "torch/_dynamo/convert_frame.py", line 173, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 515, in transform
    tracer.run()
  File "torch/_dynamo/symbolic_convert.py", line 2237, in run
    super().run()
  File "torch/_dynamo/symbolic_convert.py", line 875, in run
    while self.step():
  File "torch/_dynamo/symbolic_convert.py", line 790, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "torch/_dynamo/symbolic_convert.py", line 490, in wrapper
    return handle_graph_break(self, inst, speculation.reason)
  File "torch/_dynamo/symbolic_convert.py", line 559, in handle_graph_break
    self.output.compile_subgraph(self, reason=reason)
  File "torch/_dynamo/output_graph.py", line 1075, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/usr/local/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/output_graph.py", line 1264, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/output_graph.py", line 1331, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "torch/_dynamo/output_graph.py", line 1312, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "torch/_dynamo/repro/after_dynamo.py", line 127, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "torch/__init__.py", line 1742, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/usr/local/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "torch/_inductor/compile_fx.py", line 1398, in compile_fx
    return aot_autograd(
  File "torch/_dynamo/backends/common.py", line 65, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 958, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 685, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 469, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 671, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 282, in aot_dispatch_autograd
    fw_module, bw_module = aot_config.partition_fn(
  File "torch/_inductor/compile_fx.py", line 1339, in partition_fn
    return min_cut_rematerialization_partition(
  File "torch/_functorch/partitioners.py", line 715, in min_cut_rematerialization_partition
    import networkx as nx
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/__init__.py", line 19, in <module>
    from networkx import utils
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/utils/__init__.py", line 7, in <module>
    from networkx.utils.backends import *
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/utils/backends.py", line 258, in <module>
    backends = _get_backends("networkx.backends")
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/utils/backends.py", line 234, in _get_backends
    items = entry_points(group=group)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: entry_points() got an unexpected keyword argument 'group'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
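
The failing call is `importlib.metadata.entry_points()`: networkx 3.3 invokes it as `entry_points(group=group)`, but the `group` keyword only exists on Python 3.10+; on 3.8/3.9 the function takes no arguments and returns a dict keyed by group name, hence the `TypeError`. A minimal version-portable sketch of the lookup (the helper name `backend_entry_points` is mine, not from networkx or the benchmark scripts):

```python
import sys
from importlib.metadata import entry_points

def backend_entry_points(group):
    """Version-portable wrapper around importlib.metadata.entry_points().

    The ``group`` keyword was only added in Python 3.10; on 3.8/3.9 the
    no-argument form returns a dict mapping group names to entry points.
    """
    if sys.version_info >= (3, 10):
        return list(entry_points(group=group))
    return list(entry_points().get(group, []))

# "console_scripts" is registered by most installed packages, so this
# group is usually non-empty in any environment with pip installed.
eps = backend_entry_points("console_scripts")
```

A shim like this is how several libraries stayed compatible with 3.8; networkx 3.3 instead dropped the old interpreters outright.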

To Reproduce

I'm running the following command line:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --test train --xla PJRT --dynamo openxla \
    --repeat 30 --iterations-per-run 1

Environment

cc @miladm @JackCaoG @vanbasten23 @frgossen @cota @golechwierowicz

@ysiraichi (Collaborator, Author)

I haven't tried running it with the PyTorch benchmarking script yet. I will do so next week.

@miladm (Collaborator)

miladm commented Apr 29, 2024

@zpcore, have you run into this issue on torchbench auto stack?

@zpcore (Collaborator)

zpcore commented Apr 29, 2024

Yes. I checked our dashboard, and most of the tests are failing today (v5p successes dropped from 56 to 11).

For the GPU (e.g., H100) run, though, I haven't seen the failure yet, even though that run hasn't completed.

@ysiraichi (Collaborator, Author)

Yes. Interestingly, I did not encounter this issue on L4, only on A100.

@ysiraichi (Collaborator, Author)

@zpcore It looks like it had something to do with: networkx/networkx#7028
Basically, they removed support for Python 3.9 and earlier, and I was still using 3.8. Moving to Python 3.10 fixed this issue.
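
In other words, the constraint reduces to a simple interpreter-version check, assuming (per the linked PR) that networkx 3.3 requires Python >= 3.10. The helper name below is hypothetical, for illustration only:

```python
import sys

def can_import_networkx_33(version_info=sys.version_info):
    # networkx 3.3 declares a minimum of Python 3.10, which is why its
    # entry_points(group=...) call blows up on a Python 3.8 interpreter.
    return tuple(version_info[:2]) >= (3, 10)
```

A guard like this early in environment setup would surface the mismatch immediately, instead of deep inside the Inductor partitioner where the `import networkx` happens.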
