
[torchbench] Inductor failing on training #6988

Closed
ysiraichi opened this issue Apr 28, 2024 · 5 comments

@ysiraichi (Collaborator)

🐛 Bug

Using the upstream benchmarking script, inductor training (all models) has been failing for me for a while. I tried creating a fresh Docker environment, but the error did not go away. Is anyone else seeing this?

Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 945, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 941, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 61, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 256, in run_single_config
    metrics, last_output = self.run_once_and_gather_metrics(
  File "xla/benchmarks/experiment_runner.py", line 345, in run_once_and_gather_metrics
    output, _ = loop(iter_fn=self._default_iter_fn)
  File "xla/benchmarks/experiment_runner.py", line 302, in loop
    output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
  File "xla/benchmarks/experiment_runner.py", line 218, in _default_iter_fn
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 400, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 156, in train
    self._optimizer_zero_grad()
  File "torch/_dynamo/convert_frame.py", line 978, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "torch/_dynamo/convert_frame.py", line 818, in _convert_frame
    result = inner_convert(
  File "torch/_dynamo/convert_frame.py", line 411, in _convert_frame_assert
    return _compile(
  File "torch/_utils_internal.py", line 70, in wrapper_function
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/convert_frame.py", line 700, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 568, in compile_inner
    out_code = transform_code_object(code, transform)
  File "torch/_dynamo/bytecode_transformation.py", line 1116, in transform_code_object
    transformations(instructions, code_options)
  File "torch/_dynamo/convert_frame.py", line 173, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 515, in transform
    tracer.run()
  File "torch/_dynamo/symbolic_convert.py", line 2237, in run
    super().run()
  File "torch/_dynamo/symbolic_convert.py", line 875, in run
    while self.step():
  File "torch/_dynamo/symbolic_convert.py", line 790, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "torch/_dynamo/symbolic_convert.py", line 490, in wrapper
    return handle_graph_break(self, inst, speculation.reason)
  File "torch/_dynamo/symbolic_convert.py", line 559, in handle_graph_break
    self.output.compile_subgraph(self, reason=reason)
  File "torch/_dynamo/output_graph.py", line 1075, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/usr/local/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/output_graph.py", line 1264, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/output_graph.py", line 1331, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "torch/_dynamo/output_graph.py", line 1312, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "torch/_dynamo/repro/after_dynamo.py", line 127, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "torch/__init__.py", line 1742, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/usr/local/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "torch/_inductor/compile_fx.py", line 1398, in compile_fx
    return aot_autograd(
  File "torch/_dynamo/backends/common.py", line 65, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 958, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 685, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 469, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 671, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 282, in aot_dispatch_autograd
    fw_module, bw_module = aot_config.partition_fn(
  File "torch/_inductor/compile_fx.py", line 1339, in partition_fn
    return min_cut_rematerialization_partition(
  File "torch/_functorch/partitioners.py", line 715, in min_cut_rematerialization_partition
    import networkx as nx
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/__init__.py", line 19, in <module>
    from networkx import utils
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/utils/__init__.py", line 7, in <module>
    from networkx.utils.backends import *
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/utils/backends.py", line 258, in <module>
    backends = _get_backends("networkx.backends")
  File "/lib/python3.8/site-packages/networkx-3.3-py3.8.egg/networkx/utils/backends.py", line 234, in _get_backends
    items = entry_points(group=group)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: entry_points() got an unexpected keyword argument 'group'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
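
The failing call is `importlib.metadata.entry_points()`: networkx 3.3 invokes it as `entry_points(group=group)`, but the `group` keyword only exists on Python 3.10+; on 3.8/3.9 the function takes no arguments and returns a dict keyed by group name, hence the `TypeError`. A minimal version-portable sketch of the lookup (the helper name `backend_entry_points` is mine, not from networkx or the benchmark scripts):

```python
import sys
from importlib.metadata import entry_points

def backend_entry_points(group):
    """Version-portable wrapper around importlib.metadata.entry_points().

    The ``group`` keyword was only added in Python 3.10; on 3.8/3.9 the
    no-argument form returns a dict mapping group names to entry points.
    """
    if sys.version_info >= (3, 10):
        return list(entry_points(group=group))
    return list(entry_points().get(group, []))

# "console_scripts" is registered by most installed packages, so this
# group is usually non-empty in any environment with pip installed.
eps = backend_entry_points("console_scripts")
```

A shim like this is how several libraries stayed compatible with 3.8; networkx 3.3 instead dropped the old interpreters outright.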

To Reproduce

I'm running the following command line:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --test train --xla PJRT --dynamo openxla \
    --repeat 30 --iterations-per-run 1

Environment

cc @miladm @JackCaoG @vanbasten23 @frgossen @cota @golechwierowicz

@ysiraichi (Collaborator, Author)

I haven't tried running it with the PyTorch benchmarking script yet. I will do so next week.

@miladm (Collaborator)

miladm commented Apr 29, 2024

@zpcore, have you run into this issue on torchbench auto stack?

@zpcore (Collaborator)

zpcore commented Apr 29, 2024

Yes. I checked our dashboard, and most of the tests are failing today (v5p successes dropped from 56 to 11).

For the GPU (e.g., H100) run, though, I haven't seen the failure yet, even though that run hasn't completed.

@ysiraichi (Collaborator, Author)

Yes. Interestingly, I did not encounter this issue on L4, only on A100.

@ysiraichi (Collaborator, Author)

@zpcore It looks like it had something to do with: networkx/networkx#7028
Basically, they removed support for Python 3.9 and earlier, and I was still using 3.8. Moving to Python 3.10 fixed this issue.
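
In other words, the constraint reduces to a simple interpreter-version check, assuming (per the linked PR) that networkx 3.3 requires Python >= 3.10. The helper name below is hypothetical, for illustration only:

```python
import sys

def can_import_networkx_33(version_info=sys.version_info):
    # networkx 3.3 declares a minimum of Python 3.10, which is why its
    # entry_points(group=...) call blows up on a Python 3.8 interpreter.
    return tuple(version_info[:2]) >= (3, 10)
```

A guard like this early in environment setup would surface the mismatch immediately, instead of deep inside the Inductor partitioner where the `import networkx` happens.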
