
[torchbench] moco inference and training fail on inductor. #6367

Closed
ysiraichi opened this issue Jan 23, 2024 · 3 comments

@ysiraichi (Collaborator)
🐛 Bug

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda --repeat 2 \
    --test eval --test train --xla None --dynamo inductor \
    -k moco
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 906, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 902, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 59, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 239, in run_single_config
    benchmark_model = self.model_loader.load_model(model_config,
  File "xla/benchmarks/benchmark_model.py", line 56, in load_model
    benchmark_model.set_up()
  File "xla/benchmarks/torchbench_model.py", line 216, in set_up
    benchmark = self.load_benchmark()
  File "xla/benchmarks/torchbench_model.py", line 261, in load_benchmark
    return benchmark_cls(
  File "torchbenchmark/util/model.py", line 24, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "torchbenchmark/models/moco/__init__.py", line 68, in __init__
    self.model = torch.nn.parallel.DistributedDataParallel(
  File "torch/nn/parallel/distributed.py", line 731, in __init__
    self.process_group = _get_default_group()
  File "torch/distributed/distributed_c10d.py", line 985, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Environment

  • Reproducible on XLA backend [CPU/TPU]: CUDA
  • torch_xla version: a7918a7

@miladm @JackCaoG

@ysiraichi (Collaborator, Author)

Fixed by #6375

@vanbasten23 (Collaborator)

I wonder how #6375 fixes this error: ValueError: Default process group has not been initialized, please make sure to call init_process_group.

@ysiraichi (Collaborator, Author)

My guess is: every time moco is instantiated, it creates a distributed process group. Before that PR, we were instantiating it twice (once in the main process, and again in the process that was actually executing it).
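For context on the error itself: DistributedDataParallel looks up the default process group in its constructor, so the group must be initialized first. Below is a minimal single-process sketch of that requirement; the gloo backend, address, and port here are illustrative choices, not moco's actual setup.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP's __init__ calls _get_default_group(), which raises the ValueError
# seen above unless init_process_group() has already run in this process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # Single-process "world" on CPU, using the gloo backend.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)  # succeeds now that the default group exists

dist.destroy_process_group()
```

Initializing the group a second time in the same process raises an error, which is consistent with the guess above: loading moco twice (and thus calling init_process_group twice) would fail.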
