
[torchbench] moco inference and training fail on inductor. #6367

Closed
ysiraichi opened this issue Jan 23, 2024 · 3 comments

@ysiraichi (Collaborator)
🐛 Bug

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda --repeat 2 \
    --test eval --test train --xla None --dynamo inductor \
    -k moco
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 906, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 902, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 59, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 239, in run_single_config
    benchmark_model = self.model_loader.load_model(model_config,
  File "xla/benchmarks/benchmark_model.py", line 56, in load_model
    benchmark_model.set_up()
  File "xla/benchmarks/torchbench_model.py", line 216, in set_up
    benchmark = self.load_benchmark()
  File "xla/benchmarks/torchbench_model.py", line 261, in load_benchmark
    return benchmark_cls(
  File "torchbenchmark/util/model.py", line 24, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "torchbenchmark/models/moco/__init__.py", line 68, in __init__
    self.model = torch.nn.parallel.DistributedDataParallel(
  File "torch/nn/parallel/distributed.py", line 731, in __init__
    self.process_group = _get_default_group()
  File "torch/distributed/distributed_c10d.py", line 985, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Environment

  • Reproducible on XLA backend [CPU/TPU]: CUDA
  • torch_xla version: a7918a7

@miladm @JackCaoG

@ysiraichi (Collaborator, Author)

Fixed by #6375

@vanbasten23 (Collaborator)

I wonder how #6375 fixes this error: ValueError: Default process group has not been initialized, please make sure to call init_process_group.

@ysiraichi (Collaborator, Author)

My guess is: every time moco is instantiated, it creates a distributed process group. Before that PR, we were instantiating it twice (once in the main process, and again in the process that was actually executing it).
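For context on the error itself: DistributedDataParallel looks up the default process group in its constructor, so the group must be initialized first. Below is a minimal single-process sketch of that requirement; the gloo backend, address, and port here are illustrative choices, not moco's actual setup.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP's __init__ calls _get_default_group(), which raises the ValueError
# seen above unless init_process_group() has already run in this process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # Single-process "world" on CPU, using the gloo backend.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)  # succeeds now that the default group exists

dist.destroy_process_group()
```

Initializing the group a second time in the same process raises an error, which is consistent with the guess above: loading moco twice (and thus calling init_process_group twice) would fail.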
