test_trace_and_metrics fails if PyTorch has CUDA support. #6292

Closed
ysiraichi opened this issue Jan 11, 2024 · 1 comment · Fixed by #6302

ysiraichi commented Jan 11, 2024

🐛 Bug

PJRT_DEVICE=CUDA python test/test_profiler.py

The command above fails with:

2024-01-11 02:59:59.706960: I torch_xla/csrc/runtime/pjrt_computation_client.cc:167] Initializing PjRt GPU client...
2024-01-11 02:59:59.707055: I torch_xla/csrc/runtime/pjrt_computation_client.cc:200] Getting StreamExecutorGpuClient for node_id=0, num_nodes=1
2024-01-11 02:59:59.730538: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2024-01-11 02:59:59.730584: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:137] retrieving CUDA diagnostic information for host: qgpu3
2024-01-11 02:59:59.730600: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:144] hostname: qgpu3
2024-01-11 02:59:59.730688: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:168] libcuda reported version is: 530.30.2
2024-01-11 02:59:59.730731: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:172] kernel reported version is: 530.30.2
2024-01-11 02:59:59.730745: I external/xla/xla/stream_executor/cuda/cuda_diagnostics.cc:253] kernel version seems to match DSO: 530.30.2
Process Process-33:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "test/test_profiler.py", line 74, in train_worker
    test_profile_mp_mnist.train_mnist(
  File "xla/test/test_profile_mp_mnist.py", line 102, in train_mnist
    sample_count=600000 // flags.batch_size // xm.xrt_world_size())
  File "xla/torch_xla/core/xla_model.py", line 127, in xrt_world_size
    return runtime.world_size()
  File "xla/torch_xla/runtime.py", line 87, in wrapper
    return fn(*args, **kwargs)
  File "xla/torch_xla/runtime.py", line 149, in world_size
    if torch_xla._XLAC._xla_get_replication_devices_count() == 0:
RuntimeError: Bad StatusOr access: FAILED_PRECONDITION: No visible GPU devices.

Environment

Additional context

This feels similar to the issue solved by #5960 in the benchmarking scripts. Basically, we initialize CUDA in the parent process and then fork with multiprocessing, so the forked workers cannot initialize CUDA themselves.
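
For context, here is a minimal, torch_xla-free sketch of the failure mode (illustrative only; it assumes a CUDA-enabled PyTorch build with a visible GPU, and is not the test's actual code path):

import multiprocessing as mp

import torch


def child():
    # A forked child cannot safely re-initialize CUDA: plain PyTorch raises
    # "Cannot re-initialize CUDA in forked subprocess", and the PJRT GPU
    # client fails with CUDA_ERROR_NOT_INITIALIZED as in the log above.
    torch.ones(1, device="cuda")


if __name__ == "__main__":
    torch.cuda.init()             # parent process initializes CUDA first
    ctx = mp.get_context("fork")  # the default start method on Linux
    p = ctx.Process(target=child)
    p.start()
    p.join()
    # Creating the context with mp.get_context("spawn"), or launching a
    # fresh interpreter via subprocess, avoids inheriting the parent's
    # CUDA state and works as expected.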

cc @miladm @JackCaoG

ysiraichi commented:

One way around this issue could be to use the subprocess library instead (the same solution as #5960). What do you think?
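
Roughly something like this (the worker script path and flag below are placeholders, not the real test entry point): each worker runs in a fresh interpreter and initializes CUDA itself instead of inheriting a forked CUDA context.

import os
import subprocess
import sys


def run_worker(script, *args):
    # Launch the worker in a new process with a clean CUDA state.
    env = dict(os.environ, PJRT_DEVICE="CUDA")
    return subprocess.run([sys.executable, script, *args], env=env, check=True)


# e.g. run_worker("test/some_train_worker.py", "--fake_data")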
