[torchbench] moco fails to run with CUDA OpenXLA fallback. #7647

Open
ysiraichi opened this issue Jul 9, 2024 · 8 comments

🐛 Bug

Running the upstreamed benchmarking scripts with the following command results in an unexpected error. It does work when using the CPU OpenXLA fallback, though.

python xla/benchmarks/experiment_runner.py \
       --suite-name torchbench \
       --accelerator cuda \
       --xla PJRT \
       --dynamo None \
       --test eval \
       --repeat 30 --iterations-per-run 5 \
       --print-subprocess \
       --no-resume --filter moco
[rank0]: Traceback (most recent call last):
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 1011, in <module>
[rank0]:     main()
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 1007, in main
[rank0]:     runner.run()
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 73, in run
[rank0]:     self.run_single_config()
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 278, in run_single_config
[rank0]:     metrics, last_output = self.run_once_and_gather_metrics(
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 374, in run_once_and_gather_metrics
[rank0]:     output, _ = loop(iter_fn=self._default_iter_fn)
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 331, in loop
[rank0]:     output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 244, in _default_iter_fn
[rank0]:     self._mark_step(benchmark_experiment, output)
[rank0]:   File "xla/benchmarks/experiment_runner.py", line 456, in _mark_step
[rank0]:     xm.mark_step()
[rank0]:   File "xla/torch_xla/core/xla_model.py", line 1056, in mark_step
[rank0]:     torch_xla._XLAC._xla_step_marker(
[rank0]: RuntimeError: ./torch_xla/csrc/runtime/pjrt_computation_client.h:192 : Check failed: HasValue()
[rank0]: *** Begin stack trace ***
[rank0]:        tsl::CurrentStackTrace[abi:cxx11]()
[rank0]:        torch_xla::runtime::PjRtComputationClient::PjRtData::GetHandle()
[rank0]:        torch::lazy::LazyGraphExecutor::RunPostOrder(std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&, torch::lazy::LazyGraphExecutor::SyncTensorCollection*)
[rank0]:        torch_xla::XLAGraphExecutor::RunPostOrder(std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&, torch::lazy::LazyGraphExecutor::SyncTensorCollection*)
[rank0]:        torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
[rank0]:        torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, bool, bool, bool)
[rank0]:        torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool)
[rank0]:
[rank0]:
[rank0]:
[rank0]:
[rank0]:
[rank0]:
[rank0]:
[rank0]:
[rank0]:        _PyObject_MakeTpCall
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        _PyEval_EvalFrameDefault
[rank0]:
[rank0]:        PyEval_EvalCode
[rank0]:
[rank0]:
[rank0]:
[rank0]:        _PyRun_SimpleFileObject
[rank0]:        _PyRun_AnyFileObject
[rank0]:        Py_RunMain
[rank0]:        Py_BytesMain
[rank0]:        __libc_start_main
[rank0]:        _start
[rank0]: *** End stack trace ***
[rank0]: buffer with shape s64[1] on device CUDA:0 is deleted

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
  • torch_xla version: c782e0d

cc @miladm @JackCaoG

@JackCaoG commented Jul 9, 2024

It seems like during mark_step() we found an XLATensor with an empty data handle.
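
For context, a rough sketch of what mark_step() does from the Python side (illustrative only, not taken from the failing run): it cuts the lazy trace and materializes every live XLA tensor, which is where the GetHandle() check in the stack trace fires.

import torch
import torch_xla.core.xla_model as xm

t = torch.ones(2, 2, device=xm.xla_device()) + 1  # recorded lazily; no device buffer yet
xm.mark_step()  # compiles and executes the pending graph; live tensors should now hold valid PjRt buffers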

@ysiraichi

From what I have investigated so far, the only fallback op we are running differently is aten::_local_scalar_dense.

@JackCaoG commented Jul 9, 2024

_local_scalar_dense should be run on CPU, I guess? This op usually shows up when we move a tensor to CPU, e.g. to print it.
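
A minimal sketch of how this op is usually reached (illustrative, not code from moco): reading a scalar out of an XLA tensor, e.g. via .item(), dispatches aten::_local_scalar_dense, which is exactly the op the OpenXLA fallback has to execute here.

import torch
import torch_xla.core.xla_model as xm

t = torch.tensor([3], dtype=torch.int64, device=xm.xla_device())  # s64[1], same shape as the deleted buffer
n = t.item()  # dispatches aten::_local_scalar_dense; with the CUDA OpenXLA fallback, it runs outside XLA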

@ysiraichi

Right. But I wonder whether this issue points to a problem in the CUDA OpenXLA fallback implementation, in the sense that, even if we run that op on CUDA, it should still work.

@ysiraichi

This is odd. I tried replacing the DLPack conversion with tensor.to("cpu").to("cuda") and tensor.to("cpu").to("xla"), and still got the same error.
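
For reference, the two variants amount to roughly this at the tensor level (illustrative stand-ins for the conversion inside the fallback, not the actual torch_xla code; the function names are made up):

def xla_to_cuda_via_cpu(t):
    # variant 1: XLA tensor -> host copy -> CUDA tensor, instead of the DLPack handoff
    return t.to("cpu").to("cuda")

def cuda_to_xla_via_cpu(t):
    # variant 2: CUDA tensor -> host copy -> XLA tensor, for moving the fallback result back
    return t.to("cpu").to("xla")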

@ysiraichi

Forcing CPU fallback on _local_scalar_dense did work, though.

@ysiraichi

@JackCaoG
I have been debugging this for a while now, and here's what I found out:

  • The PjRtData that was deleted is not the same one the fallback input holds. It was created in a later mark_step() call.
  • PjRtData instantiation: it is first created by a CreateDataPlaceholder call inside the ExtractIRAndPrepareXlaData_ function, (as far as I understand) when mark_step() is called.
  • PjRtStreamExecutorBuffer deletion: Delete calls Release after RunPostOrder finishes. However, I believe that, at that point, the buffer is already deleted (i.e. PjRtStreamExecutorBuffer::IsDeleted() == true), because PjRtStreamExecutorBuffer::ConfirmDonation was called earlier.

Basically, this is the timeline I am seeing:

...
CreateDataPlaceholder(tensor: 0x55a254171e70)
XLAData (ptr: 0x55a254142e60):
  Data Device: CUDA:0
  Data Shape: s64[1]
  Data Handle: None
...
PjRtData::Assign: Handle changes from None to 0x7fecfc0710a0
  >> Old: XLAData (0x55a254142e60):
  Data Device: CUDA:0
  Data Shape: s64[1]
  Data Handle: None

  >> New: XLAData (0x7fecfc677340):
  Data Device: CUDA:0
  Data Shape: s64[1]
  Data Handle: 0x7fecfc0710a0
...
PjRtStreamExecutorBuffer::GetBufferWithHold(Usage): 0x7fecfc0710a0
...
PjRtStreamExecutorBuffer::GetBufferWithHold(Donation): 0x7fecfc0710a0
...
PjRtStreamExecutorBuffer::ConfirmDonation: 0x7fecfc0710a0
  >> Resets the buffer, i.e. deletes it!
...
Could NOT get handle (0x55a254142e60): XLAData:
  Data Device: CUDA:0
  Data Shape: s64[1]
  Data Handle: Deleted

PjRtStreamExecutorBuffer::Delete: 0x7fecfc0710a0
  >> Delete is called, but buffer is already deleted, i.e. `PjRtStreamExecutorBuffer::device_buffer_ == nullptr`
...

Do you see anything strange? Any ideas on where to look?

@ysiraichi

In an external discussion, we decided to work around this issue for now by forcing aten::_local_scalar_dense to run on CPU. Since this isn't exactly fixed (i.e. it may actually be a symptom of a more complex hidden error), I won't close this issue.
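
For illustration only, the effect of the workaround at the op level is roughly the following sketch (not the actual dispatcher change in torch_xla): the scalar read goes through an explicit host copy, so aten::_local_scalar_dense executes on a CPU tensor instead of going through the CUDA OpenXLA fallback.

import torch
import torch_xla.core.xla_model as xm

t = torch.tensor([3], dtype=torch.int64, device=xm.xla_device())
value = t.cpu().item()  # explicit host transfer; _local_scalar_dense then runs as a regular CPU kernel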
