
Torchbench Benchmark Running ERROR #6286

Open
zpcore opened this issue Jan 10, 2024 · 10 comments
Comments

@zpcore
Collaborator

zpcore commented Jan 10, 2024

Tested with the latest commit (235b82b) on TPU v5 and noticed the following error:

"/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
WARNING:__main__:Enabling fast F
32 multiplication for PyTorch
Traceback (most recent call last):
  File \"/home/piz/xla/benchmarks/experiment_runner.py\", line 878, in <module>
    main()
  File \"/home/piz/xla/benchmarks/experiment_runner.py\", line 874, in main
    runner.run()
  File \"/home/piz/xla/benc
hmarks/experiment_runner.py\", line 58, in run
    self.run_single_config()
  File \"/home/piz/xla/benchmarks/experiment_runner.py\", line 237, in run_single_config
    reset_rng_state(benchmark_experiment)
  File \"/home/piz/xla/benchmarks/util.py\", line 53, in reset_rng_stat
e
    device = benchmark_experiment.get_device()
  File \"/home/piz/xla/benchmarks/benchmark_experiment.py\", line 163, in get_device
    return xm.xla_device(devkind=self.accelerator.upper())
  File \"/home/piz/.local/lib/python3.10/site-packages/torch_xla/core/xla_model.py\",
 line 207, in xla_device
    return runtime.xla_device(n, devkind)
  File \"/home/piz/.local/lib/python3.10/site-packages/torch_xla/runtime.py\", line 88, in wrapper
    return fn(*args, **kwargs)
  File \"/home/piz/.local/lib/python3.10/site-packages/torch_xla/runtime.py\", li
ne 124, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0
WARNING: All
 log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1704829518.293360  143696 pjrt_registry.cc:77] Non-OK-status: pjrt::LoadPjrtPlugin(\"tpu\", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu
*** Be
gin stack trace ***
	tsl::CurrentStackTrace()
	
	torch_xla::runtime::PjRtComputationClient::PjRtComputationClient()
	
	torch_xla::runtime::GetComputationClient()
	torch_xla::bridge::GetDefaultDevice()
	torch_xla::bridge::GetCurrentDevice()
	torch_xla::bridge::GetCur
rentAtenDevice()
	
	
	
	_PyObject_MakeTpCall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	
	Py_FinalizeEx
	Py_RunMain
	Py_BytesMain
	
	__libc_start_main
	_start
*** End stack trace ***

*** Check failure stack trace: ***
    @     0x7fc2fed3ecb9 
 absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7fc2f64ef096  torch_xla::runtime::InitializePjRt()
    @     0x7fc2f6dc87e5  torch_xla::runtime::PjRtComputationClient::PjRtComputationClient()
    @     0x7fc2f6dc186a  torch_xla::runtime::GetCom
putationClient()::{lambda()#1}::operator()()
    @     0x7fc2f6dc1c3c  torch_xla::runtime::GetComputationClient()
    @     0x7fc2f69a7d8d  torch_xla::bridge::GetDefaultDevice()
    @     0x7fc2f69a7ea5  torch_xla::bridge::GetCurrentDevice()
    @     0x7fc2f69ac767  torch_xla:
:bridge::GetCurrentAtenDevice()
    @     0x7fc2f6956195  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
    @     0x7fc2f697502e  pybind11::cpp_function::dispatcher()
    @     0x560b6f01310e  (unknown)
https://symbolize.stripped_domain/r/?trace=7fc3c1a969fc,7fc3
c1a4251f&map= 
*** SIGABRT received by PID 143696 (TID 143696) on cpu 207 from PID 143696; stack trace: ***
PC: @     0x7fc3c1a969fc  (unknown)  pthread_kill
    @     0x7fc2a1101067        928  (unknown)
    @     0x7fc3c1a42520  (unknown)  (unknown)
https://symbolize.strippe
d_domain/r/?trace=7fc3c1a969fc,7fc2a1101066,7fc3c1a4251f&map= 
E0109 19:45:18.426944  143696 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0109 19:45:18.426952  143696 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0109
 19:45:18.426955  143696 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0109 19:45:18.426956  143696 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0109 19:45:18.426967  143696 coredump_hook.cc:546] RAW: Cannot send
 fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0109 19:45:18.426970  143696 coredump_hook.cc:598] RAW: Dumping core locally.
E0109 19:45:18.486146  143696 process_state.cc:807] RAW: Raising signal 6 with default behavior

However, with commit 0857f2a088e9d91be89cf24f33c6564b2e19bc77, there is no issue. The problem is only related to the code under xla/benchmarks/...

Command used:

python experiment_runner.py  --suite-name=torchbench --xla=PJRT --accelerator=tpu --progress-bar --filter BERT_pytorch
@JackCaoG
Collaborator

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1704829518.293360  143696 pjrt_registry.cc:77] Non-OK-status: pjrt::LoadPjrtPlugin("tpu", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu
*** Begin stack trace ***

@will-cromar what does

Non-OK-status: pjrt::LoadPjrtPlugin("tpu", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu

mean? This seems to just be a setup or hardware issue.

@zpcore
Collaborator Author

zpcore commented Jan 10, 2024

I think it is only related to the benchmark execution script. It probably calls the subprocess without inheriting the environment variables from the parent process, where PJRT_DEVICE is set.

By the way, previously we didn't need to set PJRT_DEVICE on TPU devices; it would automatically be set to TPU. But recently, if we don't set the env var, the compiler reports an error.
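
To illustrate the suspicion, here is a minimal sketch (not the actual runner code; the command line and env handling below are assumptions) of launching a child process so that it inherits PJRT_DEVICE from the parent:

```python
import os
import subprocess

# Copy the parent's environment so PJRT_DEVICE (and everything else) is
# inherited; passing a hand-built env= dict that omits it would reproduce
# the failure described above.
child_env = os.environ.copy()
child_env.setdefault("PJRT_DEVICE", "TPU")  # assumed default for TPU hosts

subprocess.check_call(
    ["python", "experiment_runner.py", "--filter", "BERT_pytorch"],
    env=child_env,  # omitting env= entirely also inherits os.environ
)
```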

@zpcore
Collaborator Author

zpcore commented Jan 10, 2024

cc @frgossen and @ysiraichi to see if they have any clues about the issue.

@will-cromar
Collaborator

This error likely means that the script is referencing the XLA device before calling spawn. I have a PR out to improve this error message: #6291
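
For context, a minimal sketch of the ordering constraint being described (xm.xla_device and xmp.spawn are the public torch_xla entry points; the rest is illustrative):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned child initializes its own PJRT client here.
    device = xm.xla_device()
    print(index, device)

if __name__ == "__main__":
    # Touching the XLA device *before* spawn would initialize the PJRT
    # client in the parent, leaving /dev/vfio/0 busy for the children:
    # device = xm.xla_device()  # <- the problematic pattern
    xmp.spawn(_mp_fn, args=())
```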

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

cc @ysiraichi on this issue.

The issue popped up again. Unfortunately, it didn't show the error message that was checked in with #6291.

This time I noticed that the issue keeps recurring once a program execution has terminated abnormally (e.g., with a coredump), based on the error message RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0.
Does anyone know how to release the TPU resources held by a dangling failed run?
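
One workaround sometimes suggested for a busy vfio device, offered here as an assumption rather than thread-confirmed advice, is to find and kill the stale process that still holds the device. A rough sketch:

```python
import subprocess

# Assumption: a crashed run left a process holding the TPU's vfio device.
# `lsof -t` prints the PIDs of processes that have the file open.
pids = subprocess.run(
    ["lsof", "-t", "/dev/vfio/0"], capture_output=True, text=True
).stdout.split()
for pid in pids:
    subprocess.run(["kill", "-9", pid])  # forcefully release the device
```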

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

I believe this is due to the subprocess issue, which is related to #6207. We are somewhat missing code reviews for the code under benchmarks/. I will follow up on the issue.

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

It looks like both the forked child process (https://github.com/pytorch/xla/blob/bc2ebed8dfc63a731c1f3704da0cef0f85f28865/benchmarks/experiment_runner.py#L156C1-L163C12) and the root process claimed the PJRT runtime, which results in RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy.
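
Schematically, the pattern looks like this (a stripped-down sketch, not the actual runner code):

```python
import subprocess
import torch_xla.core.xla_model as xm

# Root process: initializing the XLA device starts a PJRT client and
# opens /dev/vfio/0 ...
device = xm.xla_device()

# ... so when the child process below tries to start its own PJRT client
# against the same TPU, it fails with "Device or resource busy".
subprocess.check_call(["python", "experiment_runner.py", "--filter", "BERT_pytorch"])
```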

@ysiraichi
Collaborator

When do we actually start the PJRT client runtime? Maybe we could solve this by having an API for clearing the started runtime.

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

The issue is that the model is moved onto the XLA device here: code. Even though we call del benchmark later, the model is still being held on the device.
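
A tiny illustration of why del alone is not enough here (a sketch of the single-process view; the sizes and module are arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm

model = torch.nn.Linear(16, 16)
model = model.to(xm.xla_device())  # this .to() alone initializes the PJRT
                                   # client in the current process

del model  # drops the Python reference, but the process still owns the
           # PJRT client (and hence the TPU), so a subsequent child
           # process cannot initialize its own client
```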

@ysiraichi
Collaborator

Maybe #6375 solves this: it makes it so we don't need to call load_benchmark() for checking precision.
