
Torchbench Benchmark Running ERROR #6286

Open
zpcore opened this issue Jan 10, 2024 · 10 comments
Comments

@zpcore
Collaborator

zpcore commented Jan 10, 2024

Tested with the latest commit (235b82b) on TPU v5 and noticed the following error:

"/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
WARNING:__main__:Enabling fast F
32 multiplication for PyTorch
Traceback (most recent call last):
  File \"/home/piz/xla/benchmarks/experiment_runner.py\", line 878, in <module>
    main()
  File \"/home/piz/xla/benchmarks/experiment_runner.py\", line 874, in main
    runner.run()
  File \"/home/piz/xla/benc
hmarks/experiment_runner.py\", line 58, in run
    self.run_single_config()
  File \"/home/piz/xla/benchmarks/experiment_runner.py\", line 237, in run_single_config
    reset_rng_state(benchmark_experiment)
  File \"/home/piz/xla/benchmarks/util.py\", line 53, in reset_rng_stat
e
    device = benchmark_experiment.get_device()
  File \"/home/piz/xla/benchmarks/benchmark_experiment.py\", line 163, in get_device
    return xm.xla_device(devkind=self.accelerator.upper())
  File \"/home/piz/.local/lib/python3.10/site-packages/torch_xla/core/xla_model.py\",
 line 207, in xla_device
    return runtime.xla_device(n, devkind)
  File \"/home/piz/.local/lib/python3.10/site-packages/torch_xla/runtime.py\", line 88, in wrapper
    return fn(*args, **kwargs)
  File \"/home/piz/.local/lib/python3.10/site-packages/torch_xla/runtime.py\", li
ne 124, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0
WARNING: All
 log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1704829518.293360  143696 pjrt_registry.cc:77] Non-OK-status: pjrt::LoadPjrtPlugin(\"tpu\", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu
*** Be
gin stack trace ***
	tsl::CurrentStackTrace()
	
	torch_xla::runtime::PjRtComputationClient::PjRtComputationClient()
	
	torch_xla::runtime::GetComputationClient()
	torch_xla::bridge::GetDefaultDevice()
	torch_xla::bridge::GetCurrentDevice()
	torch_xla::bridge::GetCur
rentAtenDevice()
	
	
	
	_PyObject_MakeTpCall
	_PyEval_EvalFrameDefault
	_PyFunction_Vectorcall
	
	Py_FinalizeEx
	Py_RunMain
	Py_BytesMain
	
	__libc_start_main
	_start
*** End stack trace ***

*** Check failure stack trace: ***
    @     0x7fc2fed3ecb9 
 absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7fc2f64ef096  torch_xla::runtime::InitializePjRt()
    @     0x7fc2f6dc87e5  torch_xla::runtime::PjRtComputationClient::PjRtComputationClient()
    @     0x7fc2f6dc186a  torch_xla::runtime::GetCom
putationClient()::{lambda()#1}::operator()()
    @     0x7fc2f6dc1c3c  torch_xla::runtime::GetComputationClient()
    @     0x7fc2f69a7d8d  torch_xla::bridge::GetDefaultDevice()
    @     0x7fc2f69a7ea5  torch_xla::bridge::GetCurrentDevice()
    @     0x7fc2f69ac767  torch_xla:
:bridge::GetCurrentAtenDevice()
    @     0x7fc2f6956195  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
    @     0x7fc2f697502e  pybind11::cpp_function::dispatcher()
    @     0x560b6f01310e  (unknown)
https://symbolize.stripped_domain/r/?trace=7fc3c1a969fc,7fc3
c1a4251f&map= 
*** SIGABRT received by PID 143696 (TID 143696) on cpu 207 from PID 143696; stack trace: ***
PC: @     0x7fc3c1a969fc  (unknown)  pthread_kill
    @     0x7fc2a1101067        928  (unknown)
    @     0x7fc3c1a42520  (unknown)  (unknown)
https://symbolize.strippe
d_domain/r/?trace=7fc3c1a969fc,7fc2a1101066,7fc3c1a4251f&map= 
E0109 19:45:18.426944  143696 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0109 19:45:18.426952  143696 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0109
 19:45:18.426955  143696 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0109 19:45:18.426956  143696 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0109 19:45:18.426967  143696 coredump_hook.cc:546] RAW: Cannot send
 fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0109 19:45:18.426970  143696 coredump_hook.cc:598] RAW: Dumping core locally.
E0109 19:45:18.486146  143696 process_state.cc:807] RAW: Raising signal 6 with default behavior

However, with commit 0857f2a088e9d91be89cf24f33c6564b2e19bc77, there is no issue. The problem is only related to the code under xla/benchmarks/...

Command used:

python experiment_runner.py  --suite-name=torchbench --xla=PJRT --accelerator=tpu --progress-bar --filter BERT_pytorch
@JackCaoG
Collaborator

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1704829518.293360  143696 pjrt_registry.cc:77] Non-OK-status: pjrt::LoadPjrtPlugin("tpu", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu
*** Begin stack trace ***

@will-cromar what does

Non-OK-status: pjrt::LoadPjrtPlugin("tpu", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu

mean? This seems to just be a setup or hardware issue.

@zpcore
Collaborator Author

zpcore commented Jan 10, 2024

I think it is only related to the benchmark execution script. It probably calls the subprocess without inheriting the environment variables from the parent process, where PJRT_DEVICE is set.

By the way, previously we didn't need to set PJRT_DEVICE on TPU devices; it would automatically be set to TPU. But recently, if we don't set the env var, the compiler reports an error.
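
To illustrate the suspicion, here is a minimal sketch (not the actual runner code; the command line and env handling below are assumptions) of launching a child process so that it inherits PJRT_DEVICE from the parent:

```python
import os
import subprocess

# Copy the parent's environment so PJRT_DEVICE (and everything else) is
# inherited; passing a hand-built env= dict that omits it would reproduce
# the failure described above.
child_env = os.environ.copy()
child_env.setdefault("PJRT_DEVICE", "TPU")  # assumed default for TPU hosts

subprocess.check_call(
    ["python", "experiment_runner.py", "--filter", "BERT_pytorch"],
    env=child_env,  # omitting env= entirely also inherits os.environ
)
```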

@zpcore
Collaborator Author

zpcore commented Jan 10, 2024

cc @frgossen and @ysiraichi to see if they have any clues about the issue.

@will-cromar
Collaborator

This error likely means that the script is referencing the XLA device before calling spawn. I have a PR out to improve this error message: #6291
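
For context, a minimal sketch of the ordering constraint being described (xm.xla_device and xmp.spawn are the public torch_xla entry points; the rest is illustrative):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned child initializes its own PJRT client here.
    device = xm.xla_device()
    print(index, device)

if __name__ == "__main__":
    # Touching the XLA device *before* spawn would initialize the PJRT
    # client in the parent, leaving /dev/vfio/0 busy for the children:
    # device = xm.xla_device()  # <- the problematic pattern
    xmp.spawn(_mp_fn, args=())
```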

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

cc @ysiraichi on this issue.

The issue popped up again. Unfortunately, it didn't show the error message that was checked in with #6291.

This time I noticed that the issue keeps recurring once a program execution has terminated abnormally (e.g., with a coredump), based on the error message RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0.
Does anyone know how to release the TPU resources held by a dangling failed run?
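
One workaround sometimes suggested for a busy vfio device, offered here as an assumption rather than thread-confirmed advice, is to find and kill the stale process that still holds the device. A rough sketch:

```python
import subprocess

# Assumption: a crashed run left a process holding the TPU's vfio device.
# `lsof -t` prints the PIDs of processes that have the file open.
pids = subprocess.run(
    ["lsof", "-t", "/dev/vfio/0"], capture_output=True, text=True
).stdout.split()
for pid in pids:
    subprocess.run(["kill", "-9", pid])  # forcefully release the device
```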

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

I believe this is due to the subprocess issue, which is related to #6207. We are somewhat missing code reviews for the code under benchmarks/. I will follow up on the issue.

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

It looks like both the forked child process (https://github.com/pytorch/xla/blob/bc2ebed8dfc63a731c1f3704da0cef0f85f28865/benchmarks/experiment_runner.py#L156C1-L163C12) and the root process claimed the PJRT runtime, which results in RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy.
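
Schematically, the pattern looks like this (a stripped-down sketch, not the actual runner code):

```python
import subprocess
import torch_xla.core.xla_model as xm

# Root process: initializing the XLA device starts a PJRT client and
# opens /dev/vfio/0 ...
device = xm.xla_device()

# ... so when the child process below tries to start its own PJRT client
# against the same TPU, it fails with "Device or resource busy".
subprocess.check_call(["python", "experiment_runner.py", "--filter", "BERT_pytorch"])
```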

@ysiraichi
Collaborator

When do we actually start the PJRT client runtime? Maybe we could solve this by having an API for clearing the started runtime.

@zpcore
Collaborator Author

zpcore commented Jan 24, 2024

The issue is that the model is moved onto the XLA device here: code. Even though we call del benchmark later, the model is still being held on the device.
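
A tiny illustration of why del alone is not enough here (a sketch of the single-process view; the sizes and module are arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm

model = torch.nn.Linear(16, 16)
model = model.to(xm.xla_device())  # this .to() alone initializes the PJRT
                                   # client in the current process

del model  # drops the Python reference, but the process still owns the
           # PJRT client (and hence the TPU), so a subsequent child
           # process cannot initialize its own client
```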

@ysiraichi
Collaborator

Maybe #6375 solves this: it makes it so we don't need to call load_benchmark() for checking precision.
