Torchbench Benchmark Running ERROR #6286
Comments
@will-cromar what does
mean? This seems to just be a setup or hardware issue.
I think it is only related to the benchmark execution script. It probably launches the subprocess without inheriting the environment variables from the parent process where By the way, previously we didn't need to set
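For reference, a minimal sketch of launching a child process that explicitly inherits the parent's environment, using only the Python standard library. This is not the code in `benchmarks/`; `run_child` and the variable name `EXAMPLE_RUNTIME_SETTING` are hypothetical and only illustrate the pattern.

```python
import os
import subprocess
import sys


def run_child(code, extra_env=None):
    """Run `code` in a fresh Python process that inherits the parent's environment.

    Hypothetical helper for illustration; not part of the benchmark scripts.
    """
    # Start from a copy of the parent's environment so nothing set there is lost.
    env = os.environ.copy()
    # Optionally layer per-experiment overrides on top of the inherited values.
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # EXAMPLE_RUNTIME_SETTING is a placeholder name, not a real runtime flag.
    os.environ["EXAMPLE_RUNTIME_SETTING"] = "on"
    print(run_child("import os; print(os.environ.get('EXAMPLE_RUNTIME_SETTING'))"))
```

Passing `env=` with a dictionary that was built from scratch replaces the child's environment entirely, so copying `os.environ` first is what keeps the parent's settings visible in the child.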
@frgossen and @ysiraichi to see if they have some clues about the issue.
This error likely means that the script is referencing the XLA device before calling spawn. I have a PR out to improve this error message: #6291
@ysiraichi to this issue. The issue popped up again. Unfortunately, it didn't show the improved error message checked in with #6291. This time I noticed that the issue keeps happening once the program execution has terminated abnormally (e.g., with a core dump). Based on the error message
I believe this is due to the subprocess issue, which is related to #6207. We have been missing some code reviews for changes under benchmarks/. I will follow up on the issue.
It looks like both the forked child process (https://github.com/pytorch/xla/blob/bc2ebed8dfc63a731c1f3704da0cef0f85f28865/benchmarks/experiment_runner.py#L156C1-L163C12) and the root process claimed the PJRT runtime, which results in
When do we actually start the PJRT client runtime? Maybe we could solve this by having an API for clearing the started runtime.
The issue is that the model was moved to the XLA device here: code. Even though we call
Maybe #6375 solves this. It makes it so we don't need to
Tested with the latest commit (235b82b) on TPU V5 and noticed the following error:
However, with commit "0857f2a088e9d91be89cf24f33c6564b2e19bc77", there is no issue. The issue is only related to the code under
xla/benchmark/...
Command used: