
benchmarks/torchbench_model: some benchmarks fail to load and kill experiment_runner's main process #6207

Closed
cota opened this issue Dec 19, 2023 · 3 comments · Fixed by #6375

Comments

cota (Collaborator) commented on Dec 19, 2023

🐛 Bug

In dfcf306e7 ("Apply precision config env vars in the root process.", #6152) we started running load_benchmark() from experiment_runner's main process. Unfortunately, for some models load_benchmark() exits the calling process, so experiment_runner exits prematurely.
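A minimal, self-contained sketch of the failure mode (the loader below is a hypothetical stand-in, not the actual torchbench_model code, and the model names are just examples):

```python
import sys

def load_benchmark(name):
    # Hypothetical stand-in for the torchbench loader: some models'
    # setup code calls sys.exit() (or otherwise terminates) instead of
    # raising a normal exception when something goes wrong.
    if name == "pytorch_CycleGAN_and_pix2pix":
        sys.exit(2)  # SystemExit propagates in the *calling* process
    return name  # placeholder "model"

for name in ["pytorch_unet", "pytorch_CycleGAN_and_pix2pix", "hf_Bert"]:
    # Since #6152 this call happens in experiment_runner's main process,
    # so the SystemExit above kills the runner before the remaining
    # benchmarks are attempted, and the shell observes a non-zero exit code.
    print("loaded", load_benchmark(name))
```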

To Reproduce

Try to run under XLA any of the benchmarks added to the deny list in #6199. For example:

python xla/benchmarks/experiment_runner.py --dynamo=openxla --dynamo=openxla_eval --xla=PJRT --test=eval --test=train --accelerator=cuda --output-dirname=/tmp/pix2pix --repeat=5 --print-subprocess --suite-name=torchbench --filter='^pytorch_CycleGAN_and_pix2pix$' --log-level=debug ; echo $?

Note: pytorch_CycleGAN_and_pix2pix also fails early under inductor.

Expected behavior

The above should print a 0 exit code regardless of whether the benchmark fails to run. However, it prints 2.
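One way to restore that behavior, sketched below (this is not the actual experiment_runner code; the stub loader and the names are illustrative), is to trap per-benchmark failures, record them, and keep the main process alive so it can still exit 0:

```python
import sys

def load_benchmark(name):
    # Stub standing in for the real loader so the sketch is runnable:
    # this benchmark's setup exits the process instead of raising.
    if name == "pytorch_CycleGAN_and_pix2pix":
        sys.exit(2)
    return name  # placeholder "model"

def run_all(names):
    failures = {}
    for name in names:
        try:
            model = load_benchmark(name)
        except SystemExit as e:
            # sys.exit() raises SystemExit, which does not derive from
            # Exception, so it must be caught explicitly. (os._exit() or a
            # hard crash would still require subprocess isolation.)
            failures[name] = f"loader exited with code {e.code}"
            continue
        except Exception as e:
            failures[name] = repr(e)
            continue
        print(f"would run eval/train for {model}")
    for name, reason in failures.items():
        print(f"SKIPPED {name}: {reason}", file=sys.stderr)
    return 0  # the runner itself succeeded even though a benchmark did not

if __name__ == "__main__":
    sys.exit(run_all(["pytorch_CycleGAN_and_pix2pix", "pytorch_unet"]))
```

Running the load in a forked subprocess would give stronger isolation (covering hard exits and crashes); the catch-and-continue approach above is just the simpler option.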

Environment

  • Reproducible on XLA backend [CPU/TPU]: GPU
  • torch_xla version: dfcf306 and later.
cota changed the title to "benchmarks/torchbench_model: some benchmarks fail to load and kill experiment_runner's main process" on Dec 19, 2023
yeounoh (Contributor) commented on Jan 17, 2024

@cota thanks for addressing this, can we close this issue now?

cota (Collaborator, Author) commented on Jan 26, 2024

We're working around this issue by temporarily disabling the affected benchmarks, but AFAICT it is still an issue. If you want, we can close it -- I won't be working on this in the near future. Maybe @ysiraichi will? Let's have him decide what to do with this issue.

ysiraichi (Collaborator) commented:

I'm only running dynamo+openxla tests here, and so far I can't reproduce these loading errors:

  • pytorch_CycleGAN_and_pix2pix: runs eval and train successfully
  • pytorch_unet: runs eval and train successfully if we don't throw on AMP
  • tacotron2: eval and train fail at execution time (not when it's loading the model)
