
benchmarks/torchbench_model: some benchmarks fail to load and kill experiment_runner's main process #6207

Closed
cota opened this issue Dec 19, 2023 · 3 comments · Fixed by #6375

Comments

cota (Collaborator) commented on Dec 19, 2023

🐛 Bug

In dfcf306e7 ("Apply precision config env vars in the root process.", #6152) we started running load_benchmark() from experiment_runner's main process. Unfortunately, for some models load_benchmark() exits the calling process, so experiment_runner exits prematurely.
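A minimal, self-contained sketch of the failure mode (the loader below is a hypothetical stand-in, not the actual torchbench_model code, and the model names are just examples):

```python
import sys

def load_benchmark(name):
    # Hypothetical stand-in for the torchbench loader: some models'
    # setup code calls sys.exit() (or otherwise terminates) instead of
    # raising a normal exception when something goes wrong.
    if name == "pytorch_CycleGAN_and_pix2pix":
        sys.exit(2)  # SystemExit propagates in the *calling* process
    return name  # placeholder "model"

for name in ["pytorch_unet", "pytorch_CycleGAN_and_pix2pix", "hf_Bert"]:
    # Since #6152 this call happens in experiment_runner's main process,
    # so the SystemExit above kills the runner before the remaining
    # benchmarks are attempted, and the shell observes a non-zero exit code.
    print("loaded", load_benchmark(name))
```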

To Reproduce

Try to run under XLA any of the benchmarks added to the deny list in #6199. For example:

python xla/benchmarks/experiment_runner.py --dynamo=openxla --dynamo=openxla_eval --xla=PJRT --test=eval --test=train --accelerator=cuda --output-dirname=/tmp/pix2pix --repeat=5 --print-subprocess --suite-name=torchbench --filter='^pytorch_CycleGAN_and_pix2pix$' --log-level=debug ; echo $?

Note: pytorch_CycleGAN_and_pix2pix also fails early under inductor.

Expected behavior

The above should print a 0 exit code regardless of whether the benchmark fails to run. However, it prints 2.
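One way to restore that behavior, sketched below (this is not the actual experiment_runner code; the stub loader and the names are illustrative), is to trap per-benchmark failures, record them, and keep the main process alive so it can still exit 0:

```python
import sys

def load_benchmark(name):
    # Stub standing in for the real loader so the sketch is runnable:
    # this benchmark's setup exits the process instead of raising.
    if name == "pytorch_CycleGAN_and_pix2pix":
        sys.exit(2)
    return name  # placeholder "model"

def run_all(names):
    failures = {}
    for name in names:
        try:
            model = load_benchmark(name)
        except SystemExit as e:
            # sys.exit() raises SystemExit, which does not derive from
            # Exception, so it must be caught explicitly. (os._exit() or a
            # hard crash would still require subprocess isolation.)
            failures[name] = f"loader exited with code {e.code}"
            continue
        except Exception as e:
            failures[name] = repr(e)
            continue
        print(f"would run eval/train for {model}")
    for name, reason in failures.items():
        print(f"SKIPPED {name}: {reason}", file=sys.stderr)
    return 0  # the runner itself succeeded even though a benchmark did not

if __name__ == "__main__":
    sys.exit(run_all(["pytorch_CycleGAN_and_pix2pix", "pytorch_unet"]))
```

Running the load in a forked subprocess would give stronger isolation (covering hard exits and crashes); the catch-and-continue approach above is just the simpler option.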

Environment

  • Reproducible on XLA backend [CPU/TPU]: GPU
  • torch_xla version: dfcf306 and later.
cota changed the title to "benchmarks/torchbench_model: some benchmarks fail to load and kill experiment_runner's main process" on Dec 19, 2023
yeounoh (Contributor) commented on Jan 17, 2024

@cota thanks for addressing this, can we close this issue now?

cota (Collaborator, Author) commented on Jan 26, 2024

We're working around this issue by temporarily disabling the affected benchmarks, but AFAICT it is still an issue. If you want, we can close it -- I won't be working on this in the near future. Maybe @ysiraichi will? Let's have him decide what to do with this issue.

ysiraichi (Collaborator) commented:

I'm only running dynamo+openxla tests here, and so far I can't reproduce these loading errors:

  • pytorch_CycleGAN_and_pix2pix: runs eval and train successfully
  • pytorch_unet: runs eval and train successfully if we don't throw on AMP
  • tacotron2: eval and train fail at execution time (not when it's loading the model)
