hf_T5_large

🐛 Bug

hf_T5_large training fails to run on dynamo. The repro command and the resulting error are below.
```sh
python xla/benchmarks/experiment_runner.py \
  --suite-name torchbench --accelerator cuda --repeat 8 --iterations-per-run 1 \
  --xla PJRT --dynamo None --test train \
  -k hf_T5_large
```
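For context, here is a minimal standalone sketch of what this benchmark path exercises: a single t5-large training step compiled through dynamo with the XLA backend. The model/optimizer setup, batch shapes, and the `openxla` backend name are assumptions for illustration, not taken from the benchmark harness.

```python
# Hypothetical minimal sketch -- not the benchmark harness itself.
# Assumes torch, torch_xla, and transformers are installed and a CUDA
# device is visible to PJRT (e.g. PJRT_DEVICE=CUDA).
import torch
import torch_xla.core.xla_model as xm
from transformers import T5ForConditionalGeneration

device = xm.xla_device()
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch; shapes and token ids are arbitrary but within T5's vocab.
input_ids = torch.randint(0, 32000, (4, 128), device=device)
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 32000, (4, 128), device=device)

def train_step():
    optimizer.zero_grad()
    loss = model(input_ids=input_ids,
                 attention_mask=attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    return loss

# "openxla" is the dynamo backend name registered by recent torch_xla
# builds; older builds may expose it under a different name.
compiled_step = torch.compile(train_step, backend="openxla")
loss = compiled_step()
xm.mark_step()  # materialize the step; the traceback below fails inside mark_step
print(loss.item())
```

This is only meant to show the dynamo + PJRT path the harness goes through; the failure itself reproduces via the benchmark command above.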
The benchmark run fails with:

```
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 945, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 941, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 61, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 256, in run_single_config
    metrics, last_output = self.run_once_and_gather_metrics(
  File "xla/benchmarks/experiment_runner.py", line 345, in run_once_and_gather_metrics
    output, _ = loop(iter_fn=self._default_iter_fn)
  File "xla/benchmarks/experiment_runner.py", line 302, in loop
    output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
  File "xla/benchmarks/experiment_runner.py", line 218, in _default_iter_fn
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 400, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 156, in train
    self._optimizer_zero_grad()
  File "xla/benchmarks/benchmark_model.py", line 159, in torch_dynamo_resume_in_train_at_156
    loss = self.compute_loss(pred)
  File "xla/benchmarks/benchmark_model.py", line 160, in torch_dynamo_resume_in_train_at_159
    loss.backward()
  File "xla/benchmarks/benchmark_model.py", line 161, in torch_dynamo_resume_in_train_at_160
    self._optimizer_step()
  File "xla/benchmarks/benchmark_model.py", line 150, in _optimizer_step
    self.optimizer.step()
  File "torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "torch/optim/adam.py", line 135, in step
    @_use_grad_for_differentiable
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 917, in forward
    return compiled_fn(full_args)
  File "torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 107, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 181, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "torch/_dynamo/backends/torchxla.py", line 36, in fwd
    compiled_graph = bridge.extract_compiled_graph(model, args)
  File "xla/torch_xla/core/dynamo_bridge.py", line 618, in extract_compiled_graph
    xm.mark_step()
  File "xla/torch_xla/core/xla_model.py", line 1056, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: Bad StatusOr access: INTERNAL: ptxas exited with non-zero error code 65280, output: ptxas /tmp/tempfile-benchmarking-group-a100-40g-q60p-3afc5b57-185461-6157c733d1fd3, line 4045; error : Entry function 'loop_broadcast_fusion_7' uses too much parameter space (0x1200 bytes, 0x1100 max).
ptxas fatal : Ptx assembly aborted due to errors
If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
```
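In other words, ptxas reports that the fused kernel `loop_broadcast_fusion_7` needs 0x1200 = 4608 bytes of kernel parameter space, which exceeds the reported 0x1100 = 4352-byte limit by 256 bytes.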
cc @miladm @JackCaoG @vanbasten23 @cota @golechwierowicz @frgossen @zpcore
Since the last report, it has started failing due to OOM instead.