[torchbench] hf_GPT2 (large, too) fails to run on bfloat16 dtype. #6521
🐛 Bug
After converting the hf_GPT2 (and its large variation) model to bfloat16 and running it with the command under To Reproduce below, it fails with an error.
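For context, a minimal standalone sketch of roughly what this configuration exercises (not the benchmark runner's actual code): load GPT-2 from HuggingFace, cast its weights to bfloat16, and run a single eval step on an XLA device. The model name "gpt2" ("gpt2-large" for the large variant), the prompt, and the use of GPT2LMHeadModel are illustrative assumptions; the torchbench harness builds the model and example inputs itself.

import torch
import torch_xla.core.xla_model as xm
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Move a GPT-2 model to the XLA device with bfloat16 weights.
device = xm.xla_device()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(torch.bfloat16).to(device)
model.eval()

# Run a single inference step, mirroring the --test eval configuration.
inputs = tokenizer("Hello, world", return_tensors="pt").to(device)
with torch.no_grad():
    out = model(**inputs)
xm.mark_step()  # force compilation/execution of the recorded XLA graph
print(out.logits.dtype)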
To Reproduce

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --xla PJRT --dynamo None --test eval \
    --no-resume --print-subprocess \
    -k hf_GPT2
Related: bfloat16 (inference) and AMP (training) precision. #6518

Affected Configurations
Environment
cc @miladm