mul: convert inputs to result type. #7130

Merged: 3 commits into master on Jun 3, 2024

Conversation

ysiraichi
Collaborator

Fix: #7084

This PR fixes a data-type-related problem in the mul operation. It does so by creating a structure, OpConfig, that behaves similarly to DoBinaryOp. The difference is that it also takes care of pre/post-processing the inputs and outputs, casting them to the correct data type.

Problem

import torch
import torch_xla.core.xla_model as xm

t = torch.rand(10, dtype=torch.half).to(xm.xla_device())
s = torch.tensor(10, dtype=torch.double).to(xm.xla_device())
out = t.mul_(s)
  • Tensor.mul_ is dispatched to its CompositeExplicitAutograd kernel
    • It wraps the scalar into a tensor, and calls torch.mul (functional version)
  • DoBinaryOp is called
    • Computes at::result_type (let's call it common_dtype) and passes it on to bin_op
    • Note that UnwrapNumber does nothing, since s is a tensor with is_wrapped_number_ unset
  • The computed common_dtype is passed on to tensor_methods::mul
    • Creates an IR node with data-type common_dtype
    • Does nothing with its inputs
  • Later, when BuildMul is called, we have two XlaOps with different data-types: f16 and f64
    • BuildMul promotes f16 to f64
  • The output's declared dtype is common_dtype (torch.float16), but the actual XlaOp is f64 (the CPU-only sketch below shows the dtype PyTorch itself expects)
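
For reference, the expected dtype can be checked on CPU with torch.result_type alone (a minimal sketch for illustration; it does not exercise the XLA lowering):

import torch

t = torch.rand(10, dtype=torch.half)      # dimensioned f16 tensor
s = torch.tensor(10, dtype=torch.double)  # zero-dim f64 tensor

# Both operands are of floating-point kind, so the zero-dim tensor does not
# promote the result: PyTorch expects the output dtype to be torch.float16.
print(torch.result_type(t, s))  # torch.float16

# A dimensioned f64 tensor, by contrast, does promote the result to f64.
print(torch.result_type(t, torch.rand(10, dtype=torch.double)))  # torch.float64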

Solution

Following PyTorch behavior [1, 2, 3], I created OpConfig: a structure that lets us specify common pre/post-processing on inputs and outputs.
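
As a rough Python analogue of that idea (the actual OpConfig is a C++ structure; the helper below is hypothetical and only sketches the pre-processing step):

import torch

def mul_converting_inputs(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of the approach, not the PR's C++ API: cast the
    # inputs to the expected result dtype up front (pre-processing), so the
    # lowered op never performs its own, different promotion.
    common_dtype = torch.result_type(a, b)
    return torch.mul(a.to(common_dtype), b.to(common_dtype))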

Affected Models

  • timm_nfnet (training + non-dynamo)

cc @miladm @JackCaoG @lezcano

@JackCaoG
Collaborator

Hmm, I am surprised that bf16 and f64's at::result_type is f64...

@ysiraichi
Collaborator Author

Hmm, I am surprised that bf16 and f64's at::result_type is f64.

This is a bit confusing, so let me try to clarify the cases, given op(bf16, f64):

bf16     f64      at::result_type   PromoteType
tensor   scalar   bf16              f64
scalar   tensor   f64               f64
tensor   tensor   f64               f64
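
The at::result_type column can be reproduced from Python with torch.result_type (a sketch for illustration; Python floats stand in for the wrapped scalars, since Python has no bf16 scalar type):

import torch

bf16_t = torch.rand(4, dtype=torch.bfloat16)
f64_t = torch.rand(4, dtype=torch.float64)

print(torch.result_type(bf16_t, 1.5))    # tensor, scalar -> torch.bfloat16
print(torch.result_type(1.5, f64_t))     # scalar, tensor -> torch.float64
print(torch.result_type(bf16_t, f64_t))  # tensor, tensor -> torch.float64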

@ysiraichi force-pushed the ysiraichi/fix-mul-dtype-promotion branch from 7221e69 to 7e28074 on May 29, 2024 at 15:39
@lezcano
Collaborator

lezcano commented May 29, 2024

The tl;dr is: Scalars don't promote unless they are of a different kind.
Here are the exact rules: https://pytorch.org/docs/stable/tensor_attributes.html
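
A minimal CPU sketch of that rule, using torch.result_type:

import torch

bf16_t = torch.rand(3, dtype=torch.bfloat16)
int_t = torch.tensor([1, 2, 3])  # int64 tensor

# Same kind (floating point): the scalar does not promote the tensor's dtype.
print(torch.result_type(bf16_t, 2.5))  # torch.bfloat16

# Different kind (integral tensor, floating-point scalar): promotes to the
# default floating-point dtype.
print(torch.result_type(int_t, 2.5))   # torch.float32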

@ysiraichi force-pushed the ysiraichi/fix-mul-dtype-promotion branch from 7e28074 to a1b67f2 on May 29, 2024 at 20:36
@ysiraichi merged commit 7938bb5 into master on Jun 3, 2024
19 checks passed
@bhavya01
Collaborator

@ysiraichi This PR is breaking the mul operation on TPUs.

It fails this check: https://github.com/pytorch/xla/blob/master/torch_xla/csrc/aten_xla_type.cpp#L161

>>> import torch
>>> import torch_xla
>>> x = torch.tensor([1,2,3]).to('xla')
>>> y = torch.tensor([2,4,5]).to('xla')
>>> x*y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: torch_xla/csrc/aten_xla_type.cpp:161 : Check failed: it != inputs_.end() 
*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	torch_xla::XLANativeFunctions::mul(at::Tensor const&, at::Tensor const&)
	
	at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
	
	
	at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
	
	PyNumber_Multiply
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	_PyRun_InteractiveLoopObject
	
	PyRun_AnyFileExFlags
	
	Py_BytesMain
	
	__libc_start_main
	
*** End stack trace ***

@ysiraichi
Collaborator Author

Apparently, you are not the only one: #7266

@vanbasten23
Collaborator

This PR is also impacting DDP:
[screenshot of the DDP failure]

@JackCaoG
Collaborator

Let me take a look this afternoon

@JackCaoG
Collaborator

I can't repro this issue, but I do see that half of our internal TPU tests crashed because of this. Let me revert this PR for now while figuring out what happened.

Successfully merging this pull request may close these issues.

[torchbench] timm_nfnet training failing on non-dynamo.