Lowering as_strided errors for input tensors smaller than size-stride specs. #5719
Comments
@JackCaoG any thoughts?
Right, I took a look at the logic in xla/torch_xla/csrc/ops/as_strided.cpp, line 26 (at e7af313). We should be able to check this at a higher level. Can we check it where this is handled, in xla/torch_xla/csrc/aten_xla_type.cpp, lines 699 to 716 (at e7af313)? Maybe we just need to expand that check.
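(For illustration only: the bounds check being discussed boils down to verifying that the largest linear index the requested view touches fits inside the input tensor's own element count. A minimal Python sketch of that rule follows; the helper name is made up, and the real check would live in the C++ XLANativeFunctions::as_strided path rather than in Python.)

```python
import torch

def as_strided_in_bounds(t: torch.Tensor, size, stride, storage_offset=0) -> bool:
    """True if the largest element the view touches fits inside t's own elements.

    Assumes non-negative strides (which torch.as_strided requires anyway).
    """
    if any(s == 0 for s in size):
        return storage_offset <= t.numel()
    last = storage_offset + sum((s - 1) * st for s, st in zip(size, stride))
    return last < t.numel()

x = torch.randn(20)
print(as_strided_in_bounds(x, (20,), (1,), 0))       # True: exactly fits
print(as_strided_in_bounds(x[10:], (20,), (1,), 0))  # False: the 10-element view is too small
```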
It's exactly there!
OK, let's make the check there, then.
I mean: it does surface the error to the graph building phase (instead of lowering). But the code still doesn't work because of the reason mentioned above. Do you still think this solution should be merged?
That's weird... I thought that, during fallback, we move all XLA tensors to CPU, perform the operation on CPU, and then move the result back to XLA. Is that not the case?
As far as I understand, if we execute the snippet below, the CPU fallback breaks down:

```python
import torch
import torch_xla.core.xla_model as xm

# x is created with capacity for 20 elements
x = torch.randn(20, device=xm.xla_device())
# y has shape (10,), but its storage is the same as x's
y = x[10:]
# When y is moved to CPU for the fallback:
# - y has shape (10,), so a CPU tensor of shape (10,) will be created,
#   i.e. the CPU tensor will have a storage capacity for only 10 elements
# - we can't reshape it to (20,), since the storage doesn't have this capacity
z = y.as_strided((20,), (1,), 0)
```
Oh, OK, I see the problem... I guess it is one of those cases that's very hard to implement correctly. @bdhirsh, given that functionalization now lives upstream, in the case of the example above, can we just expand y from [10] to [20]? I am a bit confused about what the expected behavior of this as_strided call is.
In your example, you are explicitly asking for an offset of 0 when creating z. A good way to think about as_strided is that it operates on the underlying storage of the base tensor, not on the view itself. Since y shares its storage with x, the requested (20,) view is well defined there. Now, this is the sort of thing that will surely have plenty of edge cases when we mix a few views with a few in-place ops... Surely @bdhirsh will have a better understanding of all these.
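(To make the storage-offset point concrete, here is how the same snippet behaves in eager CPU mode; this is just a self-contained illustration of standard as_strided semantics.)

```python
import torch

x = torch.randn(20)
y = x[10:]   # view: shape (10,), shares x's 20-element storage, storage offset 10

# The explicit storage_offset=0 is absolute (relative to the shared storage,
# not to y's own offset), so the 20-element view is valid on CPU...
z = y.as_strided((20,), (1,), 0)
assert z.shape == (20,)
assert torch.equal(z, x)  # ...and it sees the whole original storage, i.e. all of x
```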
For how to do all this, you can take the linked code as a reference.
I feel like this is a more complex version of falling back the whole computation of as_strided to CPU.
Never mind. I think I finally understood what you were saying. That said, I don't think we even need to move things to the CPU. We only need to apply the operation on the base tensor.
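(A rough sketch of that idea, using the underscored, internal ._base attribute to reach the tensor that owns the storage; this only illustrates the approach on CPU and is not the actual fix.)

```python
import torch

x = torch.randn(20)
y = x[10:]

# y._base is the tensor that owns the storage (x here; None for non-views), so the
# strided result can be rebuilt from it directly, without a CPU round-trip of y.
base = y._base if y._base is not None else y
z = base.as_strided((20,), (1,), 0)
assert torch.equal(z, x)
```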
We have a tentative PoR to get AOTAutograd to stop emitting as_strided() in all cases (that I'm hoping to get around to early next year). There isn't a detailed design doc yet (I'll have one before working on it), but the high-level idea is written down here with Ed: https://docs.google.com/document/d/1DlfFq8TKbuAn2zyJxLfoW-X1qkkm5PLdHFtySo03QAk/edit
🐛 Bug

The following usage of as_strided errors when lowering. This error shows up when trying to execute hf_Reformer from Torchbench, using openxla as backend. As far as I understand, the problem is that AOTAutograd is calling as_strided -- not entirely sure why. This problem seems to be related to the limitations of reshape functions in XLA, as suggested in #2964.

Expected behavior

I would expect it to break earlier, say, when XLANativeFunctions::as_strided is being executed, instead of when it gets to the lowering part. Or, maybe better than that, we could fall back to CPU while issuing a warning that "as_strided is creating a copy, which may not be optimal...".

Environment
PyTorch/XLA: c9a1324 (Oct 3)
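(For what it's worth, a minimal sketch of the "break earlier, or warn and fall back" behavior described under Expected behavior, expressed in Python; the helper name is made up, and the real change would belong in the C++ XLANativeFunctions::as_strided path.)

```python
import warnings
import torch

def as_strided_or_fallback(t: torch.Tensor, size, stride, storage_offset=0):
    """Hypothetical helper: error early when the view cannot fit even in the base
    tensor, and warn when it only works by going through the base (which may mean
    a copy on lazy backends). Assumes non-empty sizes and non-negative strides."""
    base = t._base if t._base is not None else t
    last = storage_offset + sum((s - 1) * st for s, st in zip(size, stride))
    if last >= base.numel():
        raise RuntimeError(
            f"as_strided: view touches element {last}, but the base only has {base.numel()}")
    if last >= t.numel():
        warnings.warn("as_strided reaches outside the input view; "
                      "materializing it from the base tensor may create a copy")
    return base.as_strided(size, stride, storage_offset)
```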