Sharing tensor storage (with DLPack) results in unexpected behavior. #7304

ysiraichi opened this issue Jun 17, 2024 · 3 comments
ysiraichi commented Jun 17, 2024

🐛 Bug

Consider the following tensors:

  • t0: an XLA tensor
  • t1: a slice of t0
  • t2: a CUDA tensor that shares storage with t0
>>> t0 = torch.arange(6).view(2, 3).to("xla")
# t0 = [[0, 1, 2], [3, 4, 5]]
>>> t1 = t0[1]
# t1 = [3, 4, 5]
>>> t2 = torch_xla.dlpack.from_xla_cuda_to_cuda(t0) # t2 is on CUDA
# t2 = [[0, 1, 2], [3, 4, 5]]

Then, consider the following sequence of in-place operations:

# (I)
>>> t2[1, 0] = 10  
# t0 = [[0, 1, 2], [10, 4, 5]]
# t1 = [3, 4, 5]
# t2 = [[0, 1, 2], [10, 4, 5]]

# (II)
>>> t1[0] *= 2  
# t0 = [[0, 1, 2], [6, 4, 5]]
# t1 = [6, 4, 5]
# t2 = [[0, 1, 2], [10, 4, 5]]

Due to #7198, we already expect that the effects of the in-place operation (II) will not be propagated back to the CUDA tensor t2. However, since t1 is a view of t0, I would expect (II) to use the values updated by (I). Instead, (II) clearly uses the value of t0 from before (I), unexpectedly resulting in t0[1, 0] == 6.

This problem happens because of how the functionalization layer applies updates to the base tensor of a view. Even though the functionalization implementation does try to synchronize (i.e. regenerate the view from its base), the regeneration doesn't happen because the functional wrapper of t0 doesn't know that its underlying data has changed.
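
For intuition, here is a rough, hypothetical Python sketch of the bookkeeping described above. The names (FunctionalWrapper, view_fn, generation) are illustrative stand-ins, not the actual C++ API of the functionalization layer:

class FunctionalWrapper:
    def __init__(self, value, base=None, view_fn=None):
        self.value = value        # currently materialized value
        self.base = base          # base wrapper, if this is a view
        self.view_fn = view_fn    # recomputes this view from the base
        self.generation = 0       # bumped on every tracked in-place update

    def sync(self):
        # Regenerate the view only when the base is known to be newer.
        # The failure mode in this issue: a write through the shared DLPack
        # buffer changes the device data without bumping the base's
        # generation, so this check sees nothing to do and a stale view
        # value gets used.
        if self.base is not None and self.base.generation > self.generation:
            self.value = self.view_fn(self.base.value)
            self.generation = self.base.generation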

Expected behavior

Given #7198, I think a reasonable result would be:

>>> t1[0] *= 2  
# t0 = [[0, 1, 2], [20, 4, 5]]
# t1 = [20, 4, 5]
# t2 = [[0, 1, 2], [10, 4, 5]]

i.e. run the in-place operation with the updated base values.
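
For reference, this matches what happens when the tensors genuinely alias. A minimal CPU-only sketch (an illustration, assuming torch.from_dlpack accepts a tensor directly, as in recent PyTorch releases; note that here t2 also sees the update from (II) because it truly shares storage, which is exactly the propagation #7198 gives up on for XLA):

>>> import torch
>>> t0 = torch.arange(6).view(2, 3)   # plain CPU tensor
>>> t1 = t0[1]                        # true view of t0
>>> t2 = torch.from_dlpack(t0)        # shares t0's storage via DLPack
>>> t2[1, 0] = 10                     # (I)
>>> t1[0] *= 2                        # (II) sees the value written by (I)
>>> t0
tensor([[ 0,  1,  2],
        [20,  4,  5]])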

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
  • torch_xla version: 3f22daa

cc @miladm @JackCaoG @vanbasten23 @bdhirsh @lezcano

@ManfeiBai

Hi @vanbasten23, is it ok to assign this ticket to you?


vanbasten23 commented Jul 16, 2024

Thanks for reporting the issue. I did some investigation.

I think there is a typo in your repro:

# (I)
>>> t2[1, 0] = 10  
# t0 = [[0, 1, 2], [10, 4, 5]]
# t1 = [3, 4, 5]
# t2 = [[0, 1, 2], [10, 4, 5]]

After we do t2[1,0]=10, t1 should be [10, 4, 5]. Could you double check?

Second, regarding "since t1 is a view of t0": I'm not sure about that. When we do t1 = t0[1], the torch_xla operation XLANativeFunctions::select_copy is called, so t1 and t0 may not share the same buffer:

>>> import torch, torch_xla
>>> import torch_xla.debug.metrics as met
>>> import torch_xla.core.xla_model as xm
>>> t0 = torch.arange(6).view(2, 3).to("xla")
>>> t1 = t0[1]
>>> torch_xla._XLAC._unsafe_buffer_pointer(t0)
139902419206144
>>> torch_xla._XLAC._unsafe_buffer_pointer(t1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: torch_xla/csrc/init_python_bindings.cpp:2681 : Could not get the buffer pointer for XLATensor with IR that's not DeviceData
>>> print(t1)
I0000 00:00:1721172456.872579   12480 cuda_dnn.cc:530] Loaded cuDNN version 8902
tensor([3, 4, 5], device='xla:0')
>>> torch_xla._XLAC._unsafe_buffer_pointer(t1)
139902419206400

It means t1 is the output of the XLA graph and t0 is the input. So if you modify t1 with t1[0] *= 2, it shouldn't propagate to t0. Wdyt? @ysiraichi
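
One way to inspect that relationship before t1 is materialized is to dump its lazy IR (assuming the debug binding torch_xla._XLAC._get_xla_tensors_text is available in this build); the IR should show t1 being computed from t0's device data rather than aliasing its buffer:

>>> import torch, torch_xla
>>> t0 = torch.arange(6).view(2, 3).to("xla")
>>> t1 = t0[1]
>>> print(torch_xla._XLAC._get_xla_tensors_text([t1]))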


ysiraichi commented Jul 17, 2024

> After we do t2[1,0]=10, t1 should be [10, 4, 5]. Could you double check?

I'm pretty sure it's not, for the following reasons:

  • Although the functionalization layer preserves the aliasing relation between t0 and t1, they don't actually share memory
  • Modifying t2 modifies the actual memory (i.e. PjRtBuffer), without going through the functionalization layer

> So if you modify t1 with t1[0] *= 2, it shouldn't propagate to t0.

I believe it should. The functionalization layer is what guarantees that mutations are propagated through aliasing relations. In other words, even though t0 and t1 don't actually share memory, the functionalization layer knows that whenever an in-place operation takes place, it should update the base tensor (eagerly) and the other views (lazily).
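
Taken in isolation, that propagation does work today. A minimal sketch (assuming a recent torch_xla build with functionalization enabled, which is the default): the in-place update on the view t1 is reflected in the base t0 even though they live in separate device buffers, just as in step (II) of the repro above (3 * 2 == 6).

>>> import torch, torch_xla
>>> t0 = torch.arange(6).view(2, 3).to("xla")
>>> t1 = t0[1]
>>> t1[0] *= 2
>>> t0
tensor([[0, 1, 2],
        [6, 4, 5]], device='xla:0')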
