
.to_numpy(), .to_cupy(), etc. #55

Closed
ax3l opened this issue Aug 3, 2022 · 6 comments · Fixed by #88
Labels
backend: cuda (Specific to CUDA execution (GPUs)) · enhancement (New feature or request)

Comments

ax3l (Member) commented Aug 3, 2022

For all objects that expose the __array_interface__ protocol, we should add a helper member function called .to_numpy() that does nothing but create a view:

```python
np.array(self, copy=False, order='F')
cupy.array(self, copy=False, order='F')
```

Equivalently, we want to add .to_copy() et al. functions for #30 and, later, for DLPack interfaces.
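For illustration, a minimal self-contained sketch of how such a zero-copy view works through __array_interface__ (the Buffer class and its layout are hypothetical, not this project's API):

```python
import numpy as np

# Hypothetical container: any object exposing __array_interface__
# can be wrapped by NumPy as a view, without copying.
class Buffer:
    def __init__(self):
        self._data = np.arange(6.0)
        # Re-export the interface of the underlying storage.
        self.__array_interface__ = self._data.__array_interface__

    def to_numpy(self):
        # copy=False: NumPy wraps the exposed memory instead of copying it.
        return np.array(self, copy=False)

buf = Buffer()
view = buf.to_numpy()
view[0] = 42.0
assert buf._data[0] == 42.0  # the view shares memory with the buffer
```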

ax3l added the enhancement label Aug 3, 2022
ax3l (Member Author) commented Oct 21, 2022

Did some tests with order='F', and it is not fully obvious whether it helps. It does seem to keep the order of arguments in shape and in array index access the same...
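For reference, a small NumPy sketch of what order='F' does and does not change (stride values shown assume a default 8-byte integer dtype):

```python
import numpy as np

x = np.arange(6).reshape(2, 3)   # C-ordered
f = np.asfortranarray(x)         # F-ordered copy of the same values

# Shape and index order are identical; only the memory strides differ.
assert x.shape == f.shape == (2, 3)
assert x[1, 2] == f[1, 2]
print(x.strides, f.strides)      # e.g. (24, 8) vs. (8, 24)
```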

ax3l mentioned this issue Oct 21, 2022
ax3l (Member Author) commented Jun 6, 2023

@dpgrote, @RemiLehe, and I did some performance tests of:

```python
import cupy as cp

x = cp.random.rand(10000, 20000)   # C-ordered by default
f = cp.copy(x, order='F')          # F-ordered copy

x.flags
#   C_CONTIGUOUS : True
#   F_CONTIGUOUS : False
#   OWNDATA : True

f.flags
#   C_CONTIGUOUS : False
#   F_CONTIGUOUS : True
#   OWNDATA : True

r = x**2; cp.cuda.runtime.deviceSynchronize()
%timeit -n 10 r = x**2; cp.cuda.runtime.deviceSynchronize()
# 283 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

r = f**2; cp.cuda.runtime.deviceSynchronize()
%timeit -n 10 r = f**2; cp.cuda.runtime.deviceSynchronize()
# 409 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Here (as we also know from NumPy), the kernels always loop contiguously over what is assumed to be the fastest-running index in C order. Thus, we would not do the user a favor by returning our data with:

  • strides updated for F order,
  • shape updated for F order, or
  • np.array(self, copy=False, order='F') (so that arr.flags reports F_CONTIGUOUS).
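One way to see this: an F-contiguous array is just a C-contiguous buffer indexed in reverse order, which is exactly what a transposed view provides. Continuing the CuPy session above:

```python
# f from above is F-contiguous; its transpose is a C-contiguous *view*
# of the same device memory -- no data movement involved.
assert f.T.flags.c_contiguous
assert not f.T.flags.owndata
```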

ax3l (Member Author) commented Jun 7, 2023

Idea: we add an order='F' argument to our .to_... functions, doing the conventional/convenient (F) thing by default, and document the fast (C) option as the one to prefer when tuning for external libraries. :)
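A minimal sketch of that idea (function name and signature are assumptions; the real interface is what lands in #88):

```python
import numpy as np

def to_numpy(self, copy=False, order="F"):
    # Hypothetical helper: `self` exposes a C-ordered buffer
    # via __array_interface__.
    arr = np.array(self, copy=copy)
    if order == "F":
        # .T is a transposed view: F-contiguous flags, reversed shape,
        # same memory -- the convenient default for Fortran-style indexing.
        return arr.T
    return arr  # order="C": the fast path for C-ordered external kernels
```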

ax3l (Member Author) commented Aug 7, 2023

It turns out this is pretty easy to achieve via .T, which returns a view rather than a copy:

```python
In [1]: import numpy as np

In [2]: x = np.array([[1, 2, 3], [4, 5, 6]])

In [3]: x.flags
Out[3]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

In [4]: x.T.flags
Out[4]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
```

Seen in an implementation by @dpgrote
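The same trick carries over to CuPy unchanged (a small sketch, assuming a device array):

```python
import cupy as cp

x = cp.random.rand(4, 5)  # C-ordered device array
f_view = x.T              # F-contiguous view: no device-side copy
assert f_view.flags.f_contiguous
assert not f_view.flags.owndata
```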

ax3l (Member Author) commented Aug 7, 2023

Updated #88.

ax3l (Member Author) commented Aug 7, 2023

Opened a performance request in CuPy: cupy/cupy#7783

ax3l closed this as completed in #88 Sep 21, 2023
ax3l added the backend: cuda label Oct 3, 2023