
do not collect arrays in orient if the underlying array is a GPUArray #10

Merged (2 commits into mcabbott:master, Feb 3, 2020)
Conversation

@jfeist (Contributor) commented Jan 3, 2020

The collect(A) call in orient for Union{PermutedDimsArray, LinearAlgebra.Transpose, LinearAlgebra.Adjoint} arrays copies GPU arrays (e.g., CuArray) to CPU memory, so that broadcasting then fails. This pull request fixes that. I'm not aware of an easy way to check whether any of these wrapper types is backed by a GPU array, so I defined a new function storage_type that makes this check easy (this feels like a function that could make sense in Adapt.jl or something similar; pinging @maleadt in case there is interest).
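
A minimal sketch of the storage_type idea (illustrative only; the definition in this PR may differ in detail): unwrap parent() until the wrapper bottoms out, then report the type of the underlying storage.

function storage_type(A::AbstractArray)
    P = parent(A)
    # a plain Array is its own parent, so the recursion stops there
    typeof(P) === typeof(A) ? typeof(A) : storage_type(P)
end
storage_type(A) = typeof(A)

# e.g. storage_type(transpose(CuArray(rand(3)))) <: CuArray   # true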

A short test that shows the failure (without this pull request):

using TensorCast
using CuArrays

N = 100
A = CuArray(zeros(N,N,N))
B = CuArray(rand(N))
C = CuArray(rand(N,N))

@cast A[i1, i2, i3] = B[i2] * C[i3, i1]  # fails: orient collects the permuted C into a CPU Array, which the GPU kernel cannot accept

Which gives

ERROR: LoadError: GPU compilation of #25(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Base.Broadcast.Extruded{LinearAlgebra.Transpose{Float64,CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{Array{Float64,3},Tuple{Bool,Bool,Bool},Tuple{Int64,Int64,Int64}}}}) failed
KernelError: passing and using non-bitstype argument

Here's a short gist that tests some more cases and also gives the timings (showing that on the GPU, at least with CuArrays, skipping collect does not cause any significant slowdown): https://gist.github.com/jfeist/ed3bc59e86107ddcdeffbd32f414eb45
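
The gist's harness is a function test_cast(T) (called later in this thread); roughly, it looks like the following sketch. GPU synchronization details are elided, and the real script covers more cases.

using TensorCast, BenchmarkTools

function test_cast(T)
    println("T = ", T)
    N = 100
    A = T(zeros(N, N, N)); B = T(rand(N)); C = T(rand(N, N))
    println("transpose (2D)")
    @btime @cast $A[i1, i2, i3] = $B[i2] * $C[i3, i1]
    # ...the gist also times the permute (3D), adjoint (2D), and view (2D) cases
end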

@mcabbott (Owner) commented Jan 5, 2020

Thanks for this, indeed collect(A) is clearly the wrong thing for CuArrays.

Another approach I meant to look at was orient(A, (:,*,:)) == view(A, :, [CartesianIndex()], :), which I think is better than collect |> reshape on transposed Arrays. Might this work for CuArrays too?

Am travelling the next few weeks, will take a more serious look when I get home.
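
(For reference, the view trick described above needs nothing beyond Base: indexing with a one-element vector of CartesianIndex() inserts a singleton dimension without copying. A minimal illustration:)

A = rand(3, 4)
V = view(A, :, [CartesianIndex()], :)   # inserts a singleton axis, no copy
size(V) == (3, 1, 4)                    # true
all(V .== reshape(A, 3, 1, 4))          # true; V shares A's memory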

@jfeist (Contributor, author) commented Jan 5, 2020

Just did a quick benchmark. Using view instead of collect |> reshape with a normal Array seems somewhat slower for 2D transposed arrays, is amazingly slow for 3D PermutedDimsArray, and gives a segmentation fault on CuArray. (In each of the blocks below, the first line does not need a transpose and the second one does. The updated benchmark code, now with @btime, is in the gist linked above.)

T = Array
transpose (2D)
  427.612 μs (8 allocations: 288 bytes)
  720.348 μs (10 allocations: 320 bytes)
permute (3D)
  128.705 ms (8 allocations: 304 bytes)
  14.711 s (400000019 allocations: 5.96 GiB)
adjoint (2D)
  431.895 μs (8 allocations: 288 bytes)
  724.870 μs (9 allocations: 304 bytes)
view (2D)
  66.839 μs (9 allocations: 304 bytes)
  68.069 μs (11 allocations: 336 bytes)

T = CuArray
transpose (2D)
signal (11): Segmentation fault
in expression starting at /home/feist/.julia/dev/TensorCast/tt.jl:43
_ZN4llvm11PointerType3getEPNS_4TypeEj at /share/apps/julia/julia-1.3.0/bin/../lib/julia/libLLVM-6.0.so (unknown line)
unknown function (ip: 0x176f807)
Allocations: 1663885185 (Pool: 1663871766; Big: 13419); GC: 156

Here are the timings with the current pull request:

T = Array
transpose (2D)
  457.101 μs (7 allocations: 240 bytes)
  474.047 μs (10 allocations: 78.45 KiB)
permute (3D)
  102.921 ms (7 allocations: 256 bytes)
  99.391 ms (20 allocations: 7.63 MiB)
adjoint (2D)
  427.734 μs (7 allocations: 240 bytes)
  482.036 μs (9 allocations: 78.44 KiB)
view (2D)
  190.244 μs (6 allocations: 144 bytes)
  56.392 μs (10 allocations: 13.58 KiB)

T = CuArray
transpose (2D)
  103.190 μs (64 allocations: 3.28 KiB)
  143.827 μs (65 allocations: 3.59 KiB)
permute (3D)
  7.104 ms (64 allocations: 3.53 KiB)
  11.389 ms (99 allocations: 7.28 KiB)
adjoint (2D)
  89.853 μs (63 allocations: 3.27 KiB)
  145.562 μs (65 allocations: 3.59 KiB)
view (2D)
  38.042 μs (86 allocations: 7.20 KiB)
  38.880 μs (88 allocations: 7.30 KiB)

And for reference, the timings for Array with reshape but without collect:

T = Array
transpose (2D)
  455.128 μs (7 allocations: 240 bytes)
  2.173 ms (5 allocations: 128 bytes)
permute (3D)
  146.538 ms (7 allocations: 256 bytes)
  713.367 ms (17 allocations: 816 bytes)
adjoint (2D)
  425.299 μs (7 allocations: 240 bytes)
  2.039 ms (5 allocations: 128 bytes)
view (2D)
  190.565 μs (6 allocations: 144 bytes)
  198.567 μs (6 allocations: 144 bytes)

Looking at the timings for the current pull request, I also realized that it might be beneficial to always collect SubArrays, even if they are not SubArrays of LazyPerm. Right now, a transposed SubArray (which gets collected) is 4x faster (for the specific case I'm testing) than a non-transposed SubArray:

view (2D)
  190.244 μs (6 allocations: 144 bytes)
  56.392 μs (10 allocations: 13.58 KiB)

Always collecting SubArrays makes both these cases equally fast (a sketch of the idea follows the timings):

view (2D)
  52.945 μs (9 allocations: 13.56 KiB)
  52.875 μs (10 allocations: 13.58 KiB)
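
In code, the idea could be as simple as the following hypothetical sketch (using orient as discussed earlier in the thread; not the merged code). The trade-off is one extra contiguous copy in exchange for a faster broadcast.

# Hypothetical: collect every CPU-backed SubArray up front, so the broadcast
# that follows runs over contiguous memory rather than a strided view.
orient(A::SubArray{<:Any, <:Any, <:Array}, code) = orient(collect(A), code)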

@mcabbott (Owner) left a comment

Sorry about the delay; I finally took a look at this. Something is wrong with my desktop ("CUDAdrv.jl failed to initialize"), so I haven't managed to run your script myself yet.

But many thanks for all the timings, and for ruling out my view idea. I am impressed that reshape doesn't appear to hurt CuArrays, but the whole thing is black magic, really.

(Three review comments on src/view.jl, since resolved.)

@jfeist (Contributor, author) commented Feb 3, 2020

I've implemented your suggestions and rebased onto master. It now calls collect only on "normal" Arrays, which is probably the safest version anyway, and does not depend on GPUArrays.

I've left the two commits in the branch, but the second one reverts a few changes from the first (e.g., it now touches only view.jl), so you'll probably want to squash & merge.
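
Sketched, that final rule might look like this (hypothetical code; the real version lives in src/view.jl, maybe_collect is an illustrative name, and storage_type is the sketch from above):

# Collect only when the underlying storage is a plain CPU Array; GPU-backed
# wrappers are reshaped lazily, so a device-to-host copy can never happen.
maybe_collect(A) = storage_type(A) <: Array ? collect(A) : A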

@mcabbott (Owner) commented Feb 3, 2020

Thanks! Not fussy about commits, so no need to bother squashing.

@mcabbott merged commit 724ffe2 into mcabbott:master on Feb 3, 2020
@jfeist deleted the orient_gpu branch on August 24, 2020

@mcabbott (Owner) commented Feb 5, 2021

Since #31 changes all this, I thought it might be worth re-running this benchmark script.

Before, version 0.3.2:

julia> const CuArrays = CUDA;

julia> test_cast(Array)
T = Array
transpose (2D)
  243.447 μs (13 allocations: 304 bytes)
  246.557 μs (15 allocations: 78.50 KiB)
permute (3D)
  60.864 ms (13 allocations: 320 bytes)
  64.812 ms (19 allocations: 7.63 MiB)
adjoint (2D)
  243.311 μs (13 allocations: 304 bytes)
  260.762 μs (15 allocations: 78.50 KiB)
view (2D)
  143.562 μs (11 allocations: 192 bytes)
  33.190 μs (14 allocations: 13.61 KiB)

julia> test_cast(CuArray)
T = CuArray
transpose (2D)
  611.299 μs (32 allocations: 816 bytes)
  1.152 ms (30 allocations: 1.03 KiB)
permute (3D)
  71.966 ms (32 allocations: 832 bytes)
  130.729 ms (103 allocations: 6.91 KiB)
adjoint (2D)
  683.672 μs (32 allocations: 816 bytes)
  1.196 ms (30 allocations: 1.03 KiB)
view (2D)
  93.136 μs (29 allocations: 880 bytes)
  141.876 μs (29 allocations: 880 bytes)

(jl_dSsnnZ) pkg> st TensorCast
Status `/private/var/folders/24/y2blky9x40gdd_kp3zgx69rc0000gn/T/jl_dSsnnZ/Project.toml`
  [02d47bb6] TensorCast v0.3.2

After, with (the present state of) #31:

  • This never copies, hence much lower allocations on CPU, although unfortunately slower.
  • It's a slow GPU, but perhaps the relative speeds mean something. And they seem better.
julia> test_cast(Array)
T = Array
transpose (2D)
  241.742 μs (8 allocations: 304 bytes)
  466.370 μs (7 allocations: 208 bytes)
permute (3D)
  64.619 ms (8 allocations: 320 bytes)
  85.008 ms (7 allocations: 208 bytes)
adjoint (2D)
  242.079 μs (8 allocations: 304 bytes)
  466.952 μs (7 allocations: 208 bytes)
view (2D)
  32.243 μs (3 allocations: 64 bytes)
  42.921 μs (3 allocations: 64 bytes)

julia> test_cast(CuArray)
T = CuArray
transpose (2D)
  631.845 μs (17 allocations: 640 bytes)
  911.989 μs (16 allocations: 576 bytes)
permute (3D)
  74.894 ms (17 allocations: 656 bytes)
  89.867 ms (16 allocations: 576 bytes)
adjoint (2D)
  642.698 μs (17 allocations: 640 bytes)
  895.911 μs (16 allocations: 576 bytes)
view (2D)
  79.130 μs (16 allocations: 1008 bytes)
  95.372 μs (16 allocations: 1008 bytes)

@jfeist (Contributor, author) commented Feb 5, 2021

Nice work! I also redid the tests quickly. Interestingly, my CPU seems a lot slower than yours, but I do not see a slowdown with the new pull request, and there is a significant speedup in the view (2D) test. On the GPU (a GeForce GTX 1080 Ti), things seem slightly faster overall, and no test shows a significant slowdown.

With v0.3.2:

T = Array
transpose (2D)
  1.517 ms (5 allocations: 176 bytes)
  1.540 ms (7 allocations: 78.38 KiB)
permute (3D)
  142.071 ms (5 allocations: 192 bytes)
  146.177 ms (11 allocations: 7.63 MiB)
adjoint (2D)
  1.408 ms (5 allocations: 176 bytes)
  1.427 ms (7 allocations: 78.38 KiB)
view (2D)
  230.318 μs (3 allocations: 64 bytes)
  83.058 μs (6 allocations: 13.48 KiB)

T = CuArray
transpose (2D)
  102.426 μs (28 allocations: 688 bytes)
  119.206 μs (24 allocations: 576 bytes)
permute (3D)
  10.075 ms (28 allocations: 704 bytes)
  14.606 ms (98 allocations: 7.23 KiB)
adjoint (2D)
  103.254 μs (28 allocations: 688 bytes)
  133.860 μs (24 allocations: 576 bytes)
view (2D)
  24.434 μs (24 allocations: 576 bytes)
  24.011 μs (24 allocations: 576 bytes)

With #31:

T = Array
transpose (2D)
  1.377 ms (8 allocations: 304 bytes)
  1.515 ms (7 allocations: 208 bytes)
permute (3D)
  146.386 ms (8 allocations: 320 bytes)
  146.927 ms (7 allocations: 208 bytes)
adjoint (2D)
  1.402 ms (8 allocations: 304 bytes)
  1.407 ms (7 allocations: 208 bytes)
view (2D)
  75.963 μs (3 allocations: 64 bytes)
  71.707 μs (3 allocations: 64 bytes)

T = CuArray
transpose (2D)
  110.096 μs (33 allocations: 832 bytes)
  100.527 μs (30 allocations: 736 bytes)
permute (3D)
  10.433 ms (33 allocations: 848 bytes)
  10.110 ms (30 allocations: 736 bytes)
adjoint (2D)
  110.828 μs (33 allocations: 832 bytes)
  92.674 μs (30 allocations: 736 bytes)
view (2D)
  19.824 μs (28 allocations: 688 bytes)
  21.389 μs (28 allocations: 688 bytes)

BTW, to make sure that the slower CPU is due to the model and not some other problem (I tried this on Julia 1.6.0-beta1), here's the output of versioninfo():

julia> versioninfo()
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, broadwell)

@mcabbott (Owner) commented Feb 5, 2021

Many thanks for having another look. Interesting that yours doesn't slow down for the transpose; mine is an i7-8700, which is (I think) a lot newer, and maybe there are more complicated cache effects or something.

While I'm here, here is the same test with @cast replaced by @tullio: sometimes quite a bit faster on the CPU, though multi-threaded, so perhaps that's cheating. It is quite a bit slower on the GPU; this is the most naive possible KernelAbstractions code, whereas I guess the broadcasting implementation is doing smarter things.
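
(The substitution is presumably just the macro name; a sketch with the same index expression as the original example:)

using Tullio, KernelAbstractions, CUDA

# Same contraction as the earlier @cast test; with CUDA and KernelAbstractions
# loaded, Tullio compiles a (naive) GPU kernel for CuArray arguments.
@tullio A[i1, i2, i3] = B[i2] * C[i3, i1]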

julia> using Tullio, KernelAbstractions, CUDA

julia> test_tullio(Array)
T = Array
transpose (2D)
  122.615 μs (47 allocations: 4.73 KiB)
  177.265 μs (47 allocations: 4.73 KiB)
permute (3D)
  64.475 ms (69 allocations: 7.73 KiB)
  65.791 ms (69 allocations: 7.73 KiB)
adjoint (2D)
  134.513 μs (47 allocations: 4.73 KiB)
  144.674 μs (47 allocations: 4.73 KiB)
view (2D)
  39.397 μs (3 allocations: 64 bytes)
  50.674 μs (3 allocations: 64 bytes)
  
julia> test_tullio(CuArray)
T = CuArray
transpose (2D)
  1.532 ms (105 allocations: 4.05 KiB)
  2.048 ms (105 allocations: 4.05 KiB)
permute (3D)
  261.482 ms (115 allocations: 4.77 KiB)
  316.129 ms (115 allocations: 4.77 KiB)
adjoint (2D)
  1.630 ms (110 allocations: 4.12 KiB)
  2.088 ms (110 allocations: 4.12 KiB)
view (2D)
  300.322 μs (116 allocations: 5.97 KiB)
  285.933 μs (116 allocations: 5.97 KiB)
