do not collect arrays in orient if the underlying array is a GPUArray #10
Conversation
Thanks for this, indeed. Another approach I meant to look at was … Am travelling the next few weeks, will take a more serious look when I get home.
Just did a quick benchmark.

T = Array
transpose (2D)
427.612 μs (8 allocations: 288 bytes)
720.348 μs (10 allocations: 320 bytes)
permute (3D)
128.705 ms (8 allocations: 304 bytes)
14.711 s (400000019 allocations: 5.96 GiB)
adjoint (2D)
431.895 μs (8 allocations: 288 bytes)
724.870 μs (9 allocations: 304 bytes)
view (2D)
66.839 μs (9 allocations: 304 bytes)
68.069 μs (11 allocations: 336 bytes)
T = CuArray
transpose (2D)
signal (11): Segmentation fault
in expression starting at /home/feist/.julia/dev/TensorCast/tt.jl:43
_ZN4llvm11PointerType3getEPNS_4TypeEj at /share/apps/julia/julia-1.3.0/bin/../lib/julia/libLLVM-6.0.so (unknown line)
unknown function (ip: 0x176f807)
Allocations: 1663885185 (Pool: 1663871766; Big: 13419); GC: 156

Here are the timings with the current pull request:

T = Array
transpose (2D)
457.101 μs (7 allocations: 240 bytes)
474.047 μs (10 allocations: 78.45 KiB)
permute (3D)
102.921 ms (7 allocations: 256 bytes)
99.391 ms (20 allocations: 7.63 MiB)
adjoint (2D)
427.734 μs (7 allocations: 240 bytes)
482.036 μs (9 allocations: 78.44 KiB)
view (2D)
190.244 μs (6 allocations: 144 bytes)
56.392 μs (10 allocations: 13.58 KiB)
T = CuArray
transpose (2D)
103.190 μs (64 allocations: 3.28 KiB)
143.827 μs (65 allocations: 3.59 KiB)
permute (3D)
7.104 ms (64 allocations: 3.53 KiB)
11.389 ms (99 allocations: 7.28 KiB)
adjoint (2D)
89.853 μs (63 allocations: 3.27 KiB)
145.562 μs (65 allocations: 3.59 KiB)
view (2D)
38.042 μs (86 allocations: 7.20 KiB)
38.880 μs (88 allocations: 7.30 KiB)

And for reference, the timings without this pull request:

Activating environment at `~/.julia/dev/TensorCast/Project.toml`
T = Array
transpose (2D)
455.128 μs (7 allocations: 240 bytes)
2.173 ms (5 allocations: 128 bytes)
permute (3D)
146.538 ms (7 allocations: 256 bytes)
713.367 ms (17 allocations: 816 bytes)
adjoint (2D)
425.299 μs (7 allocations: 240 bytes)
2.039 ms (5 allocations: 128 bytes)
view (2D)
190.565 μs (6 allocations: 144 bytes)
198.567 μs (6 allocations: 144 bytes)

Looking at the timings for the current pull request, I also realized that it might be beneficial to always collect SubArrays, even if they are not SubArrays of LazyPerm. Right now, a transposed SubArray (which gets collected) is 4x faster (for the specific case I'm testing) than a non-transposed SubArray:

view (2D)
190.244 μs (6 allocations: 144 bytes)
56.392 μs (10 allocations: 13.58 KiB)

Always collecting SubArrays makes both these cases equally fast:

view (2D)
52.945 μs (9 allocations: 13.56 KiB)
52.875 μs (10 allocations: 13.58 KiB)
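For reference, the script behind these timings is the gist linked at the bottom of this thread. A minimal sketch of such a harness, assuming array sizes and expressions rather than copying the gist:

using TensorCast, BenchmarkTools

trans_cast(A) = @cast B[j,i] := 2 * A[i,j]   # transposed read, then broadcast

function test_cast(T)
    println("T = ", T)
    A = T(rand(Float32, 1000, 1000))
    println("transpose (2D)")
    @btime permutedims($A) .* 2    # hand-written baseline (assumed)
    @btime trans_cast($A)          # the @cast version (assumed)
end

Each pair of timings above presumably compares such a hand-written baseline against the macro version.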
Sorry about the delay, I finally took a look at this. Something is wrong with my desktop, though ("CUDAdrv.jl failed to initialize"), so I haven't managed to run your script myself yet.
But many thanks for all the timings, and for ruling out my view idea. I am impressed that reshape doesn't appear to hurt CuArrays, but the whole thing is black magic really.
I've implemented your suggestions and rebased to master. It now calls … I've left the two commits in the branch, but the second one reverts a few changes of the first one (e.g., it now changes only …).
Thanks! Not fussy about commits, so no need to bother squashing.
Since #31 changes all this, I thought it might be worth re-running this benchmark script. Before, version 0.3.2:

julia> const CuArrays = CUDA;
julia> test_cast(Array)
T = Array
transpose (2D)
243.447 μs (13 allocations: 304 bytes)
246.557 μs (15 allocations: 78.50 KiB)
permute (3D)
60.864 ms (13 allocations: 320 bytes)
64.812 ms (19 allocations: 7.63 MiB)
adjoint (2D)
243.311 μs (13 allocations: 304 bytes)
260.762 μs (15 allocations: 78.50 KiB)
view (2D)
143.562 μs (11 allocations: 192 bytes)
33.190 μs (14 allocations: 13.61 KiB)
julia> test_cast(CuArray)
T = CuArray
transpose (2D)
611.299 μs (32 allocations: 816 bytes)
1.152 ms (30 allocations: 1.03 KiB)
permute (3D)
71.966 ms (32 allocations: 832 bytes)
130.729 ms (103 allocations: 6.91 KiB)
adjoint (2D)
683.672 μs (32 allocations: 816 bytes)
1.196 ms (30 allocations: 1.03 KiB)
view (2D)
93.136 μs (29 allocations: 880 bytes)
141.876 μs (29 allocations: 880 bytes)
(jl_dSsnnZ) pkg> st TensorCast
Status `/private/var/folders/24/y2blky9x40gdd_kp3zgx69rc0000gn/T/jl_dSsnnZ/Project.toml`
[02d47bb6] TensorCast v0.3.2

After, with (the present state of) #31:
julia> test_cast(Array)
T = Array
transpose (2D)
241.742 μs (8 allocations: 304 bytes)
466.370 μs (7 allocations: 208 bytes)
permute (3D)
64.619 ms (8 allocations: 320 bytes)
85.008 ms (7 allocations: 208 bytes)
adjoint (2D)
242.079 μs (8 allocations: 304 bytes)
466.952 μs (7 allocations: 208 bytes)
view (2D)
32.243 μs (3 allocations: 64 bytes)
42.921 μs (3 allocations: 64 bytes)
julia> test_cast(CuArray)
T = CuArray
transpose (2D)
631.845 μs (17 allocations: 640 bytes)
911.989 μs (16 allocations: 576 bytes)
permute (3D)
74.894 ms (17 allocations: 656 bytes)
89.867 ms (16 allocations: 576 bytes)
adjoint (2D)
642.698 μs (17 allocations: 640 bytes)
895.911 μs (16 allocations: 576 bytes)
view (2D)
79.130 μs (16 allocations: 1008 bytes)
95.372 μs (16 allocations: 1008 bytes)
Nice work! I also redid the tests quickly. Interestingly, my CPU seems a lot slower than yours, but I do not see a slowdown with the new pull request, and I see a significant speedup in the view (2D) test. On the GPU (GeForce GTX 1080 Ti), things seem slightly faster overall, and no test had a significant slowdown.

With v0.3.2:

T = Array
transpose (2D)
1.517 ms (5 allocations: 176 bytes)
1.540 ms (7 allocations: 78.38 KiB)
permute (3D)
142.071 ms (5 allocations: 192 bytes)
146.177 ms (11 allocations: 7.63 MiB)
adjoint (2D)
1.408 ms (5 allocations: 176 bytes)
1.427 ms (7 allocations: 78.38 KiB)
view (2D)
230.318 μs (3 allocations: 64 bytes)
83.058 μs (6 allocations: 13.48 KiB)
T = CuArray
transpose (2D)
102.426 μs (28 allocations: 688 bytes)
119.206 μs (24 allocations: 576 bytes)
permute (3D)
10.075 ms (28 allocations: 704 bytes)
14.606 ms (98 allocations: 7.23 KiB)
adjoint (2D)
103.254 μs (28 allocations: 688 bytes)
133.860 μs (24 allocations: 576 bytes)
view (2D)
24.434 μs (24 allocations: 576 bytes)
24.011 μs (24 allocations: 576 bytes)

With #31:

T = Array
transpose (2D)
1.377 ms (8 allocations: 304 bytes)
1.515 ms (7 allocations: 208 bytes)
permute (3D)
146.386 ms (8 allocations: 320 bytes)
146.927 ms (7 allocations: 208 bytes)
adjoint (2D)
1.402 ms (8 allocations: 304 bytes)
1.407 ms (7 allocations: 208 bytes)
view (2D)
75.963 μs (3 allocations: 64 bytes)
71.707 μs (3 allocations: 64 bytes)
T = CuArray
transpose (2D)
110.096 μs (33 allocations: 832 bytes)
100.527 μs (30 allocations: 736 bytes)
permute (3D)
10.433 ms (33 allocations: 848 bytes)
10.110 ms (30 allocations: 736 bytes)
adjoint (2D)
110.828 μs (33 allocations: 832 bytes)
92.674 μs (30 allocations: 736 bytes)
view (2D)
19.824 μs (28 allocations: 688 bytes)
21.389 μs (28 allocations: 688 bytes)

BTW, to make sure that the slower CPU is due to the model and not some other problem (I tried this on Julia 1.6.0-beta1), here's the output of versioninfo():

julia> versioninfo()
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, broadwell)
Many thanks for having another look. Interesting that yours doesn't slow down for the transpose; mine is an i7-8700, which is (I think) a lot newer, and maybe there are more complicated cache effects or something.

While I'm here, the same test with Tullio:

julia> using Tullio, KernelAbstractions, CUDA
julia> test_tullio(Array)
T = Array
transpose (2D)
122.615 μs (47 allocations: 4.73 KiB)
177.265 μs (47 allocations: 4.73 KiB)
permute (3D)
64.475 ms (69 allocations: 7.73 KiB)
65.791 ms (69 allocations: 7.73 KiB)
adjoint (2D)
134.513 μs (47 allocations: 4.73 KiB)
144.674 μs (47 allocations: 4.73 KiB)
view (2D)
39.397 μs (3 allocations: 64 bytes)
50.674 μs (3 allocations: 64 bytes)
julia> test_tullio(CuArray)
T = CuArray
transpose (2D)
1.532 ms (105 allocations: 4.05 KiB)
2.048 ms (105 allocations: 4.05 KiB)
permute (3D)
261.482 ms (115 allocations: 4.77 KiB)
316.129 ms (115 allocations: 4.77 KiB)
adjoint (2D)
1.630 ms (110 allocations: 4.12 KiB)
2.088 ms (110 allocations: 4.12 KiB)
view (2D)
300.322 μs (116 allocations: 5.97 KiB)
285.933 μs (116 allocations: 5.97 KiB)
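For comparison, a minimal sketch of what a Tullio equivalent of the transpose test might look like; trans_tullio and the array size are assumptions, not the gist's code:

using Tullio, KernelAbstractions, BenchmarkTools

trans_tullio(A) = @tullio B[j,i] := 2 * A[i,j]   # Tullio writes its own loops
                                                 # (and GPU kernels, via KernelAbstractions)

A = rand(Float32, 1000, 1000)
@btime permutedims($A) .* 2    # plain-Julia baseline
@btime trans_tullio($A)        # Tullio version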
The collect(A) call in orient for Union{PermutedDimsArray, LinearAlgebra.Transpose, LinearAlgebra.Adjoint} arrays copies GPU arrays (e.g., CuArray) to CPU memory, such that broadcasting then fails. This pull request fixes that. I'm not aware of an easy method to check whether any of these wrapper types is based on a GPU array, so I defined a new function storage_type that allows doing this easily (this feels like a function that could make sense in Adapt.jl or something similar - pinging @maleadt in case there is interest).

A short test that shows the failure (without this pull request):
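A minimal sketch of such a test, assuming the CuArrays-era API and made-up sizes; the point is that one operand's lazy transpose gets collected to the CPU while the other stays on the GPU:

using TensorCast, CuArrays   # on current versions, CUDA.jl replaces CuArrays

A = cu(rand(Float32, 3, 4))
B = cu(rand(Float32, 4, 3))
@cast C[i,j] := A[i,j] + B[j,i]   # B[j,i] becomes a Transpose, which orient
                                  # collects into CPU memory, mixing devices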
Which gives an error, since the broadcast then mixes the collected CPU copy with GPU arrays.
Here's a short gist that tests some more cases and also gives the timings (showing that, at least with CuArrays, not collecting does not cause a significant slowdown on the GPU): https://gist.github.com/jfeist/ed3bc59e86107ddcdeffbd32f414eb45
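For reference, a minimal sketch of the storage_type idea described above (an illustration of the approach, not necessarily the PR's exact code): unwrap via parent until it stops unwrapping, then branch on the innermost type.

# Walk down through wrappers (Transpose, Adjoint, PermutedDimsArray, SubArray, ...)
# and report the innermost array type; parent(A) returns A itself once unwrapped.
function storage_type(A::AbstractArray)
    P = parent(A)
    typeof(A) === typeof(P) ? typeof(A) : storage_type(P)
end
storage_type(A) = typeof(A)

With this, orient can do something like storage_type(A) <: Array ? collect(A) : A, keeping the lazy wrapper whenever the data lives on the GPU.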