ShapedIndex #199
Conversation
Helps with conversion of `Int` -> CartesianIndex. This is now used within the indexing pipeline instead of jumping back into `to_index`. The `CartesianIndex` -> `Int` conversion is managed by composing a `StrideIndex`, where the strides are computed using `size_to_strides`, instead of the internal memory representation.
Previously an array that needed a unique method for `known_size` also needed a unique one for `known_size(::Type{A}, dim)`. Now `known_size` is called and then indexed, requiring only one new method.
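A rough sketch of the two conversions described above, using only Base (the helper names here are illustrative stand-ins, not the ArrayInterface API):

```julia
# Column-major strides from a size tuple, analogous to what `size_to_strides`
# is described as computing (illustrative reimplementation).
strides_from_size(sz::Tuple{Vararg{Int}}) = cumprod((1, Base.front(sz)...))

# CartesianIndex -> Int via strides (the `StrideIndex` direction).
function cartesian_to_linear(I::CartesianIndex, sz)
    s = strides_from_size(sz)
    1 + sum((Tuple(I) .- 1) .* s)
end

# Int -> CartesianIndex (the role `ShapedIndex` plays for `Int` indices).
linear_to_cartesian(i::Int, sz) = CartesianIndices(sz)[i]

@assert linear_to_cartesian(7, (2, 3, 4)) == CartesianIndex(1, 1, 2)
@assert cartesian_to_linear(CartesianIndex(1, 1, 2), (2, 3, 4)) == 7
```

The two functions are inverses of each other, which is the round trip the indexing pipeline has to manage.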
Codecov Report

```
@@           Coverage Diff           @@
##           master     #199   +/- ##
=====================================
- Coverage   85.08%   84.82%   -0.26%
=====================================
  Files          11       11
  Lines        1669     1667       -2
=====================================
- Hits         1420     1414       -6
- Misses        249      253       +4
```

Continue to review full report at Codecov.
It looks like the change in test coverage is mostly due to the deletion of code that was previously being tested but is now redundant. It will probably work better for me to address that in conjunction with my next PR.
How about just

```julia
ci = reshape(CartesianIndices(A), :)
ci[i]  # gives the CartesianIndex corresponding to i::Int
```

That's approximately as fast as a single
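For concreteness, the suggestion above needs nothing beyond Base; a minimal sketch:

```julia
A = zeros(2, 3)
ci = reshape(CartesianIndices(A), :)    # lazy linear view of the cartesian indices
@assert ci[4] == CartesianIndex(2, 2)   # column-major: linear index 4 maps to (2, 2)
@assert ci[1] == CartesianIndex(1, 1)
```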
Any reason not to implement it using
The most obvious benefit here is the ability to incorporate
In this case we don't want a representation that will stick around and represents an entire array. The long game here is that it provides a minimalistic representation of an index transform that we can use in the future when creating rules for combining nested index transforms.
```julia
julia> sizeof(ci)
48

julia> ci.
dims  mi  parent

julia> ci.dims
(2000000,)

julia> ci.mi
(Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}(1000, 2361183241434822607, 0, 0x07),)

julia> cii = ci.parent;

julia> typeof(cii)
CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}

julia> cii.indices
(Base.OneTo(1000), Base.OneTo(2000))
```

There's "nothing" there. What you really want are those

And it's even better than I suggested, as much of the time in
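For reference, the `mi` field shown above is one of Base's precomputed multiplicative inverses; `div` works through it directly. A minimal sketch using these non-exported internals (so subject to change between Julia versions):

```julia
using Base.MultiplicativeInverses: SignedMultiplicativeInverse

d = SignedMultiplicativeInverse{Int}(1000)  # precompute the inverse of 1000 once
# Division through the inverse matches ordinary truncating `div`, but compiles
# to multiply + shift instead of a hardware division.
@assert div(123456, d) == div(123456, 1000) == 123
@assert div(-5, d) == div(-5, 1000)
```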
I'm fine with working in

```julia
julia> to_cartesian_reshaped_array(x, i) = @inbounds(reshape(CartesianIndices(axes(x)), :)[i]);

julia> to_cartesian_shaped_index(x, i) = ArrayInterface.ShapedIndex(x)[i];

julia> A = @SArray(zeros(2,2,2));

julia> @btime to_cartesian_reshaped_array($A, 2)
  9.127 ns (0 allocations: 0 bytes)
CartesianIndex(2, 1, 1)

julia> @btime to_cartesian_shaped_index($A, 2)
  0.042 ns (0 allocations: 0 bytes)
NDIndex(2, 1, 1)
```
Anytime you get < 1 clock cycle (≈1/3 ns) you know the compiler is fooling you. Use
I get

```julia
julia> @btime to_cartesian_shaped_index($A, $(Ref(2))[])
  1.644 ns (0 allocations: 0 bytes)
NDIndex(2, 1, 1)

julia> @btime to_cartesian_reshaped_array($A, $(Ref(2))[])
  7.193 ns (0 allocations: 0 bytes)
CartesianIndex(2, 1, 1)
```

The

Because

The
I didn't observe any big performance difference here. The

```julia
julia> function sumcart(X)
           out = zero(eltype(X))
           R = CartesianIndices(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sumcart (generic function with 1 method)

julia> function sumcart_shaped(X)
           out = zero(eltype(X))
           R = ArrayInterface.ShapedIndex(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sumcart_shaped (generic function with 1 method)

julia> SX = @SArray(rand(4, 4));

julia> @btime sumcart($SX);
  3.828 ns (0 allocations: 0 bytes)

julia> @btime sumcart_shaped($SX);
  4.231 ns (0 allocations: 0 bytes)

julia> X = rand(64, 64);

julia> @btime sumcart($X);
  230.455 ns (0 allocations: 0 bytes)

julia> @btime sumcart_shaped($X);
  229.612 ns (0 allocations: 0 bytes)
```

Checked on
@timholy, is this what you wanted?

```julia
julia> @benchmark getindex(A, 2) setup=(A=reshape(CartesianIndices(axes(@SArray(zeros(2,2,2)))), :))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  4.439 ns … 44.816 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.220 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.202 ns ±  1.342 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂ ▃ ▃ ▂▁ ▃ ▁▃ ▅▇ ▆▂ ▂▃ ▃▅ ▄ ▂▆ ▂
  ██▁▁█▇▁▁█▁▁▁██▁▃▇█▁▁▁██▄▁▁███▁▃▄██▁▁▁██▆▁▁▁██▆▁▁▁▁█▇▁▁▁▁██ █
  4.44 ns      Histogram: log(frequency) by time       6.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark getindex(A, 2) setup=(A=ArrayInterface.ShapedIndex(@SArray(zeros(2,2,2))))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.040 ns … 0.112 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.046 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.046 ns ± 0.002 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ █ ▅
  ▂▁▁▁▁▁▁▂▁▁▁▁▁▁▃▁▁▁▁▁▁▅▁▁▁▁▁▁▆▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▃ ▂
  0.04 ns       Histogram: frequency by time      0.048 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Odd, I see

```julia
julia> X = rand(64, 64);

julia> @btime sumcart($X)
  5.362 μs (0 allocations: 0 bytes)
2044.2126606418335

julia> @btime sumcart_shaped($X)
  115.387 ns (0 allocations: 0 bytes)
2044.2126606418326
```

Using
Looks like it's affected by bounds checking. Sorry, I didn't mention that I opened Julia with
Ultimately we want to consider doing this before the loop, but I figured indexing collections would be a dedicated PR later (once I've got more of the other indexing stuff in place).
I can reproduce (both are equally fast) with
I get similar disparities in performance on an

```julia
julia> @btime sumcart_shaped($A)
  235.834 ns (0 allocations: 0 bytes)
2040.2091413638263

julia> @btime sumcart($A)
  10.060 μs (0 allocations: 0 bytes)
2040.2091413638286
```
I'm not sure how to compare assembly. No matter what
With StaticArray, LLVM essentially computes the
```julia
julia> @btime sumcart_reshape($X);
  105.498 μs (0 allocations: 0 bytes)

julia> @btime sumcart($X);
  160.048 μs (0 allocations: 0 bytes)
```

BUT it's different for tiny arrays (this was 200x300, dynamic not static since the compiler does the same thing as
It's gotten a lot better in recent CPUs, but is still bad: 64-bit div

So Zen3 CPUs and Intel Ice/Tiger/Rocket Lake (the newest from AMD and Intel, respectively) are both faster than reported in that chart (15-40), but these are still slow. Is there a good API on the signed multiplicative inverses?
I love that this summoned the benchmarking A-team. Just a quick example where this will be beneficial in the future. Obviously this isn't the most realistic example, but the point is that if we negate an index transform that induces cartesian indexing then we can drop the
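The nested-transform idea above can be sketched with Base tools alone (illustrative only, not the actual rule machinery being proposed):

```julia
# Linear -> cartesian followed by cartesian -> linear composes to the
# identity; this is the kind of cancellation a rule system over nested
# index transforms could detect and elide entirely.
sz = (3, 4)
to_cart(i::Int) = CartesianIndices(sz)[i]
to_lin(I::CartesianIndex) = LinearIndices(sz)[I]

@assert all(to_lin(to_cart(i)) == i for i in 1:prod(sz))
```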
They were added in JuliaLang/julia#15357, and the code is https://github.com/JuliaLang/julia/blob/master/base/multinverses.jl. That's pretty much all there is.
Examples like these are what made me initially add
I completely agree with this. The focus with subtypes of
Just to summarize: on my 2019 MacBook I get the following on 1.7 with bounds checking off.

```julia
julia> function sum_cartesian(X)
           out = zero(eltype(X))
           R = CartesianIndices(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_cartesian (generic function with 1 method)

julia> function sum_reshaped(X)
           out = zero(eltype(X))
           R = reshape(CartesianIndices(X), :)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_reshaped (generic function with 1 method)

julia> function sum_shaped(X)
           out = zero(eltype(X))
           R = ArrayInterface.ShapedIndex(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_shaped (generic function with 1 method)

julia> SA = @SArray(rand(4, 4));

julia> A = rand(64, 64);

julia> @btime sum_cartesian($A)
  10.780 μs (0 allocations: 0 bytes)
2050.1954548031663

julia> @btime sum_reshaped($A)
  6.534 μs (0 allocations: 0 bytes)
2050.1954548031663

julia> @btime sum_shaped($A)
  235.232 ns (0 allocations: 0 bytes)
2050.1954548031654

julia> @btime sum_cartesian($SA)
  8.824 ns (0 allocations: 0 bytes)
8.513627227211028

julia> @btime sum_reshaped($SA)
  21.038 ns (0 allocations: 0 bytes)
8.513627227211028

julia> @btime sum_shaped($SA)
  3.247 ns (0 allocations: 0 bytes)
8.51362722721103
```

Differences in performance of

It sounds like there are some benefits to
> R = CartesianIndices(X)[:]

That should be

```julia
R = reshape(CartesianIndices(X), :)
```

to avoid allocating a
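The distinction is observable directly (a small sketch): indexing with `[:]` materializes a `Vector{CartesianIndex{2}}`, while `reshape` returns a lazy wrapper over the same indices.

```julia
X = zeros(100, 100)
eager = CartesianIndices(X)[:]           # allocates 10_000 CartesianIndex{2} values
lazy  = reshape(CartesianIndices(X), :)  # lazy reshape; no copy of the indices

@assert eager isa Vector{CartesianIndex{2}}
@assert !(lazy isa Vector)
@assert eager[17] == lazy[17]            # same indices either way
```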
```julia
julia> @btime div($(Ref(22))[], Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int}($(Ref(7))[]))
  10.029 ns (0 allocations: 0 bytes)
3

julia> @btime div($(Ref(22))[], $(Ref(7))[])
  1.327 ns (0 allocations: 0 bytes)
3
```

This is on a Tiger Lake laptop. I am not sure why the value of 1.327 ns is so much lower than 10 cycles / (4.7 cycles/ns). Either way, it'd make a decent PR to LoopVectorization to use this for integer divisions because (a) there are no SIMD integer division instructions, and (b) loops are an obvious place where you can potentially pay the setup cost of calculating the inverse.

Results on Tiger Lake:

```julia
julia> function sum_cartesian(X)
           out = zero(eltype(X))
           R = CartesianIndices(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_cartesian (generic function with 1 method)

julia> function sum_reshaped(X)
           out = zero(eltype(X))
           R = reshape(CartesianIndices(X), :)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_reshaped (generic function with 1 method)

julia> function sum_shaped(X)
           out = zero(eltype(X))
           R = ArrayInterface.ShapedIndex(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_shaped (generic function with 1 method)

julia> SA = @SArray(rand(4, 4));

julia> A = rand(64, 64);

julia> @btime sum_cartesian($A)
  115.385 ns (0 allocations: 0 bytes)
2052.6116610922645

julia> @btime sum_reshaped($A)
  3.602 μs (0 allocations: 0 bytes)
2052.6116610922645

julia> @btime sum_shaped($A)
  115.385 ns (0 allocations: 0 bytes)
2052.6116610922645

julia> @btime sum_cartesian($SA)
  1.537 ns (0 allocations: 0 bytes)
7.653889092106472

julia> @btime sum_reshaped($SA)
  2.629 ns (0 allocations: 0 bytes)
7.653889092106472

julia> @btime sum_shaped($SA)
  1.537 ns (0 allocations: 0 bytes)
7.653889092106472
```

On an M1 Mac:

```julia
julia> SA = @SArray(rand(4, 4));

julia> A = rand(64, 64);

julia> @btime sum_cartesian($A)
  896.773 ns (0 allocations: 0 bytes)
2044.78863632186

julia> @btime sum_reshaped($A)
  2.630 μs (0 allocations: 0 bytes)
2044.7886363218586

julia> @btime sum_shaped($A)
  896.773 ns (0 allocations: 0 bytes)
2044.78863632186

julia> @btime sum_cartesian($SA)
  1.791 ns (0 allocations: 0 bytes)
9.404250521034948

julia> @btime sum_reshaped($SA)
  3.041 ns (0 allocations: 0 bytes)
9.404250521034948

julia> @btime sum_shaped($SA)
  1.791 ns (0 allocations: 0 bytes)
9.404250521034948

julia> versioninfo()
Julia Version 1.8.0-DEV.452
Commit dd8d3c7ac2* (2021-09-01 11:33 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.5.0)
```
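The point above about paying the inverse's setup cost once per loop can be sketched with Base's (non-exported, version-dependent) internals; the helper name is illustrative:

```julia
using Base.MultiplicativeInverses: SignedMultiplicativeInverse

# Hoist the multiplicative-inverse construction out of the loop, so every
# iteration replaces a hardware idiv with multiply + shift while keeping
# ordinary truncating `div` semantics.
function div_all(xs::AbstractVector{Int}, d::Int)
    inv = SignedMultiplicativeInverse{Int}(d)
    return [div(x, inv) for x in xs]
end

@assert div_all([10, 22, -9], 7) == [1, 3, -1]
```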
Sorry, I fixed it. |
K. I dropped
Are any other changes needed here? |
@chriselrod , I moved to a patch bump here so the bug fixes here can go forward. |
Sure, this is fine if you just want to get this in. Without `@propagate_inbounds` in the correct places in Base, the implementation without cartesian indices will probably often be faster in practice.

I'll look into making a PR to Base. If it turns out to be insurmountable, then we have a solid conversation here and documented effort that shows

EDIT: This should be resolved with JuliaLang/julia#42119
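For readers unfamiliar with it, `Base.@propagate_inbounds` is what lets a caller's `@inbounds` reach a wrapper's inner `getindex`. A minimal self-contained sketch (the `Wrap` type is hypothetical, just for illustration):

```julia
struct Wrap{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
end
Base.size(w::Wrap) = size(w.parent)

# Without @propagate_inbounds, a caller's @inbounds stops at this method
# boundary and the inner indexing still emits its own bounds check.
Base.@propagate_inbounds Base.getindex(w::Wrap, i::Int...) = w.parent[i...]

w = Wrap(collect(1:4))
@assert (@inbounds w[3]) == 3
```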
Primary goal here was to get the indexing pipeline closer to using `ArrayIndex` in the final steps (after index conversion/checking). Previously there was an awkward step after `unsafe_getindex` where linear -> cartesian (and vice versa) went back to `to_index`. Now `ShapedIndex` manages conversion to cartesian indices and `StrideIndex` is used for conversion to linear indexing (where the strides are computed using `size_to_strides`, instead of the internal memory representation).

Getting this working also involved the following changes unrelated to `ShapedIndex`:

- Previously an array that needed a unique method for `known_size(::Type{A})` also needed a unique one for `known_size(::Type{A}, dim)`. Now `known_size(::Type{A})` is called and then indexed, requiring only one new method.
- `StrideIndex` was still using the old definition of `offset1`