ShapedIndex #199

Merged (6 commits, Sep 3, 2021)
Conversation

Tokazama
Member

@Tokazama Tokazama commented Sep 1, 2021

The primary goal here was to get the indexing pipeline closer to using ArrayIndex in the final steps (after index conversion/checking). Previously there was an awkward step after unsafe_getindex where linear -> cartesian conversion (and vice versa) went back through to_index. Now ShapedIndex manages conversion to cartesian indices, and StrideIndex is used for conversion to linear indexing (where the strides are computed using size_to_strides instead of from the internal memory representation). A rough sketch of these two conversions follows.
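A minimal sketch of the two conversions just described, assuming a contiguous column-major layout. size_to_strides exists in ArrayInterface, but this version of it (and lin2cart/cart2lin) is a simplified illustration rather than the package's actual implementation:

size_to_strides(sz::NTuple{N,Int}) where {N} = ntuple(d -> prod(sz[1:d-1]; init=1), N)

# linear -> cartesian: roughly what ShapedIndex manages
function lin2cart(sz::NTuple{N,Int}, i::Int) where {N}
    s = size_to_strides(sz)
    ntuple(d -> 1 + rem(div(i - 1, s[d]), sz[d]), N)
end

# cartesian -> linear: roughly what the composed StrideIndex computes
function cart2lin(sz::NTuple{N,Int}, I::NTuple{N,Int}) where {N}
    s = size_to_strides(sz)
    1 + sum(ntuple(d -> (I[d] - 1) * s[d], N))
end

lin2cart((2, 2, 2), 2)          # (2, 1, 1)
cart2lin((2, 2, 2), (2, 1, 1))  # 2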

Getting this working also involved the following changes unrelated to ShapedIndex:

  • Previously an array that needed a unique method for known_size(::Type{A}) also needed a unique method for known_size(::Type{A}, dim). Now known_size(::Type{A}) is called and then indexed, requiring only one new method.
  • Indexing StrideIndex was still using the old definition of offset1.

Helps with conversion of `Int` -> CartesianIndex.
This is now used within the indexing pipeline instead of jumping back
into `to_index`.
The `CartesianIndex` -> `Int` conversion is managed by composing a
`StrideIndex`, where the strides are computed using `size_to_strides`
instead of the internal memory representation.
Previously an array that needed a unique method for `known_size` also
needed a unique one for `known_size(::Type{A}, dim)`. Now `known_size`
is called and then indexed, requiring only one new method.
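A hedged sketch of the known_size simplification just described; the signatures are illustrative rather than verbatim from ArrayInterface:

# Array types overload only the whole-shape query...
known_size(::Type{A}) where {A<:AbstractArray} = ntuple(_ -> nothing, ndims(A))

# ...and the per-dimension query is derived from it by indexing.
function known_size(::Type{A}, dim::Int) where {A<:AbstractArray}
    # assumes trailing dimensions beyond ndims(A) are statically 1
    dim > ndims(A) ? 1 : known_size(A)[dim]
end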
@codecov

codecov bot commented Sep 1, 2021

Codecov Report

Merging #199 (61bf33a) into master (f8d1182) will decrease coverage by 0.25%.
The diff coverage is 72.41%.


@@            Coverage Diff             @@
##           master     #199      +/-   ##
==========================================
- Coverage   85.08%   84.82%   -0.26%     
==========================================
  Files          11       11              
  Lines        1669     1667       -2     
==========================================
- Hits         1420     1414       -6     
- Misses        249      253       +4     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/indexing.jl | 74.89% <69.23%> (-1.78%) ⬇️ |
| src/array_index.jl | 98.13% <100.00%> (ø) |
| src/size.jl | 83.72% <100.00%> (+0.38%) ⬆️ |
| src/stridelayout.jl | 87.82% <100.00%> (+0.28%) ⬆️ |
| src/ArrayInterface.jl | 84.72% <0.00%> (-0.28%) ⬇️ |
| src/ranges.jl | 93.24% <0.00%> (-0.03%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Tokazama Tokazama requested a review from chriselrod September 1, 2021 11:42
@Tokazama
Member Author

Tokazama commented Sep 1, 2021

It looks like the change in test coverage is mostly due to the deletion of code that was previously being tested but is now redundant. It will probably work better for me to address that in conjunction with my next PR.

@timholy
Member

timholy commented Sep 1, 2021

How about just

ci = reshape(CartesianIndices(A), :)
ci[i]   # gives the CartesianIndex corresponding to i::Int

That's approximately as fast as a single div and you get all of them.

@chriselrod
Collaborator

chriselrod commented Sep 1, 2021

_lin2sub uses div, so it won't be faster than a div instruction; it isn't doing any magic to be faster.
More code is worse for compile/load times.

Any reason not to implement it using reshape(CartesianIndices(A), :)?

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

The most obvious benefit here is the ability to incorporate StaticInt, which gives a very small speed boost.

and you get all of them

In this case we don't want a representation that will stick around and represent an entire array. The long game here is that it provides a minimalistic representation of an index transform that we can use in the future when creating rules for combining nested index transforms.

@timholy
Member

timholy commented Sep 1, 2021

In this case we don't want a representation that will stick around and represent an entire array.

julia> sizeof(ci)
48

julia> ci.
dims   mi      parent
julia> ci.dims
(2000000,)

julia> ci.mi
(Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}(1000, 2361183241434822607, 0, 0x07),)

julia> cii = ci.parent;

julia> typeof(cii)
CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}

julia> cii.indices
(Base.OneTo(1000), Base.OneTo(2000))

There's "nothing" there. What you really want are those SignedMultiplicativeInverses.

And it's even better than I suggested, as much of the time in ci[i] is bounds-checking. Intrinsically it's faster than a single div.

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

I'm fine with working in reshape(CartesianIndices(A), :), but I'm not sure where performance will benefit from that approach. Maybe I'm just benchmarking this incorrectly.

julia> to_cartesian_reshaped_array(x, i) = @inbounds(reshape(CartesianIndices(axes(x)), :)[i]);

julia> to_cartesian_shaped_index(x, i) = ArrayInterface.ShapedIndex(x)[i];

julia> A = @SArray(zeros(2,2,2));

julia> @btime to_cartesian_reshaped_array($A, 2)
  9.127 ns (0 allocations: 0 bytes)
CartesianIndex(2, 1, 1)

julia> @btime to_cartesian_shaped_index($A, 2)
  0.042 ns (0 allocations: 0 bytes)
NDIndex(2, 1, 1)

@timholy
Member

timholy commented Sep 1, 2021

Anytime you get < 1 clock cycle (≈1/3ns) you know the compiler is fooling you. Use setup. Also, can you not precalculate the reshaped array? (i.e., hoist it out of a loop.) It invests time at the outset to make it all much faster later.

@chriselrod
Collaborator

chriselrod commented Sep 1, 2021

I get

julia> @btime to_cartesian_shaped_index($A, $(Ref(2))[])
  1.644 ns (0 allocations: 0 bytes)
NDIndex(2, 1, 1)

julia> @btime to_cartesian_reshaped_array($A, $(Ref(2))[])
  7.193 ns (0 allocations: 0 bytes)
CartesianIndex(2, 1, 1)

The @code_typed and @code_native are much nicer for to_cartesian_shaped_index.
FWIW, uiCA says to_cartesian_shaped_index would take 6.33 clock cycles on average on my CPU (TigerLake) in unrolled code. If the CPU were running at 4 GHz, it could achieve a 1.644 ns time (maximum boost is 4.7 GHz).
Obviously <0.1 ns is impossible (a clock cycle at 4 GHz takes 0.25 ns), but using $(Ref(2))[] above seemed to give a reasonable time.

Because A is a StaticArray, the compiler will of course normally fully specialize on it.

The to_cartesian_reshaped_array has a lot of branches, which aren't supported by uiCA, so I can't get a similar estimate there.
Despite the @inbounds, there are three bounds errors plus one InexactError being called in the assembly (behind branches).

@johnnychen94
Member

johnnychen94 commented Sep 1, 2021

I didn't observe any big performance difference here. The reshape(CartesianIndices(X), :) is actually unnecessary because it will use linear indexing here.

julia> function sumcart(X)
           out = zero(eltype(X))
           R = CartesianIndices(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sumcart (generic function with 1 method)

julia> function sumcart_shaped(X)
           out = zero(eltype(X))
           R = ArrayInterface.ShapedIndex(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sumcart_shaped (generic function with 1 method)

julia> SX = @SArray(rand(4, 4));

julia> @btime sumcart($SX);
  3.828 ns (0 allocations: 0 bytes)

julia> @btime sumcart_shaped($SX);
  4.231 ns (0 allocations: 0 bytes)

julia> X = rand(64, 64);

julia> @btime sumcart($X);
  230.455 ns (0 allocations: 0 bytes)

julia> @btime sumcart_shaped($X);
  229.612 ns (0 allocations: 0 bytes)

Checked on julia-1.6 --check-bounds=no.

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

@timholy, is this what you wanted?

julia> @benchmark getindex(A, 2) setup=(A=reshape(CartesianIndices(axes(@SArray(zeros(2,2,2)))), :))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  4.439 ns … 44.816 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.220 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.202 ns ±  1.342 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂  ▃   ▃   ▂▁   ▃   ▁▃   ▅▇    ▆▂   ▂▃    ▃▅     ▄     ▂▆ ▂
  ██▁▁█▇▁▁█▁▁▁██▁▃▇█▁▁▁██▄▁▁███▁▃▄██▁▁▁██▆▁▁▁██▆▁▁▁▁█▇▁▁▁▁██ █
  4.44 ns      Histogram: log(frequency) by time      6.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark getindex(A, 2) setup=(A=ArrayInterface.ShapedIndex(@SArray(zeros(2,2,2))))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.040 ns … 0.112 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.046 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.046 ns ± 0.002 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                     ▂      █      ▅
  ▂▁▁▁▁▁▁▂▁▁▁▁▁▁▃▁▁▁▁▁▁▅▁▁▁▁▁▁▆▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▃ ▂
  0.04 ns        Histogram: frequency by time      0.048 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@chriselrod
Collaborator

chriselrod commented Sep 1, 2021

Odd, I see

julia> X = rand(64, 64);

julia> @btime sumcart($X)
  5.362 μs (0 allocations: 0 bytes)
2044.2126606418335

julia> @btime sumcart_shaped($X)
  115.387 ns (0 allocations: 0 bytes)
2044.2126606418326

sumcart_shaped is SIMD, while sumcart is not.

Using A = @SArray(zeros(2,2,2));, the assembly is exactly the same for both of them.

@chriselrod
Collaborator

is this what you wanted?

The 2 is what it isn't allowed to specialize on.
It only cares about the type of A::StaticArray, so randomly creating different ones shouldn't change the outcome.

@johnnychen94
Member

Odd, I see

Looks like it's messed up by bounds checking. Sorry, I didn't mention that I opened julia with --check-bounds=no.

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

Also, can you not precalculate the reshaped array? (i.e., hoist it out of a loop.) It invests time at the outset to make it all much faster later.

Ultimately we want to consider doing this before the loop, but I figured indexing collections would be a dedicated PR later (once I've got more of the other indexing stuff in place).

@chriselrod
Collaborator

chriselrod commented Sep 1, 2021

Odd, I see

Looks like it's messed up by bounds checking. Sorry, I didn't mention that I opened julia with --check-bounds=no.

I can reproduce (both are equally fast) with --check-bounds=no. Assembly also looks identical.
Seems like it's missing a few Base.@propagate_inbounds.
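For context, a hedged sketch of the kind of fix that refers to, with MyWrapper as a made-up type. Without Base.@propagate_inbounds, an @inbounds at the call site stops at the method boundary and the inner getindex still checks bounds:

struct MyWrapper{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
end
Base.size(A::MyWrapper) = size(A.parent)
Base.IndexStyle(::Type{<:MyWrapper}) = IndexLinear()
Base.@propagate_inbounds function Base.getindex(A::MyWrapper, i::Int)
    A.parent[i]  # inherits the caller's @inbounds, so the inner check can be elided
end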

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

I get similar disparities in performance on an Array:

julia> @btime sumcart_shaped($A)
  235.834 ns (0 allocations: 0 bytes)
2040.2091413638263

julia> @btime sumcart($A)
  10.060 μs (0 allocations: 0 bytes)
2040.2091413638286

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

I'm not sure how to compare assembly. No matter what @code_* I use, it's just too many lines to eyeball a comparison for me.

@timholy
Member

timholy commented Sep 1, 2021

With StaticArray, LLVM essentially computes the SignedMultiplicativeInverse for you (at compile time, so it is cost-free). With a dynamic one, it cannot. https://en.wikipedia.org/wiki/Montgomery_modular_multiplication. So you can use the scheme you're proposing for Static arrays, but div is pretty much a disaster on some CPUs when you actually have to do integer division: http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ (it's almost as bad as a function call). That said, that might be outdated, I have the sense that recent CPUs are better at it. (@chriselrod?)
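To make that concrete, a small example with Base's SignedMultiplicativeInverse (the values here are arbitrary): the inverse is computed once up front, after which each division is a multiply-and-shift rather than a hardware divide.

using Base.MultiplicativeInverses: SignedMultiplicativeInverse

d  = 1000
mi = SignedMultiplicativeInverse{Int}(d)  # setup cost paid once
div(2_000_000, mi) == div(2_000_000, d)   # true, without a div instruction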

@timholy
Member

timholy commented Sep 1, 2021

@johnnychen94,

I didn't observe any big performance difference here

julia> @btime sumcart_reshape($X);
  105.498 μs (0 allocations: 0 bytes)

julia> @btime sumcart($X);
  160.048 μs (0 allocations: 0 bytes)

BUT it's different for tiny arrays (this was 200x300, dynamic not static since the compiler does the same thing as reshape for static). And it's quite possible that the @simd may have trouble reasoning about the integers...

@chriselrod
Collaborator

chriselrod commented Sep 1, 2021

That said, that might be outdated, I have the sense that recent CPUs are better at it. (@chriselrod?)

It's gotten a lot better in recent CPUs, but is still bad. Cycles per 64-bit div:

| Zen(+/2) | Zen3 | Intel Skylake and older | Ice/Tiger/Rocket Lake |
| --- | --- | --- | --- |
| 14 | 7 | 21 | 10 |

So Zen3 CPUs and Intel Ice/Tiger/Rocket Lake (newest from AMD and Intel respectively) are both faster than reported in that chart (15-40), but these are still slow.
Every 7 clock cycles, Zen3 can do 14x 256 bit FMAs.

Is there a good api on the signed multiplicative inverses?
At one point, I was considering adding that as an optimization to LoopVectorization (IIRC, LLVM won't SIMD the division when performing the optimization).

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

I love that this summoned the benchmarking A-team.

Just a quick example where this will be beneficial in the future.
In a future PR I'm going to add another ArrayIndex subtype that accounts for dimension permutations.
If we have PermutedDimsArray(PermutedDimsArray(A, (3,1,2)), (2, 3, 1)) then they negate each other, but a linear index would still undergo an initial transformation to a cartesian index.
If we flatten out the index transformations into simple representations and combine them, then we don't return something that uses ShapedIndex and can avoid related method calls at run time.

Obviously this isn't the most realistic example, but the point is that if we negate an index transform that induces cartesian indexing then we can drop the ShapedIndex entirely.
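A toy version of such a combination rule (all names here are hypothetical, not part of ArrayInterface). Composing an inner and an outer permutation transform whose permutations cancel yields nothing, signalling that the cartesian-index step can be dropped:

struct PermIndex{P} end  # stand-in for a permuting ArrayIndex subtype

# first argument is the inner transform, second the outer one
function compose(::PermIndex{P1}, ::PermIndex{P2}) where {P1,P2}
    p = map(i -> P1[i], P2)  # composite permutation
    p === ntuple(identity, length(p)) ? nothing : PermIndex{p}()
end

compose(PermIndex{(3, 1, 2)}(), PermIndex{(2, 3, 1)}())  # nothing: they cancel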

@timholy
Member

timholy commented Sep 1, 2021

Is there a good api on the signed multiplicative inverses?

They were added in JuliaLang/julia#15357, and the code is https://github.com/JuliaLang/julia/blob/master/base/multinverses.jl. That's pretty much all there is.

@timholy
Member

timholy commented Sep 1, 2021

PermutedDimsArray(PermutedDimsArray(A, (3,1,2)), (2, 3, 1))

Examples like these are what made me initially add permuteddimsview to ImageCore. IMO PermutedDimsArray is a constructor and should build one, but permuteddimsview is a function and can do whatever it wants. Obviously that's (mostly) just a convention, but I've found it becomes important for inferrability in cases of recursion.
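A sketch of that convention with a made-up helper name (permview standing in for something like permuteddimsview). The constructor always wraps, while the function is free to simplify, including unwrapping a nested permutation; for real inferrability the permutation would need to be static, but the shape of the idea is:

# identity permutation: just return the array itself
permview(A::AbstractArray, perm) =
    perm == ntuple(identity, ndims(A)) ? A : PermutedDimsArray(A, perm)

# nested permutation: fuse into a single (possibly identity) permutation
permview(A::PermutedDimsArray{T,N,P}, perm) where {T,N,P} =
    permview(parent(A), ntuple(d -> P[perm[d]], N))

B = permview(permview(rand(2, 3, 4), (3, 1, 2)), (2, 3, 1));
B isa Array  # true: the permutations cancelled, no wrapper remains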

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

IMO PermutedDimsArray is a constructor and should build one, but permuteddimsview is a function and can do whatever it wants.

I completely agree with this. The focus with subtypes of ArrayIndex is to have a set of concise, flexible, and composable index transformations. Some of them may only require a buffer to compose a complete array, but others will require additional context (for example, any strided array could be represented by a buffer, tuple of sizes, and a StrideIndex). So if we do end up using these to construct arrays (a la JuliaLang/julia#18161), we'll need to figure out how an ArrayIndex interacts with other arguments to construct a valid array.

@Tokazama
Member Author

Tokazama commented Sep 1, 2021

Just to summarize:

On my 2019 MacBook I get the following on 1.7 with bounds checking off.

julia> function sum_cartesian(X)
           out = zero(eltype(X))
           R = CartesianIndices(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_cartesian (generic function with 1 method)

julia> function sum_reshaped(X)
           out = zero(eltype(X))
           R = reshape(CartesianIndices(X), :)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_reshaped (generic function with 1 method)

julia> function sum_shaped(X)
           out = zero(eltype(X))
           R = ArrayInterface.ShapedIndex(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_shaped (generic function with 1 method)

julia> SA = @SArray(rand(4, 4));

julia> A=rand(64, 64);

julia> @btime sum_cartesian($A)
  10.780 μs (0 allocations: 0 bytes)
2050.1954548031663

julia> @btime sum_reshaped($A)
  6.534 μs (0 allocations: 0 bytes)
2050.1954548031663

julia> @btime sum_shaped($A)
  235.232 ns (0 allocations: 0 bytes)
2050.1954548031654

julia> @btime sum_cartesian($SA)
  8.824 ns (0 allocations: 0 bytes)
8.513627227211028

julia> @btime sum_reshaped($SA)
  21.038 ns (0 allocations: 0 bytes)
8.513627227211028

julia> @btime sum_shaped($SA)
  3.247 ns (0 allocations: 0 bytes)
8.51362722721103

Differences in performance of ShapedIndex and CartesianIndices are marginal once bounds checking is off.
I think there is some merit in having ShapedIndex be a subtype of ArrayIndex and slightly simpler than CartesianIndices, but if we get the same performance by putting a PR into Base for the missing @propagate_inbounds, then we could just define a constant that uses OptionallyStaticUnitRange to ensure we have essentially the same thing when dispatching.

It sounds like there are some benefits to SignedMultiplicativeInverse, but they are going to require optimizing with hardware in mind.
This makes me think something like LoopVectorization should be managing it.
If we need to define a simple method or type that's commonly available, that might be worth considering, but I don't have any ideas at the moment on the best way to implement it, so perhaps it's best to approach that in another PR?

@chriselrod
Collaborator

chriselrod commented Sep 1, 2021

R = CartesianIndices(X)[:]

That should be

R = reshape(CartesianIndices(X), :)

to avoid allocating a Vector.

Base.MultiplicativeInverses looks simple/straightforward to use.

julia> @btime div($(Ref(22))[], Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int}($(Ref(7))[]))
  10.029 ns (0 allocations: 0 bytes)
3

julia> @btime div($(Ref(22))[], $(Ref(7))[])
  1.327 ns (0 allocations: 0 bytes)
3

This is on a Tiger Lake laptop. I am not sure why the value of 1.327 ns is so much lower than 10 cycles / (4.7 cycles/ns).

Either way, it'd make a decent PR to LoopVectorization to use this for integer divisions because (a) there are no SIMD integer division instructions, and (b) loops are an obvious place where you can potentially pay the setup cost of calculating the inverse.

Results on Tiger Lake:

julia> function sum_cartesian(X)
           out = zero(eltype(X))
           R = CartesianIndices(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_cartesian (generic function with 1 method)

julia> function sum_reshaped(X)
           out = zero(eltype(X))
           R = reshape(CartesianIndices(X), :)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_reshaped (generic function with 1 method)

julia> function sum_shaped(X)
           out = zero(eltype(X))
           R = ArrayInterface.ShapedIndex(X)
           @inbounds @simd for i in eachindex(X)
               out += X[R[i]]
           end
           return out
       end
sum_shaped (generic function with 1 method)

julia> SA = @SArray(rand(4, 4));

julia> A=rand(64, 64);

julia> @btime sum_cartesian($A)
  115.385 ns (0 allocations: 0 bytes)
2052.6116610922645

julia> @btime sum_reshaped($A)
  3.602 μs (0 allocations: 0 bytes)
2052.6116610922645

julia> @btime sum_shaped($A)
  115.385 ns (0 allocations: 0 bytes)
2052.6116610922645

julia> @btime sum_cartesian($SA)
  1.537 ns (0 allocations: 0 bytes)
7.653889092106472

julia> @btime sum_reshaped($SA)
  2.629 ns (0 allocations: 0 bytes)
7.653889092106472

julia> @btime sum_shaped($SA)
  1.537 ns (0 allocations: 0 bytes)
7.653889092106472

On an M1 Mac:

julia> SA = @SArray(rand(4, 4));

julia> A=rand(64, 64);

julia> @btime sum_cartesian($A)
  896.773 ns (0 allocations: 0 bytes)
2044.78863632186

julia> @btime sum_reshaped($A)
  2.630 μs (0 allocations: 0 bytes)
2044.7886363218586

julia> @btime sum_shaped($A)
  896.773 ns (0 allocations: 0 bytes)
2044.78863632186

julia> @btime sum_cartesian($SA)
  1.791 ns (0 allocations: 0 bytes)
9.404250521034948

julia> @btime sum_reshaped($SA)
  3.041 ns (0 allocations: 0 bytes)
9.404250521034948

julia> @btime sum_shaped($SA)
  1.791 ns (0 allocations: 0 bytes)
9.404250521034948

julia> versioninfo()
Julia Version 1.8.0-DEV.452
Commit dd8d3c7ac2* (2021-09-01 11:33 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.5.0)

@Tokazama
Member Author

Tokazama commented Sep 2, 2021

Sorry, I fixed it.

@Tokazama
Member Author

Tokazama commented Sep 2, 2021

K. I dropped ShapedIndex in favor of CartesianIndices composed through a call to ArrayInterface.indices.

@Tokazama
Member Author

Tokazama commented Sep 2, 2021

Are any other changes needed here?

@Tokazama
Member Author

Tokazama commented Sep 3, 2021

@chriselrod, I moved to a patch bump here so the bug fixes can go forward.

Collaborator

@chriselrod chriselrod left a comment

Sure, this is fine if you just want to get this in. Without @propagate_inbounds in the correct places in Base, the implementation without cartesian indices will probably often be faster in practice.

@Tokazama
Member Author

Tokazama commented Sep 3, 2021

Without @propagate_inbounds in the correct places in Base, the implementation without cartesian indices will probably often be faster in practice.

I'll look into making a PR to base. If it turns out to be insurmountable, then we have a solid conversation here and documented effort that shows ShapedIndex wouldn't be a redundant addition.

EDIT: This should be resolved with JuliaLang/julia#42119
