
Performance of bit shift #13

Closed
zsoerenm opened this issue Jan 25, 2020 · 14 comments · Fixed by #21
Labels
help wanted, performance

Comments

@zsoerenm

Hello, great package!

Is this kind of performance regression expected?

julia> @btime $(Ref(Int1024(2131)))[] << $(Ref(1))[]
  295.443 ns (0 allocations: 0 bytes)
4262

julia> @btime $(Ref(Int64(2131)))[] << $(Ref(1))[]
  2.256 ns (0 allocations: 0 bytes)
4262
@rfourquet (Owner)

Thanks for the report! I had not noticed this; the shift operations used in this package are generated directly by LLVM. There might be some room for tweaking to improve the situation, for "small" values in a "big" type, or even by converting to BigInt and back if that turns out to be more efficient.
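For anyone who wants to see what LLVM emits here, the generated IR and native code can be inspected directly. A minimal diagnostic snippet (for inspection only, not a fix), assuming BitIntegers and InteractiveUtils are available:

using BitIntegers, InteractiveUtils

# Compare the code generated for the wide shift with the native Int64 shift
code_llvm((<<), (Int1024, Int64))
code_native((<<), (Int1024, Int64))
code_native((<<), (Int64, Int64))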

@rfourquet added the help wanted and performance labels on Mar 8, 2020
@eschnett

Note that

@btime $(Ref(Int1024(2131)))[] << $(Ref(Unsigned(1)))[]

is much faster (but still not fast enough).
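A caller-side workaround along these lines is to normalize the shift count to an unsigned value before shifting, so LLVM does not also have to emit the opposite-direction shift. A sketch, not package code (the helper name is made up):

# Left shift with a signed count, routed through unsigned shift amounts
shl_via_unsigned(x, y::Int) = y >= 0 ? x << unsigned(y) : x >> unsigned(-y)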

@jessymilare (Contributor)

I've found at least part of the issue. When the shift argument is signed, LLVM computes both the right and the left shift and selects the result afterwards, which seems to hurt performance for integers of 256 bits and more.

I tried to change the ifelse to an if block and it didn't help. I don't know how to tell LLVM not to transform the if-then-else block into a select.

Other than that, there is another issue with shifting big integers. I believe some overhead is to be expected when shifting across many words. After all, at least 3 operations are needed per word (e.g. for a left shift, the upper portion of the first word has to be copied, shifted right, and ORed with the second word shifted left). And the algorithm can get quite complex if you shift by more bits than the word size.

However, the amount of overhead for 512-bit and 1024-bit integers seems insane (and I confess I know too little about assembly to understand what LLVM is doing).
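To make the per-word work concrete, here is a minimal sketch (not the package's implementation) of a left shift by 0 < s < 64 of a value stored as a little-endian vector of UInt64 words:

function shl_words(words::Vector{UInt64}, s::Int)
    out = similar(words)
    carry = zero(UInt64)
    for i in eachindex(words)
        out[i] = (words[i] << s) | carry   # shift this word and OR in the bits carried up from below
        carry = words[i] >> (64 - s)       # bits shifted out feed the next, more significant word
    end
    return out
end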

@jessymilare (Contributor)

That seems indeed to be the problem. When LLVM cannot tell how large the shift is, it emits code for several shift lengths (small shift, more than one word, more than two words, ...). For instance, if the second argument is a UInt8 instead of a UInt, the time for shifting a 1024-bit integer is divided by two. If the shift length is constant, the time is again divided by 5.

This sheds some light on how to handle the problem. I'll try to make some tests here and see if I can come up with a good solution.

@zsoerenm (Author)

Great!
You are right:

julia> @btime $(Ref(Int1024(2131)))[] << $(Ref(1))[]
  276.963 ns (0 allocations: 0 bytes)
4262

julia> @btime $(Ref(Int1024(2131)))[] << $(Ref(Unsigned(1)))[]
  131.362 ns (0 allocations: 0 bytes)
4262

julia> @btime $(Ref(Int1024(2131)))[] << $(Ref(UInt8(1)))[]
  51.531 ns (0 allocations: 0 bytes)
4262

julia> @btime $(Ref(Int1024(2131)))[] << $(Ref(true))[]
  10.483 ns (0 allocations: 0 bytes)
4262

I didn't know that you can shift by a bool!

If the shift length is constant, the time is again divided by 5.

What do you mean by this?

@jessymilare (Contributor)

jessymilare commented Jun 26, 2021

What do you mean by this?

julia> x = rand(Int1024);

julia> shift_64(x) = x << 64
shift_64 (generic function with 1 method)

julia> @benchmark x << 64
BenchmarkTools.Trial: 
  memory estimate:  144 bytes
  allocs estimate:  1
  --------------
  minimum time:     240.860 ns (0.00% GC)
  median time:      244.392 ns (0.00% GC)
  mean time:        253.751 ns (0.48% GC)
  maximum time:     16.503 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     401

julia> @benchmark shift_64(x)
BenchmarkTools.Trial: 
  memory estimate:  144 bytes
  allocs estimate:  1
  --------------
  minimum time:     18.505 ns (0.00% GC)
  median time:      19.281 ns (0.00% GC)
  mean time:        21.600 ns (4.76% GC)
  maximum time:     558.834 ns (88.22% GC)
  --------------
  samples:          10000
  evals/sample:     997

The compiler optimizes shift_64 for shifting 64 bits, since it knows the value at compile time.

@jessymilare (Contributor)

jessymilare commented Jun 26, 2021

It seems the shift code generated by LLVM was designed and optimized for integers of up to 128 bits, in which case emitting generic code and selecting the correct result (i.e. with ifelse) is faster than conditional jumps (i.e. if-then-else blocks).

I don't know of any way to tell LLVM "this integer is between 0 and 63" (except, of course, with code like y == 63 && return x << 63; y == 62 && return x << 62; ...). If that were possible, the shift could be optimized into two passes: a shift by a multiple of 64 and another shift by fewer than 64 bits. But that requires writing the shift code by hand, which would probably keep LLVM from transforming the conditional jumps into selects; given the size of the integers in BitIntegers, the jumps would actually be faster.
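To make the two-pass idea concrete, here is a sketch using Base.shl_int (the same intrinsic call used elsewhere in this thread); it only illustrates the decomposition, and the actual benefit would come from specialized code for each pass:

using BitIntegers

function shl_two_pass(x, y::Int)         # assumes y >= 0
    x = Base.shl_int(x, (y >> 6) << 6)   # pass 1: shift by y ÷ 64 whole 64-bit words
    return Base.shl_int(x, y & 63)       # pass 2: shift by the remaining y % 64 bits
end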

@jessymilare (Contributor)

Ok, I think I found a solution and it seems to work.

What I did is shift in steps: one function, shift_small, handles 0 <= y < 64 (shifts of less than one word) and dispatches in a switch-case-like fashion; another handles larger values, shifting by a fixed amount and decreasing y as the shifts are made (sketched at the end of this comment).

That way it may take several shifts for larger values of y (up to 6 shifts when x has 1024 bits), but it is still much faster than the current code in all tests I've made (a 5 to 7× speedup for several values of y with a 1024-bit x). The generated assembly code is also much smaller than before.

I'll make tests to see if it is working properly and, if everything is ok, I'll send a pull request.
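A rough sketch of the stepped idea (illustrative only: shift_small below is a stand-in built on Base.shl_int, and the fork's real helper and step size may differ):

# Stand-in for the sub-word helper; the masking keeps the count in 0:63
shift_small(x, s::Int) = Base.shl_int(x, s & 63)

function shl_stepped(x, y::Int)    # assumes y >= 0
    while y > 0
        s = min(y, 63)
        x = shift_small(x, s)      # shift by at most one word's worth of bits
        y -= s
    end
    return x
end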

@jessymilare (Contributor)

jessymilare commented Jun 27, 2021

I have run tests and the choices in my fork seem to be the best combination. These are the timings I'm getting with it:

julia> x1024 = rand(Int1024, 1000);

julia> x512 = rand(Int512, 1000);

julia> x256 = rand(Int256, 1000);

julia> x128 = rand(Int128, 1000);

julia> s1 = rand(-1024:1024, 1000);

julia> s2 = rand(-256:256, 1000);

julia> s3 = rand(-63:63, 1000);

julia> @btime x1024 .>> s1;
  32.723 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .>> s2;
  32.740 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .>> s3;
  24.826 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .<< s1;
  32.722 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .<< s2;
  32.798 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .<< s3;
  24.792 μs (4 allocations: 125.14 KiB)

julia> @btime x512 .>> s1;
  11.259 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .>> s2;
  15.961 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .>> s3;
  12.987 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .<< s1;
  11.000 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .<< s2;
  15.720 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .<< s3;
  12.974 μs (4 allocations: 62.64 KiB)

julia> @btime x256 .>> s1;
  6.776 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .>> s2;
  9.410 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .>> s3;
  9.315 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .<< s1;
  6.755 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .<< s2;
  9.332 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .<< s3;
  9.279 μs (4 allocations: 31.39 KiB)

julia> @btime x128 .>> s3;
  3.287 μs (3 allocations: 15.81 KiB)

julia> @btime x128 .<< s3;
  2.831 μs (3 allocations: 15.81 KiB)

I believe this performance is reasonable, considering that the shift becomes more complex as the length of x increases.

@jessymilare (Contributor)

One issue with that code is that the compiler cannot optimize shifting with constant values:

julia> shift1(x) = x << 64
shift1 (generic function with 1 method)

julia> shift2(x) = Base.shl_int(x, 64)
shift2 (generic function with 1 method)

julia> @benchmark shift1.(x1024)
BenchmarkTools.Trial: 
  memory estimate:  125.14 KiB
  allocs estimate:  5
  --------------
  minimum time:     24.059 μs (0.00% GC)
  median time:      25.805 μs (0.00% GC)
  mean time:        27.246 μs (1.73% GC)
  maximum time:     361.932 μs (81.99% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark shift2.(x1024)
BenchmarkTools.Trial: 
  memory estimate:  125.14 KiB
  allocs estimate:  5
  --------------
  minimum time:     6.433 μs (0.00% GC)
  median time:      9.244 μs (0.00% GC)
  mean time:        10.346 μs (4.69% GC)
  maximum time:     118.110 μs (79.93% GC)
  --------------
  samples:          10000
  evals/sample:     3

The only resolution I can think of is to allow someone to write x << Val(64) to tell the shift function that the second argument is a compile-time constant (like what is done with ntuple).
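A hypothetical sketch of that Val(64) idea (this method is not part of BitIntegers; it reuses the Base.shl_int call shown above):

using BitIntegers

# Hypothetical: dispatching on Val{N} makes the shift amount a compile-time
# constant inside the method body, so the intrinsic sees a literal count.
Base.:(<<)(x::Int1024, ::Val{N}) where {N} = Base.shl_int(x, N)

# usage (hypothetical): Int1024(2131) << Val(64)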

@eschnett

One could introduce the same mechanism as for ^: x^n for literal n are treated specially by the parser for performance reasons (see Base.literal_pow).

I would argue that ^, <<, and >> have something in common: these functions are not symmetric in the types they accept (in the sense that +, -, *, etc. are symmetric), the second argument is often a literal, and knowing that literal can lead to much better code.

Of course, an even better method would be to mark the function as "please inline if the second argument is a known integer", but Julia's JIT infrastructure doesn't have such tags...
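For reference, the existing hook works like this: the parser rewrites a call with a literal exponent into Base.literal_pow, which methods can then specialize on. The literal_shift mentioned in the next comment would need a similar lowering and does not exist today:

# x^3 is lowered to Base.literal_pow(^, x, Val(3)); the rewrite is visible with:
Meta.@lower x^3

# A hypothetical analogue for shifts would lower
#     x << 64
# to something like literal_shift(x, Val(64)).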

@jessymilare (Contributor)

Of course, an even better method would be to mark the function as "please inline if the second argument is a known integer", but Julia's JIT infrastructure doesn't have such tags...

In the case of a constant second argument, the @generated function would simply return a call to the LLVM intrinsic (inlining everything is still slower than that). A specialized lowering to literal_shift could do that.

@jessymilare (Contributor)

jessymilare commented Jun 27, 2021

I realized that the compiler can understand an expression like shift(x, y & mask) (where mask is a small compile-time constant) and emit optimized code for it. This allowed me to simplify the code a lot (no need for the switch-case-like block for small y values), and the performance has improved.
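A minimal illustration of that masking trick, reusing the Base.shl_int call from earlier in the thread (this is only the sub-word part of a shift, not a complete implementation):

using BitIntegers

# Because y & 63 is provably in 0:63, LLVM can drop the code paths that
# handle larger shift counts when compiling this:
shl_subword(x::Int1024, y::Int) = Base.shl_int(x, y & 63)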

Still not as fast as providing a constant value as the second argument to an intrinsic LLVM function, but it got closer:

julia> @btime x1024 .<< s1;
  23.981 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .<< s2;
  25.547 μs (4 allocations: 125.14 KiB)

julia> @btime x1024 .<< s3;
  23.462 μs (4 allocations: 125.14 KiB)

julia> @btime x512 .<< s1;
  11.441 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .<< s2;
  14.012 μs (4 allocations: 62.64 KiB)

julia> @btime x512 .<< s3;
  11.373 μs (4 allocations: 62.64 KiB)

julia> @btime x256 .<< s1;
  2.679 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .<< s2;
  4.672 μs (4 allocations: 31.39 KiB)

julia> @btime x256 .<< s3;
  3.387 μs (4 allocations: 31.39 KiB)

@jessymilare (Contributor)

jessymilare commented Jun 28, 2021

A final note about performance.

I noted that inlining all the shift functions makes them almost as fast as the intrinsic shift with a constant y (the second argument), and the performance is indeed identical when y is constant. However, that makes the generic code emitted for shifting 1024-bit integers too big. So I added some inline declarations, which make shifting faster in all cases and let the compiler inline the code for 256-bit and 512-bit integers (by its own decision), while shifting 1024-bit (or larger) integers is still not inlined. I think that is the best trade-off.

In short, the only case where calling the LLVM intrinsic with a compile-time constant y is faster than using <<, >> and >>> is when x is 1024 bits or larger.
