-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
speed up logical indexing by bitarray #29746
Conversation
1f81b7b
to
17fd485
Compare
So if anyone wants to benchmark alternatives on other architectures:
and
Ultimately, this will probably be a tradeoff between which architectures get the better code, as well as whether dense or sparse selection is optimized. The version in the PR has the advantage of being a strict improvement over the old code (on my machine). I have no idea what is fast on ARM, and Agner Fog's table suggests that |
@mbauman Can you trigger a nanosoldier run? I think the travis-ci fail is travis' problem, not this PR's problem. Unless there are change requests, I'm done with this PR, which supersedes #29660. The PS: Sorry that this comment showed up multiple times. No idea why github or my browser messed this up. |
This is awesome, thank you! @nanosoldier |
base/multidimensional.jl
Outdated
Bc = L.mask.chunks | ||
i1, c = s | ||
while c==0 | ||
i1 == length(Bc) && return nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of @propagate_inbounds
, it'd be a bit nicer to do i1 >= length(Bc)
(or even that unsigned trick to account for negative i1
s). Then we could just always use @inbounds Bc[i1]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely sure about the desired safety here. If the iteration interface is used correctly, this will always be be inbounds. If we get passed an OOB-state, do we need to raise a correct error, are we allowed to return garbage (as long as no OOB read occurs) or are we allowed to perform an OOB read (possibly with segfault) and then return garbage?
The TEST instead of CMP was actually cribbed from the iteration code for vectors. Benchmarks and Agner Fog agree that there is no price difference between these instructions. I could change to i1 >= length(Bc)
, with the disadvantage that misuse silently returns nothing
instead of throwing an OOB error in julia --check-bounds=yes
, and the advantage that we won't read OOB on most misuses, even in julia --check-bounds=no
. I think we should prefer the OOB error?
For some reason the UInt
version is much slower. I tried the following, with no success:
@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
L.sum == 0 && return nothing
Bc = L.mask.chunks
return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (UInt(0), @inbounds Bc[1]))
end
@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s::Tuple{UInt,UInt64})
Bc = L.mask.chunks
i1, c = s
while c==0
i1 += 1
i1 < length(Bc) || return nothing
@inbounds c = Bc[i1+1]
end
tz = trailing_zeros(c)+1
c = _BLSR(c)
return (i1<<6 +tz, (i1, c))
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the disadvantage that misuse silently returns nothing instead of throwing an OOB error
julia> iterate([1], 2)
julia> iterate((1,), 2)
totally fine to return nothing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I got a fast UInt version now. Only question is about what is our preferred error behavior: nothing
vs OOB error with check-bounds
and OOB read with possible segfault when given an @inbounds
invalid state. I still prefer the OOB error, tbh. Do we have a policy for that kind of question?
@inline _BLSR(x)= x & (x-1)
@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray})
L.sum == 0 && return nothing
Bc = L.mask.chunks
return Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
end
@inline function Base.iterate(L::Base.LogicalIndex{Int,<:BitArray}, s)
Bc = L.mask.chunks
i1, c = s
while c==0
i1%UInt >= length(Bc)%UInt && return nothing
i1 += 1
@inbounds c = Bc[i1]
end
tz = trailing_zeros(c)+1
c = _BLSR(c)
return ((i1-1)<<6 +tz, (i1, c))
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"When in doubt, do as the Array
s".
So, couldn't just something like
Line 707 in 0e023d0
iterate(A::Array, i=1) = (@_inline_meta; (i % UInt) - 1 < length(A) ? (@inbounds A[i], i + 1) : nothing) |
be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrays currently give nothing for OOB states, and ranges silently return garbage:
julia> iterate(1:10, 17)
(18, 18)
julia> iterate(collect(1:10), 17) === nothing
true
In view of that, I guess it would be consistent enough to return nothing
and reap the speedup when there is no @inbounds
annotation. So I'll change that.
base/multidimensional.jl
Outdated
end | ||
tz = trailing_zeros(c) + 1 | ||
c = _BLSR(c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really clever idea. Nice work.
base/bitarray.jl
Outdated
@@ -84,6 +84,7 @@ IndexStyle(::Type{<:BitArray}) = IndexLinear() | |||
const _msk64 = ~UInt64(0) | |||
@inline _div64(l) = l >> 6 | |||
@inline _mod64(l) = l & 63 | |||
@inline _BLSR(x)= x & (x-1) #zeros the last set bit. Has native instruction on many archs. needed in multidimensional.jl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pickiest possible nit: maybe _blsr
to match the surrounding capitalization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended.
924f1a1
to
ada8bc8
Compare
switched spelling of _blsr
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
Hm, strange that this seems to show both time and memory regression in Nanosoldier. They seem to be real: julia> function perf_sumlogical(A)
s = zero(eltype(A))
nrows = size(A, 1)
ncols = size(A, 2)
r = falses(nrows)
r[1:4:end] .= true
@simd for i = 1:ncols
val = @inbounds A[r, i]
s += first(val)
end
return s
end
perf_sumlogical (generic function with 1 method)
julia> using BenchmarkTools
julia> A = rand(100,100);
julia> @btime perf_sumlogical($A);
24.910 μs (203 allocations: 31.44 KiB)
julia> @eval Base begin
@inline _blsr(x)= x & (x-1)
@inline function iterate(L::Base.LogicalIndex{Int,<:BitArray})
L.sum == 0 && return nothing
Bc = L.mask.chunks
return iterate(L::Base.LogicalIndex{Int,<:BitArray}, (1, @inbounds Bc[1]))
end
@inline function iterate(L::Base.LogicalIndex{Int,<:BitArray}, s)
Bc = L.mask.chunks
i1, c = s
while c==0
i1 % UInt >= length(Bc) % UInt && return nothing
i1 += 1
@inbounds c = Bc[i1]
end
tz = trailing_zeros(c) + 1
c = _blsr(c)
return ((i1-1)<<6 + tz, (i1, c))
end
end
iterate (generic function with 201 methods)
julia> @btime perf_sumlogical($A);
86.771 μs (603 allocations: 51.75 KiB) The culprit seems to be the type assertion I've tagged below. Remove it and: julia> @btime perf_sumlogical($A);
14.063 μs (203 allocations: 31.44 KiB) |
Co-Authored-By: chethega <chethega@gmail.com>
@nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
* speed up logical indexing by bitarray * changed inbounds / OOB behavior to match array iterators switched spelling of _blsr * Update base/multidimensional.jl Co-Authored-By: chethega <chethega@gmail.com> (cherry picked from commit 44f2563)
* speed up logical indexing by bitarray * changed inbounds / OOB behavior to match array iterators switched spelling of _blsr * Update base/multidimensional.jl Co-Authored-By: chethega <chethega@gmail.com> (cherry picked from commit 44f2563)
* speed up logical indexing by bitarray * changed inbounds / OOB behavior to match array iterators switched spelling of _blsr * Update base/multidimensional.jl Co-Authored-By: chethega <chethega@gmail.com> (cherry picked from commit 44f2563)
* speed up logical indexing by bitarray * changed inbounds / OOB behavior to match array iterators switched spelling of _blsr * Update base/multidimensional.jl Co-Authored-By: chethega <chethega@gmail.com> (cherry picked from commit 44f2563)
* speed up logical indexing by bitarray * changed inbounds / OOB behavior to match array iterators switched spelling of _blsr * Update base/multidimensional.jl Co-Authored-By: chethega <chethega@gmail.com> (cherry picked from commit 44f2563)
The recent PR #29660 indicated that it is possible to significantly speed up logical indexing by bitarrays. So I consulted my Agner Fog, and wrote something faster. Also, this will need benchmarking on a variety of architectures, since it depends on the
BLSR
andTZCNT
instructions. The timings are on a 2 GHZ Broadwell, so multiply by two to get cycles / selected index.Setup:
Before:
after:
The main difference to my proposal in #29660 is the following: We need one TZCNT per index, which has 3 cycles latency. We can avoid long dependency chains on TZCNT by using BLSR instead, with one cycle latency. Without this trick, this slows down significantly. I still don't get how this adds up to 3-5 cycles per index, but that's as fast as I could make it.
The main difference to the original code is that we replace hard to predict branches by dependency chains.