
Fix for i in iter #65

Merged: 3 commits merged into JuliaSIMD:master on Mar 5, 2020

Conversation

@timholy (Contributor) commented Mar 4, 2020

This supports syntax like

for i in iter
    # do stuff
end

Because iter could be a CartesianIndices, I presume we have to treat it the same way eachindex(A) gets treated.

Some of this is just whitespace changes (courtesy of Atom), but you can choose to hide whitespace when reviewing. Also note that I added a couple of using statements at the top of the test files to make it easier to run them independently. If you don't like that, I can remove that part.
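To make the new capability concrete, here is a minimal sketch; mysum and its loop body are illustrative, not from the PR. The point is that the iterator is a plain variable rather than an inline range or eachindex(A) expression:

using LoopVectorization

function mysum(A)
    s = zero(eltype(A))
    iter = eachindex(A)   # the iterator is bound to a variable...
    @avx for i in iter    # ...and used by name, the syntax this PR supports
        s += A[i]
    end
    s
end

A = rand(100);
mysum(A) ≈ sum(A)  # true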

@timholy (Contributor, Author) commented Mar 4, 2020

One cautionary note: in a test script here I get this:

julia> @btime old2d!($out, $A, $kern);
  67.972 μs (0 allocations: 0 bytes)

julia> @btime avx2d!($out, $A, $kern);
  787.508 μs (0 allocations: 0 bytes)

It seems possible in principle that this is a consequence of this PR.

As you'll see in the gist (for at least some of these), other challenges include:

  • the use of OffsetArrays
  • a lot of code is generic multidimensional
  • element types include N0f8 from the FixedPointNumbers package, which is a mapping of 0x00:0xff->0.0:1/255:1.0 (i.e., 8-bit values that span the range 0.0 to 1.0, inclusive).
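For concreteness, here is what that mapping looks like at the REPL (a quick illustration, not from the PR):

julia> using FixedPointNumbers

julia> x = N0f8(0.6)          # nearest representable value to 0.6
0.6N0f8

julia> reinterpret(x)         # the underlying 8-bit storage: 153 == 0x99
0x99

julia> Float32(x)             # 153/255 == 0.6
0.6f0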

@codecov-io commented Mar 4, 2020

Codecov Report

Merging #65 into master will decrease coverage by 0.02%.
The diff coverage is 89.47%.


@@            Coverage Diff             @@
##           master      #65      +/-   ##
==========================================
- Coverage   92.95%   92.93%   -0.03%     
==========================================
  Files          25       25              
  Lines        2810     2815       +5     
==========================================
+ Hits         2612     2616       +4     
- Misses        198      199       +1
Impacted Files             Coverage          Δ
src/LoopVectorization.jl   100% <ø>          (ø)
src/graphs.jl              84.84% <89.47%>   (-0.1%)


@timholy (Contributor, Author) commented Mar 4, 2020

Update: yep, that's a lot of it, but not all of it. I added an avx2d_unpack! that converts the OffsetArray axes into UnitRanges, and I get

julia> @btime avx2d_unpack!($out, $A, $kern);
  101.279 μs (0 allocations: 0 bytes)

which is much better but still worse than the @avx-free version. CC @stillyslalom.

@chriselrod (Member) commented Mar 5, 2020

# the definitions of old2d! and avx2d_unpack! from the gist go here
using VectorizationBase, OffsetArrays, BenchmarkTools
# gesp is "get element strided pointer"; this shifts the base pointer to
# compensate for the OffsetArray's offsets
VectorizationBase.stridedpointer(A::OffsetArray) = VectorizationBase.gesp(stridedpointer(parent(A)), -1 .* A.offsets)

function avx2d_unpackall!(out::AbstractMatrix, A::AbstractMatrix, kern::OffsetArray, R=CartesianIndices(out), z=zero(eltype(out)))
    rng1k, rng2k = axes(kern)
    rng1kf, rng1kl = first(rng1k), last(rng1k)
    rng2kf, rng2kl = first(rng2k), last(rng2k)
    rng1,  rng2  = R.indices
    rng1f, rng1l = first(rng1), last(rng1)
    rng2f, rng2l = first(rng2), last(rng2)
    # Manually unpack the OffsetArray
    kernA = parent(kern)
    o1, o2 = kern.offsets
    @avx for j in rng2f:rng2l, i in rng1f:rng1l
        tmp = z
        for jk in rng2kf:rng2kl, ik in rng1kf:rng1kl
            tmp += A[i+ik,j+jk]*kernA[ik-o1,jk-o2]
        end
        out[i,j] = tmp
    end
    out
end

A = rand(Float32, 100, 100);
kern = centered(rand(Float32, 3, 3));   # centered (exported by ImageFiltering) shifts the kernel axes to -1:1
out1 = OffsetArray(similar(A, size(A).-2), 1, 1);   # stay away from the edges of A
out2 = similar(out1); out3 = similar(out1);

@benchmark old2d!($out1, $A, $kern)
@benchmark avx2d_unpack!($out2, $A, $kern)
@benchmark avx2d_unpackall!($out3, $A, $kern)

out1 ≈ out2
out1 ≈ out3

On my computer (with avx512), this yields

julia> out1 = OffsetArray(similar(A, size(A).-2), 1, 1);   # stay away from the edges of A

julia> out2 = similar(out1); out3 = similar(out1);

julia> @benchmark old2d!($out1, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.879 μs (0.00% GC)
  median time:      66.952 μs (0.00% GC)
  mean time:        67.110 μs (0.00% GC)
  maximum time:     107.237 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2d_unpack!($out2, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     109.412 μs (0.00% GC)
  median time:      109.645 μs (0.00% GC)
  mean time:        109.798 μs (0.00% GC)
  maximum time:     156.803 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2d_unpackall!($out3, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.055 μs (0.00% GC)
  median time:      3.074 μs (0.00% GC)
  mean time:        3.078 μs (0.00% GC)
  maximum time:     5.444 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> out1 ≈ out2
true

julia> out1 ≈ out3
true

Applying @avx to the outermost loop resulted in a roughly 20x performance gain vs the no-avx version (and 35x vs applying it to the inner loops).

Normally, applying @avx to the outermost loop should yield the greatest benefit, as this gives it more opportunity to get creative with nested loop optimizations.

It may be worth passing size information when some loops are much longer than others (e.g., the outer loops vs the inner loops), to reduce the risk of the optimizer doing something unproductive. For example, it could decide to unroll 4x and vectorize (an 8x or 16x factor for single precision, depending on avx/avx512), yielding a total of 32x or 64x unrolling. If one loop is only 7 iterations, it had best not do that.

I added "size hint" infrastructure that there currently isn't any way to take advantage of. These are used for optimization decisions, but not for lowering. Currently, the hint is always set to 1024. Perhaps we can use this in some way to make sure it optimizes as though these inner loops are short, while still keeping the number of iterations dynamic.

challenges include: the use of OffsetArrays

I should add the offset-array specialization of stridedpointer from the top of this post somewhere in the package. Thankfully, extending support to new array types is fairly straightforward so long as you can get a pointer to the memory and the arrays are strided.
Or perhaps I ought to add a general definition that uses ntuple(n -> 1-firstindex(A, n), Val(N)) instead of -1 .* A.offsets.
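A sketch of what that more general definition might look like, assuming the same gesp API used above (illustrative, not committed code):

using VectorizationBase, OffsetArrays

# Derive the offsets from firstindex instead of reaching into OffsetArray
# internals, so the same pattern could extend to other offset-axes types.
function VectorizationBase.stridedpointer(A::OffsetArray{T,N}) where {T,N}
    offsets = ntuple(n -> 1 - firstindex(A, n), Val(N))
    VectorizationBase.gesp(VectorizationBase.stridedpointer(parent(A)), offsets)
end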

challenges include: a lot of code is generic multidimensional

It would be fantastic to add CartesianIndices support.
The @avx macro calls a generated function. This is used to generate different code depending on whether or not arrays are transposed, as loop order is likely to change as a result. (@_avx is a more naive macro that immediately emits code, making generic assumptions.)
Similarly, it would be possible to make the number of loops change based on the type of the CartesianIndices.
The trick would be substituting for the associated index on each ArrayReference, replacing it with multiple indices.
It would be great to have this, so I'll probably start working on it fairly soon.
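For reference, the decomposition such a rewrite would need to perform, in plain Julia (just to illustrate the index splitting, not the planned implementation):

R = CartesianIndices((2:4, 3:5))
R.indices                # (2:4, 3:5): one range per loop after splitting

A = rand(6, 6); s = 0.0
for I in R
    i, j = Tuple(I)      # a CartesianIndex splits into per-dimension indices
    s += A[i, j]         # A[I] and A[i, j] address the same element
end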

challenges include: element types include N0f8 from the FixedPointNumbers package, which is a mapping of

Two problems here. The first is that support for mixing types within the loops is currently lousy. I originally envisioned everything being Float64. I quickly realized that it would be so easy to support other types that it would be silly not to, but the extensions from there have all been hacks that haven't really addressed the underlying problem that things can be of different types.
This should all be redone eventually; while I've fixed almost all the issues that have come up, hacks are brittle and much more likely to crack than an engineered approach.

The second problem -- supporting generic number types -- is much harder.
For anyone reading and wanting a summary of the problem: LoopVectorization uses NTuple{N,Core.VecElement{T}}, where T is something that lets this map to an LLVM vector rather than an LLVM array. LLVM vectors have lots of operations defined on them, many of which map closely to assembly instructions on many architectures, and are how it tells LLVM exactly what it needs to do.
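Concretely, a quick REPL illustration of that representation:

julia> v = ntuple(i -> Core.VecElement(Float64(i)), 4)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0))

julia> typeof(v)   # maps to the LLVM vector type <4 x double>
NTuple{4, VecElement{Float64}}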

The problem then is that given type foo, and operations a user has defined on foo like + and /, I need to be able to somehow construct these NTuple{N,Core.VecElement{T}}s, and also need to know what the + and / are supposed to do in terms of primitive types.

Here is an issue specific to Complex numbers, and here is an issue about supporting generic types. I also discussed it briefly at the end of this comment, but I don't have any good ideas.

The manual approach of adding support per element type would work, but I am sure there is an approach more aligned with (and showing off) Julia's generic power.
Perhaps I could define SVec, the wrapper type for NTuple{W,Core.VecElement{T}}, as <:Number. Then a lot of operations would work, as generic numbers could sit inside the fields of the struct. But then some work is needed to figure out how to load and store elements from arrays of structs.

I'll take a look at what's going wrong with the iterables performance-wise soon.

@timholy (Contributor, Author) commented Mar 5, 2020

Love it!

challenges include: element types include N0f8 from the FixedPointNumbers package, which is a mapping of

Two problems here. The first is that support is currently lousy for mixing types within the loops...
The second problem -- supporting generic number types -- is much harder...I need to be able to somehow construct these NTuple{N,Core.VecElement{T}}s, and also need to know what the + and / are supposed to do in terms of primitive types.

The ultimate solution seems to be moving this into the compiler, since there you have everything (you know all the types). But in the meanwhile, would it help to use https://github.com/JuliaArrays/TiledIteration.jl and convert chunks of the array to a common numeric type?
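A sketch of what I have in mind; TileIterator is TiledIteration's real API, but the function, the tile size, and the kernel call are placeholders:

using TiledIteration

function filter_tiled!(out, A, kern, tilesz=(128, 128))
    buf = Array{Float32}(undef, tilesz)            # scratch tile, reused
    for tileaxs in TileIterator(axes(A), tilesz)   # tuples of index ranges
        tile = view(A, tileaxs...)
        bufv = view(buf, axes(tile)...)
        bufv .= tile                               # elementwise N0f8 -> Float32
        # ... apply an @avx kernel to bufv, writing into view(out, tileaxs...) ...
    end
    out
end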

@timholy (Contributor, Author) commented Mar 5, 2020

As an example of where your unrolling idea would be nice, consider handling colors:

using ImageCore
c = RGB(0.12, 0.45, 0.38)

Typically the filter is real-valued but pixels are often color.
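One pattern that might bridge the gap today (channelview is ImageCore's real API; the rest is illustrative):

using ImageCore

img = rand(RGB{Float32}, 100, 100)
chan = channelview(img)   # 3×100×100 Float32 view: color becomes a leading axis
# a real-valued kernel could then be applied with one extra (length-3) loop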

@chriselrod (Member) commented

I'm going to push to this PR soon. I'm changing how loop bounds are handled.

…rables, and handle different starting offsets of OffsetArrays based on linear vs cartesian indexing.
@chriselrod (Member) commented

The ultimate solution seems to be moving this into the compiler, since there you have everything (you know all the types).

Ultimately, it would be great if MLIR (or another project) renders the performance side of LoopVectorization obsolete. Punting the loop to a generated function does give us type information useful for some things, but not nearly as much information as is available to the compiler.

But in the meanwhile, would it help to use https://github.com/JuliaArrays/TiledIteration.jl and convert chunks of the array to a common numeric type?

What sort of type conversions can it do?
Many examples should work fine as is.
However, for some problems, TiledIteration would be worth using for the sake of tiling the loops into cache-friendly array sizes, and for its threading support. By (recursively) dividing matrices into 64x64 blocks and then applying a LoopVectorization-based kernel, Gaius.jl was about 3x faster in single-threaded mode on my computer than applying the kernels directly to 1000x1000 matrices.
I think the advantage will be smaller for image filtering problems, because you aren't forced to pass over the array A O(N) times; but depending on the kernels, tiling may still be worthwhile.

In my recent commits, eachindex (and the symbol fallback) now always checks both the first and last indices.
Additionally, I realized that while eachindex on a 1-dimensional OffsetArray returns its offset axis, on an OffsetArray of dimension 2 or greater it returns a OneTo. Therefore, whenever the dimension is 2 or greater, offsetting the stridedpointer as I did above would only work if no one is calling eachindex on the array.
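The behavior in question, in plain OffsetArrays (outputs summarized in comments):

using OffsetArrays

v = OffsetArray(rand(3), -1)          # 1-d, indices 0:2
eachindex(v)                          # an offset range equal to 0:2

M = OffsetArray(rand(3, 3), -1, -1)   # 2-d, cartesian indices start at (0, 0)
eachindex(M)                          # Base.OneTo(9): linear, offsets dropped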

@timholy (Contributor, Author) commented Mar 5, 2020

No rush; I'm playing with an attempt to add support for external passes to Julia's compiler. The idea is to allow you to set up a callback function that receives the type-inferred CodeInfo with something like

macro avx(ex)
    esc(Expr(:block, Expr(:meta, Expr(:external_pass, :start, avx_pass)), ex, Expr(:meta, Expr(:external_pass, :stop))))
end

where you've defined the function avx_pass to take a Core.Compiler.OptimizationState.

If this works you'll really hate me because it would basically encourage you to rewrite this package from scratch 😦. OTOH in the long run it should make a lot of things a lot easier.

@chriselrod (Member) commented

I don't seem to know how to push to PRs; I pushed here by mistake:
https://github.com/chriselrod/LoopVectorization.jl/tree/teh/loops
Now:

julia> @benchmark old2d!($out2, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     67.172 μs (0.00% GC)
  median time:      67.442 μs (0.00% GC)
  mean time:        67.769 μs (0.00% GC)
  maximum time:     111.490 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2d!($out2, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     130.930 μs (0.00% GC)
  median time:      132.447 μs (0.00% GC)
  mean time:        132.576 μs (0.00% GC)
  maximum time:     188.808 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark avx2douter!($out3, $A, $kern)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.662 μs (0.00% GC)
  median time:      2.684 μs (0.00% GC)
  mean time:        2.686 μs (0.00% GC)
  maximum time:     4.959 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

They're defined here (and tests pass), using ranges for iteration.

@timholy (Contributor, Author) commented Mar 5, 2020

Therefore, whenever the dimension is 2 or greater, offsetting the stridedpointer as I did above would only work if someone isn't eachindexing the array.

If it's an IndexLinear() array (which holds for any OffsetArray wrapper around any other kind of IndexLinear() array) and they want cartesian indices, they should be doing for I in CartesianIndices(A).

@timholy (Contributor, Author) commented Mar 5, 2020

I don't seem to know how to push to PRs; I pushed here by mistake: https://github.com/chriselrod/LoopVectorization.jl/tree/teh/loops

I don't pretend to really understand this package, but FWIW it looks good to me! If you're happy merge whenever. Thanks for the amazing support! I'll keep you posted about the compiler work.

@chriselrod (Member) commented

I can merge this and that commit separately, or I can wait until you merge that commit into this PR (https://github.com/timholy/LoopVectorization.jl/pull/1) and then merge it.

On indexing an OffsetArray, the problem was that it uses the offsets when indexed with cartesian or multiple indices, but not when indexed linearly. LoopVectorization uses fat pointers (which carry their strides) internally for indexing. My original idea was to offset the base of the pointer to compensate for the OffsetArray's offsets and return an ordinary stridedpointer, but that wouldn't reproduce the different behaviors in the different situations.
So I figured the simplest approach was to introduce a new type with the same behavior pattern.

Thanks for the PR and interest. I'll definitely be interested in your compiler work, especially if, like @code_typed, it can strip away most of Julia's abstractions, thus solving the arbitrary-struct problem. I'll just have to make sure I don't lose my job for playing with loops all day.

@timholy (Contributor, Author) commented Mar 5, 2020

Somehow it's not updating that PR. Feel free to just open a PR at that teh/loops branch if that's easier.

@chriselrod chriselrod merged commit 2ae8a78 into JuliaSIMD:master Mar 5, 2020
@timholy timholy deleted the teh/loops branch March 6, 2020 07:16