ntoh / bswap are 10x slower when operating in-place on reinterpret array #42227

Closed
Moelf opened this issue Sep 13, 2021 · 20 comments
Moelf commented Sep 13, 2021

julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.710 μs … 37.441 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.936 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.776 μs ±  1.804 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▃▅▄▆▇█▆▃                        
  ▂▂▂▂▁▂▁▁▂▁▁▂▁▂▁▂▂▃▃▄▆▇█▇▆▆▅▇█████████▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  10.7 μs         Histogram: frequency by time          26 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.


julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  106.632 μs … 296.513 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.787 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.698 μs ±   3.539 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▃ ▁▆▄ ▆█▃▂▇▆▂▁▃▁ ▄▅▁▁▂▂          ▁  ▂▁ ▁▁                 ▂
  ▆▅▄██▆███▇███████████████████▇▇▇▇█▆███▇▇█████▇▇▇▆▇▇▇▇▇▄▅▇▆▆▅▅ █
  107 μs        Histogram: log(frequency) by time        118 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  173.369 μs … 388.629 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     181.555 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.340 μs ±   9.220 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▃▄ ▁█▆▂      ▁ ▂▁▁▃▇▄▁▃▆▂▁▆▇▃ ▁▁ ▁▁ ▁▁▁▁ ▁▂▁▁▁▁     ▂
  ▅▃▁▁▄▁▁▇▄▁█████████▆▇▇███████████████████████████████████▇▆▆▅ █
  173 μs        Histogram: log(frequency) by time        190 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Moelf commented Sep 13, 2021

It should be fixable with something like:

        t1 = Base.@_gc_preserve_begin rawdata
        real_data = unsafe_wrap(Array, pointer(reinterpret(Int32, rawdata)), 1_000_000÷4)
        real_data .= ntoh.(real_data)
        vov = VectorOfVectors(real_data, rawoffsets, ArraysOfArrays.no_consistency_checks)
        Base.@_gc_preserve_end t1
        return vov

but for some reason GC isn't happy...
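For reference, here is the same idea with the public GC.@preserve API instead of the internal @_gc_preserve_begin/@_gc_preserve_end pair (a minimal sketch; the function name is made up, and rawdata is assumed to be a Vector{UInt8} whose length is a multiple of 4):

        function ntoh_inplace_unsafe!(rawdata::Vector{UInt8})
            GC.@preserve rawdata begin
                # View the same bytes as Int32 without copying and without
                # indexing through a ReinterpretArray wrapper.
                real_data = unsafe_wrap(Array, Ptr{Int32}(pointer(rawdata)), length(rawdata) ÷ 4)
                real_data .= ntoh.(real_data)
            end
            return rawdata
        end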

jakobnissen commented:

The inefficiency comes from indexing into the reinterpreted array being slower than indexing into a normal Array. The ntoh is not relevant. I don't think there are any open issues about the inefficiency of indexing into reinterpreted arrays; at least, I couldn't find any.
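A quick way to see the indexing overhead in isolation (a sketch; a later comment in this thread benchmarks essentially the same thing):

julia> using BenchmarkTools

julia> r = reinterpret(Int32, rand(UInt8, 4 * 10^4)); v = rand(Int32, 10^4);

julia> @btime sum($r);   # indexing through the ReinterpretArray

julia> @btime sum($v);   # indexing a plain Vector{Int32}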

Moelf commented Sep 13, 2021

I'm thinking there's a (theoretical) shortcut: in-place ntoh on a reinterpreted array should be as fast as on a normal array, since the byte layout is exactly the same.
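One concrete form of that shortcut (a sketch, not a proposed Base implementation): byte-swapping the i-th Int32 only reverses four adjacent bytes of the parent Vector{UInt8}, so the whole operation can run on the parent array without ever indexing through the ReinterpretArray wrapper:

julia> function bswap32!(raw::Vector{UInt8})
           # Reverse each aligned group of 4 bytes in place.
           @inbounds for i in 1:4:length(raw) - 3
               raw[i], raw[i+3] = raw[i+3], raw[i]
               raw[i+1], raw[i+2] = raw[i+2], raw[i+1]
           end
           return raw
       end
bswap32! (generic function with 1 method)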

Moelf commented Sep 13, 2021

I now see a related previous discussion: #25014

timholy commented Sep 13, 2021

It may not even have anything to do directly with that issue; instead it looks like a SIMD vectorization issue.

julia> function run_ntoh2!(y, n=1000)    # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @inbounds @simd for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh2! (generic function with 2 methods)

julia> x = rand(Int32, 1_000_000÷4);

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @btime run_ntoh2!(x);
  22.733 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh2!(y);
  122.921 ms (1 allocation: 32 bytes)

julia> z = reinterpret(Int32, rand(UInt8, 1_000_000));

julia> @btime run_ntoh2!(z);
  276.921 ms (1 allocation: 32 bytes)

Interestingly, the difference between x and y mostly seems to be that one vectorizes and the other does not. @chriselrod, any ideas? I tried with @turbo but got

ERROR: MethodError: no method matching bswap(::VectorizationBase.Vec{8, Int32})

Moelf commented Sep 13, 2021

I also don't understand why z is so much slower than y; shouldn't z be faster, since it doesn't have the extra reshape wrapper?

chriselrod commented:

> I tried with @turbo but got

Fixed: JuliaSIMD/VectorizationBase.jl@5636185

Now:

julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> function run_ntoh3!(y, n=1000)    # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @turbo for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh3! (generic function with 2 methods)

julia> function run_ntoh2!(y, n=1000)    # easier to profile or `@code_llvm` without all the tuning & inference with `@btime`
           for i = 1:n
               @inbounds @simd for j in eachindex(y)
                   y[j] = ntoh(y[j])
               end
           end
           return y
       end
run_ntoh2! (generic function with 2 methods)

julia> x = rand(Int32, 1_000_000÷4);

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @btime run_ntoh2!($x);
  12.122 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($x);
  12.631 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh2!($y);
  79.918 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($y);
  12.953 ms (0 allocations: 0 bytes)

julia> z = reinterpret(Int32, rand(UInt8, 1_000_000));

julia> @btime run_ntoh2!($z);
  101.174 ms (0 allocations: 0 bytes)

julia> @btime run_ntoh3!($z);
  12.000 ms (0 allocations: 0 bytes)

I didn't add ntoh or bswap to LoopVectorization's cost table, so it is treating them as expensive and not unrolling.

Starting Julia with the environment variable JULIA_LLVM_ARGS set:

JULIA_LLVM_ARGS="--pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize" julia $(MISC_JULIA_FLAGS)

I get

julia> run_ntoh2!(x, 1);
remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 4)

julia> run_ntoh2!(y, 1);
remark: simdloop.jl:75:0: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
remark: simdloop.jl:75:0: loop not vectorized

julia> run_ntoh2!(z, 1);
remark: simdloop.jl:75:0: the cost-model indicates that interleaving is not beneficial
remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 1)

So x and z are vectorized, while y is not. I tried adding @simd ivdep, but y still refused to vectorize because of the "unsafe dependent memory operations in loop".
While z did vectorize, the generated code was awful.

Here is the loop for run_ntoh2!(x,1):

.LBB0_4:                                # %vector.body
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovdqu ymm1, ymmword ptr [rsi + 4*rdx]
        vmovdqu ymm2, ymmword ptr [rsi + 4*rdx + 32]
        vmovdqu ymm3, ymmword ptr [rsi + 4*rdx + 64]
        vmovdqu ymm4, ymmword ptr [rsi + 4*rdx + 96]
        vpshufb ymm1, ymm1, ymm0
        vpshufb ymm2, ymm2, ymm0
        vpshufb ymm3, ymm3, ymm0
        vpshufb ymm4, ymm4, ymm0
        vmovdqu ymmword ptr [rsi + 4*rdx], ymm1
        vmovdqu ymmword ptr [rsi + 4*rdx + 32], ymm2
        vmovdqu ymmword ptr [rsi + 4*rdx + 64], ymm3
        vmovdqu ymmword ptr [rsi + 4*rdx + 96], ymm4
        add     rdx, 32
        cmp     r10, rdx
        jne     .LBB0_4

4 loads, 4 shuffles, 4 stores to process 4*8 = 32 Int32.

While this is the nonsense LLVM produces for run_ntoh2!(z,1):

.LBB0_4:                                # %vector.body
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovdqu ymm4, ymmword ptr [rcx + 4*rbx]
        vpmovdb xmm4, ymm4
        vmovdqu xmm5, xmmword ptr [rcx + 4*rbx]
        vmovdqu xmm6, xmmword ptr [rcx + 4*rbx + 16]
        vmovdqa xmm7, xmm5
        vpermt2b        xmm7, xmm8, xmm6
        vmovdqa xmm0, xmm5
        vpermt2b        xmm0, xmm1, xmm6
        vpermt2b        ymm5, ymm2, ymm6
        vpslld  ymm5, ymm5, 24
        vpmovzxbd       ymm0, xmm0              # ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
        vpslld  ymm0, ymm0, 16
        vpor    ymm0, ymm5, ymm0
        vpmovzxbd       ymm5, xmm7              # ymm5 = xmm7[0],zero,zero,zero,xmm7[1],zero,zero,zero,xmm7[2],zero,zero,zero,xmm7[3],zero,zero,zero,xmm7[4],zero,zero,zero,xmm7[5],zero,zero,zero,xmm7[6],zero,zero,zero,xmm7[7],zero,zero,zero
        vpslld  ymm5, ymm5, 8
        vpmovzxbd       ymm4, xmm4              # ymm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero,xmm4[4],zero,zero,zero,xmm4[5],zero,zero,zero,xmm4[6],zero,zero,zero,xmm4[7],zero,zero,zero
        vpternlogd      ymm4, ymm0, ymm5, 254
        vpshufb ymm0, ymm4, ymm3
        vpmovdb xmm4, ymm0
        vpsrld  ymm5, ymm0, 8
        vpmovdb xmm5, ymm5
        vpunpcklbw      xmm4, xmm4, xmm5        # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
        vpsrld  ymm5, ymm0, 16
        vpmovdb xmm5, ymm5
        vpsrld  ymm0, ymm0, 24
        vpmovdb xmm0, ymm0
        vpunpcklbw      xmm0, xmm5, xmm0        # xmm0 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
        vpunpcklwd      xmm5, xmm4, xmm0        # xmm5 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
        vpunpckhwd      xmm0, xmm4, xmm0        # xmm0 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
        vmovdqu xmmword ptr [rcx + 4*rbx], xmm5
        vmovdqu xmmword ptr [rcx + 4*rbx + 16], xmm0
        add     rbx, 8
        cmp     r10, rbx
        jne     .LBB0_4

Processing just 8 Int32 per iteration, i.e. doing a lot more work to compute far fewer results.
Not sure what is going wrong; I'd have to spend more time looking at it to get some idea.

We could add ntoh and/or bswap to LV's cost table if we want it to be unrolled more. That'd probably help for much smaller arrays; at this size the loop is memory-bandwidth-bound, so it makes no difference.
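(Rough arithmetic from the timings above supports that: each call reads and writes the full 1 MB array 1000 times, i.e. roughly 2 GB of traffic in about 12 ms, on the order of 170 GB/s, so the loop is limited by cache/memory bandwidth rather than by instruction throughput.)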

Moelf commented Sep 13, 2021

Does the LV version not work with reinterpreted float arrays?

chriselrod commented:

> Does the LV version not work with reinterpreted float arrays?

Yes, it works with x, y, and z above, getting similar performance in each case.

Moelf commented Sep 13, 2021

But those are all reinterpreted to Int32; what about Float32 and Float64?

chriselrod commented Sep 13, 2021

I'll have to add the missing bswap methods.
EDIT: JuliaSIMD/VectorizationBase.jl@c775fe9
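For illustration, the scalar version of that idea (a sketch of the approach, not the actual VectorizationBase code; the names are made up): bswap on a float round-trips through the same-size unsigned integer, and the SIMD methods do the equivalent lane-wise.

julia> bswap_f32(x::Float32) = reinterpret(Float32, bswap(reinterpret(UInt32, x)))
bswap_f32 (generic function with 1 method)

julia> bswap_f64(x::Float64) = reinterpret(Float64, bswap(reinterpret(UInt64, x)))
bswap_f64 (generic function with 1 method)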

timholy commented Sep 14, 2021

> I also don't understand why z is so much slower than y; shouldn't z be faster, since it doesn't have the extra reshape wrapper?

It's precisely because of that "additional reshape" (which is just a different kind of ReinterpretArray, not an additional wrapper): #37559. Clearly it didn't fix absolutely everything (witness this issue), but the performance gains in a few cases were as big as 20x.
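To make that concrete (a sketch): the two flavors differ only in the final type parameter of ReinterpretArray, so neither one wraps the other.

julia> a = rand(UInt8, 8);

julia> typeof(reinterpret(Int32, a))
Base.ReinterpretArray{Int32, 1, UInt8, Vector{UInt8}, false}

julia> typeof(reinterpret(reshape, Int32, reshape(a, 4, 2)))
Base.ReinterpretArray{Int32, 1, UInt8, Matrix{UInt8}, true}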

Moelf commented Sep 14, 2021

So we could still use another fix in Base, so that in-place ntoh on a ReinterpretArray gets its potential 10x speedup?

timholy commented Sep 15, 2021

It would be nice to know which specific operations are judged as unsafe for vectorization by LLVM.
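One way to check by hand (a sketch, reusing run_ntoh2! from above): dump the optimized IR and look for whether a `vector.body` block appears, or rerun with the JULIA_LLVM_ARGS remark flags shown earlier to see which dependence LLVM complains about.

julia> y = reinterpret(reshape, Int32, rand(UInt8, 4, 1_000_000÷4));

julia> @code_llvm run_ntoh2!(y, 1)   # no `vector.body` block for the case that fails to vectorize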

Moelf commented Sep 15, 2021

Yeah, something seems to be systematically inhibiting vectorization of ReinterpretArray:

julia> @benchmark sum(a) setup=(a = reinterpret(Int32, rand(UInt8, 10^4*4)))
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min … max):  610.246 ns …  1.442 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     629.880 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   635.574 ns ± 35.472 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▆█▆▂▁▂▂▁▁                                                 ▂
  ▇▁██████████▇▇▇▅▄▄▅▃▅▃▅▁▁▃▃▁▁▁▁▁▁▁▃▃▁▁▁▃▃▁▁▁▃▁▁▁▅▃▄▃▃▄▄▃▁▄▄▄ █
  610 ns        Histogram: log(frequency) by time       881 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(b) setup=(b = rand(Int32, 10^4))
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min … max):  507.021 ns … 840.611 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     517.041 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   519.393 ns ±  11.290 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▂▃▅██▇▆▅▅▄▂▂▁        ▁                                 ▂
  ▄▅▆▇█████████████████▇▇▆▇███████▇▇▆▆▆▆▇▆▇▆▇▇▆▅▆▅▅▅▅▄▅▅▄▄▄▆▅▆▆ █
  507 ns        Histogram: log(frequency) by time        558 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

timholy commented Sep 16, 2021

Don't use reinterpret(T, a). Use reinterpret(reshape, T, a). The following display uses JuliaCI/BenchmarkTools.jl#255 to make it easier to compare the histograms:

[image: side-by-side BenchmarkTools histograms for the three sum benchmarks]

As you can see, the first and third are much closer in terms of time.
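To reproduce the comparison locally (a sketch; the first two benchmarks are the ones from the previous comment, the third uses the reshape flavor):

julia> @benchmark sum(a) setup=(a = reinterpret(Int32, rand(UInt8, 10^4*4)))

julia> @benchmark sum(b) setup=(b = rand(Int32, 10^4))

julia> @benchmark sum(c) setup=(c = reinterpret(reshape, Int32, rand(UInt8, 4, 10^4)))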

Moelf commented Sep 16, 2021

Yeah, but the raw bytes are of course read in as a flat Vector. Would you recommend:

julia> a = rand(UInt8, 100);

julia> reinterpret(Int32, a);

julia> reinterpret(reshape, Int32, a);
ERROR: ArgumentError: `reinterpret(reshape, Int32, a)` where `eltype(a)` is UInt8 requires that `axes(a, 1)` (got Base.OneTo(100)) be equal to 1:4 (from the ratio of element sizes)
Stacktrace:
 [1] (::Base.var"#throwsize1#282")(a::Vector{UInt8}, T::Type)
   @ Base ./reinterpretarray.jl:59
 [2] reinterpret(#unused#::typeof(reshape), #unused#::Type{Int32}, a::Vector{UInt8})
   @ Base ./reinterpretarray.jl:72
 [3] top-level scope
   @ REPL[3]:1

julia> reinterpret(reshape, Int32, reshape(a, 4, 25));

EDIT: eh, this double reshape appears to be faster irl too...
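A hypothetical convenience wrapper for that pattern (the name reinterpret_reshape is made up; assumes sizeof(T) > 1 and that length(a) is a multiple of sizeof(T)):

julia> function reinterpret_reshape(::Type{T}, a::Vector{UInt8}) where {T}
           n = sizeof(T)
           # reinterpret(reshape, ...) folds the leading byte dimension away,
           # returning a 1-d, length(a) ÷ n element view of a.
           return reinterpret(reshape, T, reshape(a, n, length(a) ÷ n))
       end
reinterpret_reshape (generic function with 1 method)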

Moelf commented Feb 12, 2022

This has since been partially fixed:

julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.630 μs … 38.643 μs  ┊ GC (min … max): 0.00% … 0.00%


julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   97.315 μs … 146.138 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.010 μs … 57.449 μs  ┊ GC (min … max): 0.00% … 0.00%

vtjnash commented Feb 13, 2022

Does #43955 help any?

Moelf commented Feb 13, 2022

julia> @time sum(length, tree.Muon_pt)
  2.881126 seconds (20.48 k allocations: 2.378 GiB, 0.93% gc time)
149322456

julia> @time sum(length, tree.Muon_pt)
  2.918528 seconds (21.00 k allocations: 2.378 GiB, 1.12% gc time)
149322456

looks like free speedup to me


julia> @benchmark x .= ntoh.(x) setup=begin x = rand(UInt32,1_000_000÷4) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.239 μs … 38.322 μs  ┊ GC (min … max): 0.00% … 0.00%


julia> @benchmark begin y = reinterpret(Int32, x); map!(ntoh, y, y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  113.273 μs … 146.015 μs  ┊ GC (min … max): 0.00% … 0.00%

julia> @benchmark begin y = reinterpret(Int32, x); y .= ntoh.(y) end setup=begin x = rand(UInt8,1_000_000) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.159 μs … 37.931 μs  ┊ GC (min … max): 0.00% … 0.00%

The in-place version indeed seems better!
