
poor performance #125

Closed
johnnychen94 opened this issue Jun 4, 2019 · 6 comments · Fixed by #206
@johnnychen94
Collaborator

johnnychen94 commented Jun 4, 2019

Actually, this surprises me a bit. Any hints/ideas on how to fix this?

A similar result can be observed for `*`, `.*`, etc.

julia> x_n0f8 = rand(N0f8, 1000, 1000);

julia> x_float64 = rand(Float64, 1000, 1000);

julia> x_uint8 = rand(UInt8, 1000, 1000);

julia> @benchmark x_n0f8 .+ x_n0f8
BenchmarkTools.Trial: 
  memory estimate:  976.84 KiB
  allocs estimate:  6
  --------------
  minimum time:     3.547 ms (0.00% GC)
  median time:      3.732 ms (0.00% GC)
  mean time:        3.838 ms (1.11% GC)
  maximum time:     7.032 ms (0.00% GC)
  --------------
  samples:          1301
  evals/sample:     1

julia> @benchmark x_float64 .+ x_float64
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  4
  --------------
  minimum time:     1.214 ms (0.00% GC)
  median time:      1.346 ms (0.00% GC)
  mean time:        1.869 ms (29.19% GC)
  maximum time:     5.726 ms (71.81% GC)
  --------------
  samples:          2667
  evals/sample:     1

julia> @benchmark x_uint8 .+ x_uint8
BenchmarkTools.Trial: 
  memory estimate:  976.75 KiB
  allocs estimate:  4
  --------------
  minimum time:     80.220 μs (0.00% GC)
  median time:      87.781 μs (0.00% GC)
  mean time:        235.548 μs (18.39% GC)
  maximum time:     1.844 ms (69.85% GC)
  --------------
  samples:          10000
  evals/sample:     1
@timholy
Member

timholy commented Jun 4, 2019

I am surprised too. Best approach would be to profile it and see where it's spending time. Or step into it with a debugger and see which version of + is getting called.

@kimikage
Collaborator

kimikage commented Oct 11, 2019

One reason x_float64 .+ x_float64 is faster is that it vectorizes well with SIMD. However, I don't know yet why x_n0f8 .+ x_n0f8 is so slow.

Although it is not directly related to the subject of this issue, I think * for Normed is inefficient, since it converts the operands to a float type and then converts the product back to the raw integer type.

*(x::T, y::T) where {T <: Normed} = convert(T,convert(floattype(T), x)*convert(floattype(T), y))

The following may be faster, although it should be generalized.

function Base.:*(x::N0f8, y::N0f8)
    m = UInt32(reinterpret(x)) * UInt32(reinterpret(y))
    reinterpret(N0f8, unsafe_trunc(UInt8, (m * 0x10101 + 0x800000)>>24))
end

Edit:
Historically, the float-based multiplication was introduced in PR #36. Of course, that was the right call as a bug fix.
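The integer rounding trick above can be checked exhaustively against the float-based definition, since the raw values are only 8 bits. A quick sketch (`mul_trick` and `mul_float` are hypothetical helper names, operating directly on raw UInt8 values, where an N0f8 value is raw/255):

```julia
# For N0f8, the product's raw result should be round(a * b / 255).
# mul_trick uses the fixed-point identity 255 * 0x10101 == 2^24 - 1,
# so (m * 0x10101 + 2^23) >> 24 rounds m / 255 without any division.
function mul_trick(a::UInt8, b::UInt8)
    m = UInt32(a) * UInt32(b)
    unsafe_trunc(UInt8, (m * 0x10101 + 0x800000) >> 24)
end

# Float-based reference, mirroring the current definition of *.
mul_float(a::UInt8, b::UInt8) = round(UInt8, Float64(a) * Float64(b) / 255)

# Exhaustive check over all 256 * 256 raw pairs.
for a in 0x00:0xff, b in 0x00:0xff
    @assert mul_trick(a, b) == mul_float(a, b)
end
```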

@kimikage
Collaborator

As mentioned in #129 (comment), this problem seems to be OS-dependent. Unfortunately, Julia v1.3.0 will not directly improve the performance.

It is not easy to modify the broadcasting mechanism. I think it may be more realistic to modify the methods in FixedPointNumbers so that they are better suited to broadcasting.

@kimikage
Collaborator

kimikage commented Oct 29, 2019

Although also not directly related to the subject of this issue, the following tips are beneficial:
https://docs.julialang.org/en/v1.2/manual/performance-tips/#Pre-allocating-outputs-1
https://docs.julialang.org/en/v1.2/manual/performance-tips/#Consider-using-views-for-slices-1
https://docs.julialang.org/en/v1.2/manual/performance-tips/#Copying-data-is-not-always-bad-1

function prealloc_broadcast(f, a::T, b::T) where {T <: AbstractArray}
    y = copy(a)
    broadcast!(f, y, y, b)
end

function broadcast_with_view(f, a::T, b::T) where {T <: AbstractArray}
    broadcast(f, view(a, axes(a)...), view(b, axes(b)...))
end

function prealloc_broadcast_with_view(f, a::T, b::T) where {T <: AbstractArray}
    y = copy(a)
    broadcast!(f, y, y, view(b, axes(b)...))
end

julia> @benchmark broadcast(+, x_n0f8, x_n0f8)
BenchmarkTools.Trial:
  memory estimate:  976.80 KiB
  allocs estimate:  4
  --------------
  minimum time:     3.427 ms (0.00% GC)
  median time:      3.431 ms (0.00% GC)
  mean time:        3.520 ms (1.14% GC)
  maximum time:     32.200 ms (87.01% GC)
  --------------
  samples:          1419
  evals/sample:     1

julia> @benchmark prealloc_broadcast(+, x_n0f8, x_n0f8)
BenchmarkTools.Trial:
  memory estimate:  976.75 KiB
  allocs estimate:  3
  --------------
  minimum time:     2.505 ms (0.00% GC)
  median time:      2.510 ms (0.00% GC)
  mean time:        2.542 ms (1.13% GC)
  maximum time:     30.100 ms (91.65% GC)
  --------------
  samples:          1965
  evals/sample:     1

julia> @benchmark broadcast_with_view(+, x_n0f8, x_n0f8)
BenchmarkTools.Trial:
  memory estimate:  976.80 KiB
  allocs estimate:  4
  --------------
  minimum time:     2.238 ms (0.00% GC)
  median time:      2.244 ms (0.00% GC)
  mean time:        2.279 ms (1.42% GC)
  maximum time:     31.449 ms (92.86% GC)
  --------------
  samples:          2191
  evals/sample:     1

julia> @benchmark prealloc_broadcast_with_view(+, x_n0f8, x_n0f8)
BenchmarkTools.Trial:
  memory estimate:  976.75 KiB
  allocs estimate:  3
  --------------
  minimum time:     1.800 ms (0.00% GC)
  median time:      1.804 ms (0.00% GC)
  mean time:        1.833 ms (1.57% GC)
  maximum time:     31.257 ms (94.21% GC)
  --------------
  samples:          2723
  evals/sample:     1

@kimikage
Collaborator

I found a significant difference between x_n0f8 and x_uint8.

julia> typeof(x_n0f8)
Base.ReinterpretArray{Normed{UInt8,8},2,UInt8,Array{UInt8,2}}

julia> typeof(x_uint8)
Array{UInt8,2}

So, as you already know...

julia> z_n0f8 = replace!(_->rand(N0f8), Array{N0f8}(undef, 1000, 1000));

julia> typeof(z_n0f8)
Array{Normed{UInt8,8},2}

julia> @benchmark z_n0f8 .+ z_n0f8
BenchmarkTools.Trial:
  memory estimate:  976.75 KiB
  allocs estimate:  4
  --------------
  minimum time:     76.600 μs (0.00% GC)
  median time:      85.300 μs (0.00% GC)
  mean time:        105.110 μs (11.17% GC)
  maximum time:     30.517 ms (99.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

Are ReinterpretArrays of primitive types designed to be as fast as, or faster than, native Arrays? If so, this is probably a problem on the Julia side. If not, we should modify rand.

rand(::Type{T}, sz::Dims) where {T <: FixedPoint} = reinterpret(T, rand(rawtype(T), sz))
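One possible modification (a sketch only, not necessarily what an actual fix would do; `rand_n0f8` is a hypothetical name to avoid piracy on Base.rand): materialize the reinterpreted buffer into a plain Array, so later broadcasts take the fast path at the cost of one extra copy.

```julia
using FixedPointNumbers

# reinterpret gives a lazy ReinterpretArray; collect copies it into a
# native Array{N0f8,2}, which broadcasting handles much faster.
rand_n0f8(sz::Dims) = collect(reinterpret(N0f8, rand(UInt8, sz)))

z = rand_n0f8((1000, 1000))
typeof(z)   # Array{Normed{UInt8,8},2}, not a ReinterpretArray
```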

@timholy
Member

timholy commented Oct 30, 2019

Yes, this is a limitation of ReinterpretArray. See JuliaLang/julia#28980.

A (nasty) workaround is to steal the pointer and create a new array with unsafe_wrap.
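A minimal sketch of that unsafe_wrap workaround (variable names are illustrative; GC.@preserve is needed because the wrapped Array does not keep the underlying buffer alive):

```julia
using FixedPointNumbers

# Steal the pointer from the raw UInt8 buffer and wrap it as a plain
# Array{N0f8,2}, sidestepping the slow ReinterpretArray path.
raw = rand(UInt8, 1000, 1000)
x = unsafe_wrap(Array, convert(Ptr{N0f8}, pointer(raw)), size(raw))
GC.@preserve raw begin
    y = x .+ x   # dispatches on Array{N0f8,2}, so broadcasting is fast
end
```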
