
wrap range variable in a let block for at_threads #24688

Merged · 1 commit · Nov 21, 2017
Conversation

vchuravy (Sponsor Member)

This works around performance issues I have been seeing due to #15276.

In particular, this function showed no performance scaling when I recently introduced threading to students:

function threaded_sum(arr)
    @assert length(arr) % nthreads() == 0
    results::Vector{Float64} = zeros(eltype(arr), nthreads())
    @threads for tid in 1:nthreads()
        # split work
        len = div(length(arr), nthreads())
        domain = ((tid-1)*len +1):tid*len
        acc = zero(eltype(arr))
        @simd for i in domain
            @inbounds acc += arr[i]
        end
        results[tid] = acc
    end
    sum(results)
end
julia> data = rand(100_000);
julia> @benchmark threaded_sum($data)
BenchmarkTools.Trial: 
  memory estimate:  6.09 MiB
  allocs estimate:  398994
  --------------
  minimum time:     7.674 ms (0.00% GC)
  median time:      8.117 ms (0.00% GC)
  mean time:        8.681 ms (1.45% GC)
  maximum time:     31.901 ms (75.30% GC)
  --------------
  samples:          576
  evals/sample:     1
julia> @benchmark my_sum($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.491 μs (0.00% GC)
  median time:      14.117 μs (0.00% GC)
  mean time:        23.058 μs (0.00% GC)
  maximum time:     417.783 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

These numbers are for single-threaded performance; my_sum is simply threaded_sum without the @threads annotation.
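For reference, my_sum as described is threaded_sum with the @threads loop and the per-thread bookkeeping removed. The exact code was not posted, so this is a reconstruction for illustration only:

```julia
# Sketch of my_sum as described: threaded_sum without @threads.
# (Reconstructed for illustration; the original code was not posted.)
function my_sum(arr)
    acc = zero(eltype(arr))
    @simd for i in 1:length(arr)
        @inbounds acc += arr[i]
    end
    acc
end
```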

Adding code_warntype(STDOUT, threadsfor_fun, (Bool,)) shows that the example function, which is type-stable without @threads, becomes https://gist.github.com/vchuravy/e60e0e63b79de59f92c95a5864c66f37.

The change in this PR introduces a let block for the range variable, eliminating almost all of the type instability (https://gist.github.com/vchuravy/0a4203674d02c92afee9b10b58756bbe) and drastically improving performance:

julia> @benchmark threaded_sum($data)
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  3
  --------------
  minimum time:     11.806 μs (0.00% GC)
  median time:      11.917 μs (0.00% GC)
  mean time:        12.687 μs (0.00% GC)
  maximum time:     260.019 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
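The effect of wrapping a captured variable in a let block can be illustrated in isolation. This is a minimal sketch of the underlying issue (#15276), not the actual @threads macro expansion:

```julia
# A closure captures the *binding* of a variable. If that binding is
# ever reassigned, the capture is boxed and type inference degrades.
function captured()
    r = 1:10
    f = () -> sum(r)   # captures the outer binding of `r`
    r = 1:20           # reassignment is visible inside the closure
    f()
end

# `let r = r` gives the closure a fresh binding that is never
# reassigned, so it stays concretely typed.
function with_let()
    r = 1:10
    f = let r = r
        () -> sum(r)
    end
    r = 1:20           # does not affect the let-bound `r`
    f()
end
```

Here captured() returns 210 (it sees the reassigned 1:20), while with_let() returns 55; the let-bound closure is also the inference-friendly form.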

@ararslan ararslan added the domain:multithreading Base.Threads and related functionality label Nov 21, 2017
@JeffBezanson JeffBezanson merged commit 7704b09 into master Nov 21, 2017
@JeffBezanson JeffBezanson deleted the vc/let_threaded branch November 21, 2017 22:49
@JeffBezanson JeffBezanson added the performance Must go faster label Nov 21, 2017
@tkoolen tkoolen mentioned this pull request May 16, 2018
ararslan pushed a commit that referenced this pull request May 16, 2018
ararslan pushed a commit that referenced this pull request May 27, 2018
@ceesb commented Jun 3, 2018

The newly introduced let makes my threaded code 2 times slower between 0.6.2 and 0.6.3, even when running with JULIA_NUM_THREADS=1. I have trouble debugging what happens here: code_warntype shows me nothing weird, and profiling my code on 0.6.2 versus 0.6.3 doesn't show a change in hotspots. How can I find out what's happening?

Just for my sanity I reverted this change in 0.6.3 and then the performance is as it was.

Maybe also useful to note: if I remove @threads from my loop altogether, I get the performance in 0.6.3 that I would expect from a single-threaded loop (which is the same as 0.6.2 with @threads and JULIA_NUM_THREADS=1).

@vchuravy (Sponsor Member, Author) commented Jun 3, 2018 via email

@ceesb commented Jun 4, 2018

I'm struggling to minimize it, but I attached something that runs by itself: standalone.zip.

Timings on my laptop:

For 0.6.3:
$ JULIA_NUM_THREADS=1 julia standalone.jl
12.961506 seconds (151.02 k allocations: 527.064 MiB, 2.17% gc time)

For 0.6.2:
$ JULIA_NUM_THREADS=1 julia standalone.jl
6.124962 seconds (158.29 k allocations: 527.286 MiB, 4.68% gc time)

I used @code_warntype on threadsfor_fun and indeed it looks much cleaner with the let block in 0.6.3; I just don't see what ruins the performance (in my experience this usually means something is ruining the cache). Any pointers are appreciated.

@vchuravy (Sponsor Member, Author) commented Jun 4, 2018

Do you see the same performance drop if you just remove @threads? You mentioned that your performance with JULIA_NUM_THREADS=1 also dropped.
How does the performance with JULIA_NUM_THREADS=1 and @threads compare to the performance without @threads?

@ceesb commented Jun 4, 2018

Without @threads on 0.6.2:
$ julia standalone.jl
6.187303 seconds (144.19 k allocations: 526.704 MiB, 4.49% gc time)

Without @threads on 0.6.3:
$ julia standalone.jl
6.147939 seconds (144.41 k allocations: 526.705 MiB, 4.41% gc time)

@ceesb commented Jun 5, 2018

OK, I think I found it. It looks like, without the let in Julia 0.6.2, because of all the Any types in threadsfor_fun, my state variable in the threaded loop was passed as a pointer (which is good):

function flushcache!(state::IncrementalCovarianceTiled)
  if state.cacheCount == 0
    return
  end

  Threads.@threads for y in 1:state.nrTilesY
    dothreadwork(state,y)
  end

  state.cacheCount = 0
  return
end

But due to the let and the resulting type propagation, it looks like a copy of state is made somewhere, ruining the cache friendliness of my code. I fixed it by wrapping state in a Ref:

function flushcache!(state::IncrementalCovarianceTiled)
  if state.cacheCount == 0
    return
  end

  stateref = Ref{IncrementalCovarianceTiled}(state)

  Threads.@threads for y in 1:state.nrTilesY
    dothreadwork(stateref,y)
  end

  state.cacheCount = 0
  return
end
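The workaround above amounts to a general pattern: pass shared state into the @threads closure through an explicit Ref, so the closure captures the container rather than the state binding itself. A minimal self-contained sketch; the struct and field names below are placeholders, not the code from standalone.zip:

```julia
using Base.Threads

# Placeholder state type standing in for IncrementalCovarianceTiled.
mutable struct TiledState
    tiles::Vector{Float64}
    nrTilesY::Int
end

function flush_tiles!(state::TiledState)
    # Capture a Ref to the state instead of the state binding itself.
    stateref = Ref(state)
    @threads for y in 1:state.nrTilesY
        s = stateref[]              # unwrap inside the loop body
        @inbounds s.tiles[y] += 1.0
    end
    return state
end
```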
