
wrap range variable in a let block for at_threads #24688

Merged · 1 commit · Nov 21, 2017
Conversation

vchuravy (Sponsor Member)

This works around performance issues I have been seeing due to #15276.

In particular, this function showed no performance scaling when I recently introduced threading to students:

function threaded_sum(arr)
    @assert length(arr) % nthreads() == 0
    results::Vector{Float64} = zeros(eltype(arr), nthreads())
    @threads for tid in 1:nthreads()
        # split work
        len = div(length(arr), nthreads())
        domain = ((tid-1)*len +1):tid*len
        acc = zero(eltype(arr))
        @simd for i in domain
            @inbounds acc += arr[i]
        end
        results[tid] = acc
    end
    sum(results)
end
julia> data = rand(100_000);
julia> @benchmark threaded_sum($data)
BenchmarkTools.Trial: 
  memory estimate:  6.09 MiB
  allocs estimate:  398994
  --------------
  minimum time:     7.674 ms (0.00% GC)
  median time:      8.117 ms (0.00% GC)
  mean time:        8.681 ms (1.45% GC)
  maximum time:     31.901 ms (75.30% GC)
  --------------
  samples:          576
  evals/sample:     1
julia> @benchmark my_sum($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.491 μs (0.00% GC)
  median time:      14.117 μs (0.00% GC)
  mean time:        23.058 μs (0.00% GC)
  maximum time:     417.783 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

These numbers are for single-threaded performance; my_sum is simply threaded_sum without the @threads annotation.
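For reference, my_sum as described is threaded_sum with the @threads loop and the per-thread bookkeeping removed. The exact code was not posted, so this is a reconstruction for illustration only:

```julia
# Sketch of my_sum as described: threaded_sum without @threads.
# (Reconstructed for illustration; the original code was not posted.)
function my_sum(arr)
    acc = zero(eltype(arr))
    @simd for i in 1:length(arr)
        @inbounds acc += arr[i]
    end
    acc
end
```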

Adding code_warntype(STDOUT, threadsfor_fun, (Bool,)) shows that the example function, which is type-stable without @threads, becomes https://gist.github.com/vchuravy/e60e0e63b79de59f92c95a5864c66f37.

The change in this PR introduces a let block for the range variable, eliminating almost all of the type instability (https://gist.github.com/vchuravy/0a4203674d02c92afee9b10b58756bbe) and drastically improving performance:

julia> @benchmark threaded_sum($data)
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  3
  --------------
  minimum time:     11.806 μs (0.00% GC)
  median time:      11.917 μs (0.00% GC)
  mean time:        12.687 μs (0.00% GC)
  maximum time:     260.019 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
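The effect of wrapping a captured variable in a let block can be illustrated in isolation. This is a minimal sketch of the underlying issue (#15276), not the actual @threads macro expansion:

```julia
# A closure captures the *binding* of a variable. If that binding is
# ever reassigned, the capture is boxed and type inference degrades.
function captured()
    r = 1:10
    f = () -> sum(r)   # captures the outer binding of `r`
    r = 1:20           # reassignment is visible inside the closure
    f()
end

# `let r = r` gives the closure a fresh binding that is never
# reassigned, so it stays concretely typed.
function with_let()
    r = 1:10
    f = let r = r
        () -> sum(r)
    end
    r = 1:20           # does not affect the let-bound `r`
    f()
end
```

Here captured() returns 210 (it sees the reassigned 1:20), while with_let() returns 55; the let-bound closure is also the inference-friendly form.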

@ararslan ararslan added the domain:multithreading Base.Threads and related functionality label Nov 21, 2017
@JeffBezanson JeffBezanson merged commit 7704b09 into master Nov 21, 2017
@JeffBezanson JeffBezanson deleted the vc/let_threaded branch November 21, 2017 22:49
@JeffBezanson JeffBezanson added the performance Must go faster label Nov 21, 2017
@tkoolen tkoolen mentioned this pull request May 16, 2018
ararslan pushed a commit that referenced this pull request May 16, 2018
ararslan pushed a commit that referenced this pull request May 27, 2018
@ceesb commented Jun 3, 2018

The newly introduced let makes my threaded code 2 times slower between 0.6.2 and 0.6.3, even when running with JULIA_NUM_THREADS=1. I have trouble debugging what happens here: code_warntype shows me nothing weird, and profiling my code on 0.6.2 versus 0.6.3 doesn't show a change in hotspots. How can I find out what's happening?

Just for my sanity I reverted this change in 0.6.3 and then the performance is as it was.

Maybe also useful to note: if I remove @threads from my loop altogether, I get the performance in 0.6.3 that I would expect from a single-threaded loop (which is the same as 0.6.2 with @threads and JULIA_NUM_THREADS=1).

@vchuravy (Sponsor Member, Author) commented Jun 3, 2018 via email

@ceesb commented Jun 4, 2018

I'm struggling to minimize it, but I attached something that runs by itself: standalone.zip.

Timings on my laptop:

For 0.6.3:
$ JULIA_NUM_THREADS=1 julia standalone.jl
12.961506 seconds (151.02 k allocations: 527.064 MiB, 2.17% gc time)

For 0.6.2:
$ JULIA_NUM_THREADS=1 julia standalone.jl
6.124962 seconds (158.29 k allocations: 527.286 MiB, 4.68% gc time)

I used @code_warntype on threadsfor_fun and indeed it looks much cleaner with the let block in 0.6.3; I just don't see what ruins the performance (in my experience this usually means something is ruining the cache). Any pointers are appreciated.

@vchuravy (Sponsor Member, Author) commented Jun 4, 2018

Do you see the same performance drop if you just remove @threads? You mentioned that your performance with JULIA_NUM_THREADS=1 also dropped.
How does the performance with JULIA_NUM_THREADS=1 and @threads compare to the performance without @threads?

@ceesb commented Jun 4, 2018

Without @threads on 0.6.2:
$ julia standalone.jl
6.187303 seconds (144.19 k allocations: 526.704 MiB, 4.49% gc time)

Without @threads on 0.6.3:
$ julia standalone.jl
6.147939 seconds (144.41 k allocations: 526.705 MiB, 4.41% gc time)

@ceesb commented Jun 5, 2018

OK, I think I found it. It looks like, without the let in Julia 0.6.2, because of all the Any types in threadsfor_fun, my state variable in the threaded loop was passed as a pointer (which is good):

function flushcache!(state::IncrementalCovarianceTiled)
  if state.cacheCount == 0
    return
  end

  Threads.@threads for y in 1:state.nrTilesY
    dothreadwork(state,y)
  end

  state.cacheCount = 0
  return
end

But due to the let and the resulting type propagation, it looks like a copy of state is made somewhere, ruining the cache friendliness of my code. I fixed it by wrapping state in a Ref:

function flushcache!(state::IncrementalCovarianceTiled)
  if state.cacheCount == 0
    return
  end

  stateref = Ref{IncrementalCovarianceTiled}(state)

  Threads.@threads for y in 1:state.nrTilesY
    dothreadwork(stateref,y)
  end

  state.cacheCount = 0
  return
end
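The workaround above amounts to a general pattern: pass shared state into the @threads closure through an explicit Ref, so the closure captures the container rather than the state binding itself. A minimal self-contained sketch; the struct and field names below are placeholders, not the code from standalone.zip:

```julia
using Base.Threads

# Placeholder state type standing in for IncrementalCovarianceTiled.
mutable struct TiledState
    tiles::Vector{Float64}
    nrTilesY::Int
end

function flush_tiles!(state::TiledState)
    # Capture a Ref to the state instead of the state binding itself.
    stateref = Ref(state)
    @threads for y in 1:state.nrTilesY
        s = stateref[]              # unwrap inside the loop body
        @inbounds s.tiles[y] += 1.0
    end
    return state
end
```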
