
Fast-track @threads when nthreads() == 1 #32181

Closed
wants to merge 1 commit

Conversation

@staticfloat (Sponsor Member) commented May 29, 2019

This avoids overhead when threading is disabled. Example benchmark:

using BenchmarkTools, Base.Threads, Test

function func(val, N)
    sums = [0*(1 .^ val) for thread_idx in 1:nthreads()]  # one accumulator per thread, with a zero of the same type as val
    for idx in 1:N
        sums[threadid()] += idx.^val
    end
    return sum(sums)
end

function func_threaded(val, N)
    sums = [0*(1 .^ val) for thread_idx in 1:nthreads()]
    @threads for idx in 1:N
        sums[threadid()] += idx.^val
    end
    return sum(sums)
end

# Ensure they all get the same answer
@test func(2.0, 1<<10) == func_threaded(2.0, 1<<10)

@show @benchmark func(2.0, 1<<10)
@show @benchmark func_threaded(2.0, 1<<10)

I run the benchmarks as:

for JULIA in julia-master ./julia; do
	for T in 1 2; do
		echo "$JULIA with $T threads:"
		JULIA_NUM_THREADS=$T $JULIA speedtest.jl
	done
done

Before this PR:

julia-master with 1 threads:
#= /Users/sabae/src/julia/speedtest.jl:22 =# @benchmark(func(2.0, 1 << 10)) = Trial(24.243 μs)
#= /Users/sabae/src/julia/speedtest.jl:23 =# @benchmark(func_threaded(2.0, 1 << 10)) = Trial(28.331 μs)
julia-master with 2 threads:
#= /Users/sabae/src/julia/speedtest.jl:22 =# @benchmark(func(2.0, 1 << 10)) = Trial(24.239 μs)
#= /Users/sabae/src/julia/speedtest.jl:23 =# @benchmark(func_threaded(2.0, 1 << 10)) = Trial(17.019 μs)

After this PR:

./julia with 1 threads:
#= /Users/sabae/src/julia/speedtest.jl:22 =# @benchmark(func(2.0, 1 << 10)) = Trial(24.254 μs)
#= /Users/sabae/src/julia/speedtest.jl:23 =# @benchmark(func_threaded(2.0, 1 << 10)) = Trial(24.257 μs)
./julia with 2 threads:
#= /Users/sabae/src/julia/speedtest.jl:22 =# @benchmark(func(2.0, 1 << 10)) = Trial(24.263 μs)
#= /Users/sabae/src/julia/speedtest.jl:23 =# @benchmark(func_threaded(2.0, 1 << 10)) = Trial(17.008 μs)

@staticfloat added the domain:multithreading (Base.Threads and related functionality) and performance (Must go faster) labels on May 29, 2019
@yuyichao (Contributor)

This check should not happen at macro expansion time.
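
A minimal, self-contained sketch of the distinction (the helper names and the chunking are illustrative, and Threads.@spawn stands in for the internal runtime handoff; this is not the actual Base implementation). The nthreads() check lives in the emitted code, so it is evaluated every time the loop runs rather than once while the macro expands:

```julia
using Base.Threads

function run_partitioned(body, range)
    # stand-in for the closure @threads generates: run `body` over one chunk of `range`
    function run_chunk(chunk, nchunks)
        len = length(range)
        lo  = div(len * (chunk - 1), nchunks) + 1
        hi  = div(len * chunk, nchunks)
        for i in lo:hi
            body(range[i])
        end
    end

    if nthreads() == 1
        run_chunk(1, 1)        # runtime fast path: no tasks, no handoff to the runtime
    else
        tasks = map(1:nthreads()) do c
            Threads.@spawn run_chunk(c, nthreads())   # stand-in for the threaded dispatch
        end
        foreach(wait, tasks)
    end
    return nothing
end

# usage
acc = Threads.Atomic{Int}(0)
run_partitioned(i -> Threads.atomic_add!(acc, i), 1:100)
acc[]   # 5050
```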

@raminammour (Contributor)

Hello,

The only way I have found to work around #15276 is to make sure that the code with @threads and nthreads() == 1 runs almost as fast as the non-threaded code. I even suspect that is the reason for the slowdown (@code_warntype shows a Core.Box) in your example.

Would this not make it harder to detect that the closure bug is preventing speedup in multi-threaded code?

Cheers!

@staticfloat (Sponsor Member, Author)

@raminammour while #15276 can be triggered by code like this (and is, in this case), this PR fixes something independent of that. Yes, the Box slows this down, and after changing this PR according to the feedback above we still trigger the Box problem and therefore don't get quite the same speedup; but we still get a decent amount of speedup, because we are eliminating a different source of slowdown.

If you want to test for #15276 style problems, I suggest you do a more direct test than relying on the fact that @threads introduces a closure that induces boxing; that may not be true forever (indeed I hope it's not!).
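
For instance, one rough way to check for the boxing directly is to look for Core.Box-typed slots in the typed code (a sketch only: it assumes code_typed populates slottypes, and the introspection details differ between Julia versions). Using the func/func_threaded definitions from the benchmark above:

```julia
function has_boxed_locals(f, argtypes)
    # CodeInfo for the first matching method; boxed captures show up as Core.Box slots
    ci = first(code_typed(f, argtypes)).first
    slots = ci.slottypes === nothing ? Any[] : ci.slottypes
    return any(T -> T === Core.Box, slots)
end

has_boxed_locals(func, (Float64, Int))            # expected: false
has_boxed_locals(func_threaded, (Float64, Int))   # expected: true while #15276 bites here
```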

I've updated the original PR message with new performance metrics and a slightly updated benchmark script.

@raminammour (Contributor)

Just FYI, the benchmarks on my system are different:


using BenchmarkTools, Base.Threads

function func(val)
    local sum = 0*(1 .^ val)
    for idx in 1:100
        sum += idx.^val
    end
    return sum
end

function func_threaded_let(val)
    local sum = 0*(1 .^ val)
    @threads for idx in 1:100
        let sum = sum
            sum += idx.^val
        end
    end
    return sum
end

function func_threaded(val)
    local sum = 0*(1 .^ val)
    @threads for idx in 1:100
        sum += idx.^val
    end
    return sum
end

@show @benchmark func(2.0)
@show @benchmark func_threaded(2.0)
@show @benchmark func_threaded_let(2.0);
versioninfo()

#= In[134]:30 =# @benchmark(func(2.0)) = Trial(2.333 μs)
#= In[134]:31 =# @benchmark(func_threaded(2.0)) = Trial(4.551 μs)
#= In[134]:32 =# @benchmark(func_threaded_let(2.0)) = Trial(2.568 μs)
Julia Version 1.0.2
Commit d789231e99 (2018-11-08 20:11 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

Hope this helps.

@jebej (Contributor) commented May 30, 2019

It seems most of the slowdown is due to the closure bug; I get:

#= REPL[11]:1 =# @benchmark(func(2.0)) = Trial(129.937 ns)

#= REPL[12]:1 =# @benchmark(func_threaded(2.0)) = Trial(3.716 μs)

#= REPL[13]:1 =# @benchmark(func_threaded_let(2.0)) = Trial(442.848 ns)

On v1.1.1.

@staticfloat (Sponsor Member, Author)

The let version that @raminammour posted does not calculate the same thing; it never stores the sum value:

julia> func(2.0)
338350.0

julia> func_threaded(2.0)
338350.0

julia> func_threaded_let(2.0)
0.0

If you look at the @code_native of func_threaded_let() versus func_threaded(), you will see that the _let() variant contains less than half as much code; because creating a new binding within the @threads block means the output is never used, the optimizer is able to get rid of a lot of work.
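
The shadowing is easy to see in isolation (a minimal REPL example):

```julia
julia> x = 1.0;

julia> let x = x      # new binding, initialized from the outer x
           x += 10    # updates only the inner binding
       end
11.0

julia> x              # the outer binding is untouched
1.0
```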

@raminammour (Contributor)

Sorry, my bad; here is a Ref version that does the right thing and is fast, as it avoids #15276. (I suspect the time would have been much faster if the whole calculation in the let version had been optimized away.) I may still be confused though :)


function func(val)
    sum = 0*(1 .^ val)
    for idx in 1:100
        sum += idx.^val
    end
    return sum
end

function func_threaded_ref(val)
    sum = Ref(0*(1 .^ val))
    @threads for idx in 1:100
        sum[] += idx.^val
    end
    return sum[]
end

function func_threaded(val)
    sum = 0*(1 .^ val)
    @threads for idx in 1:100
        sum += idx.^val
    end
    return sum
end

@btime func(2.0)
@btime func_threaded(2.0)
@btime func_threaded_ref(2.0);
versioninfo()

  2.520 μs (0 allocations: 0 bytes)
  4.623 μs (203 allocations: 3.20 KiB)
  2.440 μs (2 allocations: 64 bytes)
Julia Version 1.0.2
Commit d789231e99 (2018-11-08 20:11 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

@staticfloat (Sponsor Member, Author) commented May 30, 2019

Yes, that version does indeed remove some of the leftover performance gap; I'll include that in my benchmark above. But notice that you're still getting MUCH slower timings even in the serial case because you're using Julia 1.0; use Julia 1.1, or better yet master, and you'll see a large difference in performance. Your Ref change doesn't quite eliminate the speed difference, although it does get it closer:

Before this PR:

#= /Users/sabae/src/julia/speedtest.jl:27 =# @benchmark(func(2.0)) = Trial(81.144 ns)
#= /Users/sabae/src/julia/speedtest.jl:28 =# @benchmark(func_threaded(2.0)) = Trial(8.739 μs)
#= /Users/sabae/src/julia/speedtest.jl:29 =# @benchmark(func_threaded_ref(2.0)) = Trial(6.890 μs)

After this PR:

#= /Users/sabae/src/julia/speedtest.jl:27 =# @benchmark(func(2.0)) = Trial(81.176 ns)
#= /Users/sabae/src/julia/speedtest.jl:28 =# @benchmark(func_threaded(2.0)) = Trial(3.775 μs)
#= /Users/sabae/src/julia/speedtest.jl:29 =# @benchmark(func_threaded_ref(2.0)) = Trial(1.674 μs)

@staticfloat (Sponsor Member, Author)

@JeffBezanson, @yuyichao any further comments? I'm unsure why @threads with a single thread retains such a slowdown (I assume because of inference goblins), but the speedup here is not insignificant on its own.

@staticfloat (Sponsor Member, Author)

Pinging @JeffBezanson and @yuyichao again to see if there are any further comments; if not, I think we should merge this, as it's a straight performance win when using @threads with only one thread.

@yuyichao (Contributor)

How much speedup do you get if you use https://github.com/JuliaLang/julia/pull/21452/files#diff-7198cded2577e0bdeb563f0f2713347bR69 instead?

Also, that branch was somehow changed to use the latest world in #30838, presumably to be consistent with the threading branch. If that's the case and is intended, then this should do the same, which means it cannot be compiled / inlined ahead of time.

The threading branch was changed to use the latest world in https://github.com/JuliaLang/julia/pull/31398/files#diff-5a6699a5aa7cf07be50461e3c7f68262L693, and in particular the code was added in d9d8d4c. Why is that? It seems fairly different from both the previous semantics and the semantics of a normal for loop. Is it needed for something in particular? Otherwise, I don't really see the point of making this change.

Another reason calling the function is preferred is to make sure the semantics of the loop are actually the same. This means that the single-threaded @threads loop should have the same scope rules (in a function) and the same limitations as a normal one. This way you can actually test it with a single thread and be fairly sure that the code could run with multiple threads (short of other true thread-related bugs).

Also, is the nested thread loop hack still needed? Isn't the point of partr to get rid of it? (That was what I had in mind when adding that code anyway...)

@yuyichao (Contributor)

@vtjnash ^^^

@vtjnash (Sponsor Member) commented Jun 19, 2019

Yeah, the jl_threading_run function is now just a statically compiled copy of some Julia code. That's pretty slow (and awkward / hard to maintain), so we should move that function into Julia.
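
A rough sketch of that direction (assuming task-based scheduling via Threads.@spawn from later Julia versions; the real Base code pins tasks to threads and differs in detail), just to show the fan-out/join expressed in Julia rather than in a statically compiled C-side copy:

```julia
using Base.Threads

function threading_run_sketch(fun)
    # spawn one task per thread and wait for them all; stand-in for jl_threading_run
    tasks = [Threads.@spawn(fun()) for _ in 1:nthreads()]
    foreach(wait, tasks)
    return nothing
end
```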

@vchuravy (Sponsor Member) commented Jun 19, 2019

I agree that we shouldn't performance-hack the nthreads() == 1 case. I have found it pretty important to debug issues early on by being able to ask: is this threaded loop slow on a single thread? What happens if I remove it?

@staticfloat (Sponsor Member, Author)

How much speedup do you get if you use /pull/21452/files#diff-7198cded2577e0bdeb563f0f2713347bR69 instead?

I think you're asking what happens if I remove the invokelatest; the answer is that there is essentially no difference; it's lost in measurement noise. Nothing like the speedup in this PR.

The latest benchmarks:

master:

#= /home/sabae/src/julia/thread_test.jl:28 =# @benchmark(func(2.0)) = Trial(96.508 ns)
#= /home/sabae/src/julia/thread_test.jl:29 =# @benchmark(func_threaded(2.0)) = Trial(5.926 μs)
#= /home/sabae/src/julia/thread_test.jl:30 =# @benchmark(func_threaded_ref(2.0)) = Trial(4.131 μs)

sf/threads_fasttrack:

#= /home/sabae/src/julia/thread_test.jl:28 =# @benchmark(func(2.0)) = Trial(93.319 ns)
#= /home/sabae/src/julia/thread_test.jl:29 =# @benchmark(func_threaded(2.0)) = Trial(2.199 μs)
#= /home/sabae/src/julia/thread_test.jl:30 =# @benchmark(func_threaded_ref(2.0)) = Trial(523.874 ns)

I agree that we shouldn't performance hack the tid=1 case. I have found it pretty important to debug issues early on by being able to ask: is this threaded loop slow on a single thread? What happens if I remove it?

I think this is a very "internals perspective" viewpoint; you are using your deep knowledge of the compiler and its idiosyncrasies to debug code that interacts with these hidden parts of the compiler. From a user perspective, though, imposing a 40x slowdown (and that's while dodging the Ref/capturing issues; without dodging those it's an even worse slowdown) is pretty unacceptable.

This performance slowdown is the single reason NNlib had all of its @threads removed: when running single-threaded (which is the default for Julia), our networks run significantly faster without them. It's not worth punishing single-threaded users with a slow NNlib so that multi-threaded users can get a multicore speedup. It might be worth it if the overhead were something like 2x or 3x for these small loops, but 40x is just too much of a slowdown. (In NNlib we of course have loops with more work inside them than this benchmark, but the overhead is still far too large.)

This means that the single-threaded @threads loop should have the same scope rules (in a function) and the same limitations as a normal one. This way you can actually test it with a single thread and be fairly sure that the code could run with multiple threads (short of other true thread-related bugs).

I've updated this PR to more closely match the scope semantics of the other logic branches.

@yuyichao (Contributor)

I think you're asking what happens if I remove the invokelatest; the answer is that there is essentially no difference; it's lost in measurement noise. Nothing like the speedup in this PR.

No, I mean if you remove the invokelatest and use that branch.

@yuyichao (Contributor)

I've updated this PR to more closely match the scope semantics of the other logic branches.

It seems that it still doesn't have the same limitation as the other branches.

All I'm saying is that you should just use the invokelatest branch. I doubt the performance of that will be good enough as-is, but it's the only way you get identical semantics.
Then there's the question of whether the invokelatest is needed, which is what I'm asking @vtjnash about. If it is, then there's no way you can get better performance. If it isn't, then both the Julia code and the C code should switch away from it.

@staticfloat (Sponsor Member, Author) commented Jun 19, 2019

That's a good idea @yuyichao; I didn't quite realize that the nested-threading case was essentially the same as my "single-thread" case. Changing this to just use that branch when nthreads() == 1 gets the same level of performance. Eliminating invokelatest() does have an impact, but not much: it lowers the func_threaded_ref() time from 550 ns to 510 ns, a difference of ~8%.

@yuyichao (Contributor)

In that case this looks good enough as is and the necessity of the invokelatest is just a separate issue.

@vchuravy (Sponsor Member)

I think this is a very "internals perspective" viewpoint; you are using your deep knowledge of the compiler and its idiosyncrasies to debug code that interacts with these hidden parts of the compiler. From a user perspective, though, imposing a 40x slowdown (and that's while dodging the Ref/capturing issues; without dodging those it's an even worse slowdown) is pretty unacceptable.

I think it is exactly the other way around. I, as someone with an internals background and sufficient experience, can guess at why a @threads loop is slow when nthreads > 1.
How is someone without the knowledge supposed to figure out that they have a performance bottleneck? Also, it is not a 40x slowdown, since this is a constant start-up cost.

function func_threaded_ref(val, N)
    sum = Ref(0*(1 .^ val))
    @threads for idx in 1:N
        sum[] += idx.^val
    end
    return sum[]
end

function func_ref(val, N)
    sum = Ref(0*(1 .^ val))
    for idx in 1:N
        sum[] += idx.^val
    end
    return sum[]
end
julia> @btime func_ref(1, 1)
  1.268 ns (0 allocations: 0 bytes)
1

julia> @btime func_threaded_ref(1, 1)
  4.096 μs (10 allocations: 848 bytes)
1

julia> @btime func_threaded_ref(1, 10000)
  20.275 μs (10 allocations: 848 bytes)
50005000

julia> @btime func_ref(1, 10000)
  21.107 μs (0 allocations: 0 bytes)
50005000

Oh no! 4000x slow-down. For me, one of the strong suits of Julia is performance predictability, and I feel that by micro-optimizing this case we make the performance model of a threaded loop (that there is a start-up cost to pay) more opaque and less user-friendly.

@chethega (Contributor) commented Jul 5, 2019

This avoids overhead when threading is disabled. Example benchmark:

FWIW, this is not a good example, since for nthreads>1 it (a) produces wrong results and (b) is slow. Both are for the same reason: All threads try to read and write to the same memory location concurrently. This gives unpredictable (i.e. wrong) results and also makes your poor CPU weep when trying to synchronize caches between cores.

Single thread:

julia> @btime func_ref(1, 1<<20)
  3.876 ms (0 allocations: 0 bytes)
549756338176

julia> @btime func_threaded_ref(1, 1<<20)
  3.887 ms (10 allocations: 848 bytes)
549756338176

Two threads:

julia> @btime func_ref(1, 1<<20)
  3.876 ms (0 allocations: 0 bytes)
549756338176

julia> @btime func_threaded_ref(1, 1<<20)
  7.546 ms (17 allocations: 1.56 KiB)
239044531651

julia> @btime func_threaded_ref(1, 1<<20)
  7.531 ms (17 allocations: 1.56 KiB)
137439215616

Do we have an actually correct (deterministic) example with significant (not O(1)) slowdown due to @threads?

@staticfloat (Sponsor Member, Author)

@chethega Sure, let's push the example farther toward reality. I've updated the benchmarks at the top to (a) have a slightly larger workload (1024 items), (b) actually compute something correctly no matter how many threads are assigned to it, and (c) remove the Ref workaround for #15276, since it's not needed anymore now that I'm storing things within the sums array.

How is someone without the knowledge supposed to figure out that they have a performance bottleneck?

I think when someone is chasing performance, it's okay to expect them to do a little reading. We should not avoid fast paths just because a slow path exists; ideally we would simply have extremely fast work-division code and the 4 μs of constant overhead would not exist. Unfortunately, it does.

Also it is not a 40x slowdown since this is a constant start-up cost. ..... Oh no! 4000x slow-down.

I take your point ;) and I should not have used that kind of comparison when I'm explicitly talking about very small problem sizes. These arise in things like NNlib, where it's equally likely that I'm running a loop over 10M elements as it is that I'm running a loop over 10. In general, I agree that a better approach would be to have a way to have @threads conditionally execute based on problem size, but since this is such an obvious quick performance win (with a 1-line diff that changes no semantics), I don't see why it's controversial.
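
A sketch of the caller-level version of that idea, applied to the benchmark above (the threshold and the helper name are made up for illustration; no such switch exists in @threads itself):

```julia
using Base.Threads

function sum_powers(val, N; threshold = 10_000)
    sums = [zero(val) for _ in 1:nthreads()]   # one accumulator per thread
    if N < threshold || nthreads() == 1
        for idx in 1:N                         # small problem: skip the threading machinery
            sums[threadid()] += idx^val
        end
    else
        @threads for idx in 1:N                # large problem: the startup cost is amortized
            sums[threadid()] += idx^val
        end
    end
    return sum(sums)
end

sum_powers(2.0, 1 << 10)   # 3.584384e8 on either path
```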

@JeffBezanson (Sponsor Member)

Try #32477?

@chethega (Contributor) commented Jul 6, 2019

but since this is such an obvious quick performance win (with a 1-line diff that changes no semantics), I don't see why it's controversial.

Fair enough. Even with #32477, I see no objection to the current variant. I think @yuyichao's comment referred to a previous version that you force-pushed away.

@vchuravy (Sponsor Member) commented Jul 6, 2019

In general, I agree that a better approach would be to have a way to have @threads conditionally execute based on problem size, but since this is such an obvious quick performance win (with a 1-line diff that changes no semantics), I don't see why it's controversial.

I briefly thought that would be a good idea, but you can't make that judgement as part of the macro, since my problem size of 4 might be as work-intensive as your problem of size 10k.

I think when someone is chasing performance, it's okay to expect them to do a little reading. We should not avoid fast paths just because a slow path exists;

When I originally fixed #24688, it took a year and a half from when we first noticed some weirdness going on to when we nailed down the issue. If we had simply short-circuited the semantic behaviour of @threads (i.e. outlining this thunk), we would have simply shrugged and said: "threading performance is bad, we just need more cores". Nowadays the awareness of that issue is much higher, but that is kinda beside the point.

It looks like #32477 brings the overhead down to 1 μs instead of 4 μs? From my perspective, adding @threads should not be a free action, since it drastically changes the semantics of the code and you will have to rewrite the surrounding code to have the right semantics.

I can probably live with having a trigger/switch in the macro that enables the old behaviour. Then at least I can tell people: is your code with @threads force=true and nthreads() == 1 still slow?
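
A sketch of what such a call-site switch could look like (entirely hypothetical; neither this macro nor a force flag exists in Base.Threads):

```julia
using Base.Threads

macro threads_opt(force::Bool, loop)
    if force
        # always go through the full threading machinery, even with one thread,
        # so "is it still slow on a single thread?" remains a meaningful question
        return esc(:(Threads.@threads $loop))
    else
        return esc(loop)    # plain serial loop, for comparison
    end
end

# usage
acc = zeros(Int, nthreads())
@threads_opt true for i in 1:100
    acc[threadid()] += i
end
sum(acc)   # 5050
```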

@staticfloat (Sponsor Member, Author)

Even with #32477, I see no objection to the current variant.

With #32477 this optimization doesn't apply anymore; Jeff has removed the branch. I am content with only 1 μs of overhead; that is below my arbitrary threshold of performance anxiety. :)

@staticfloat closed this on Jul 6, 2019
@jebej (Contributor) commented Jul 23, 2019

Not sure where to put this, but I wanted to try the last benchmark functions, and surprisingly I do not get a speedup in the multi-threaded case (with 4 threads, on an i7-3770K, on Windows 7).

Julia 1.1.1

julia> @benchmark func(2.0, 1<<10)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     7.161 μs (0.00% GC)
  median time:      7.234 μs (0.00% GC)
  mean time:        7.448 μs (0.00% GC)
  maximum time:     31.932 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark func_threaded(2.0, 1<<10)
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  2
  --------------
  minimum time:     7.307 μs (0.00% GC)
  median time:      14.321 μs (0.00% GC)
  mean time:        15.243 μs (0.00% GC)
  maximum time:     7.044 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I also tried this on the new alpha; allocations are much higher there:

Julia 1.3.0 alpha

julia> @benchmark func(2.0, 1<<10)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     7.161 μs (0.00% GC)
  median time:      7.234 μs (0.00% GC)
  mean time:        7.470 μs (0.00% GC)
  maximum time:     19.656 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark func_threaded(2.0, 1<<10)
BenchmarkTools.Trial:
  memory estimate:  3.56 KiB
  allocs estimate:  30
  --------------
  minimum time:     11.399 μs (0.00% GC)
  median time:      18.413 μs (0.00% GC)
  mean time:        18.701 μs (0.00% GC)
  maximum time:     838.554 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
