
No performance scaling in threading #17395

Closed
ranjanan opened this issue Jul 13, 2016 · 47 comments
Labels
multithreading (Base.Threads and related functionality), performance (Must go faster)

Comments

@ranjanan
Contributor

Consider the following code:

using Base.Threads
println("Number of threads = $(nthreads())")
x = rand(10^6)
y = zeros(10^6)
println("Warmup!")
for i = 1:10^6
    y[i] = sin(x[i])^2 + cos(x[i])^2
end
t1 = @elapsed for i = 1:10^6
    y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
t2 = @elapsed @threads for i = 1:10^6
    y[i] = sin(x[i])^2 + cos(x[i])^2
end
@assert sum(y) == 10^6
println("Serial time = $t1")
println("Parallel time = $t2")

The output recorded on OSX is:

Number of threads = 4
Warmup!
Serial time = 0.239838418
Parallel time = 0.282324893

The output recorded on Linux is:

Number of threads = 4
Warmup!
Serial time = 0.227327406
Parallel time = 0.542067206
@tkelman tkelman added the performance (Must go faster) and multithreading (Base.Threads and related functionality) labels Jul 13, 2016
@ranjanan
Contributor Author

@JeffBezanson Julia is aware of the types of x and y before the loop executes. So do you have any idea where this regression is coming from?

@timholy
Member

timholy commented Jul 13, 2016

Are you deliberately benchmarking what happens when you don't put it in a function? Note that the times inside a function are two orders of magnitude faster:

using Base.Threads, BenchmarkTools
function test1!(y, x)
    @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end
function testn!(y, x)
    @assert length(y) == length(x)
    @threads for i = 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end
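For completeness, a hypothetical way to time these two functions with BenchmarkTools (these calls are not part of the original comment; @benchmark and the $ interpolation come from BenchmarkTools):

x = rand(10^6);
y = zeros(10^6);
@benchmark test1!($y, $x)   # serial version
@benchmark testn!($y, $x)   # threaded version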

I have little experience playing with the threads feature, but one thing I've noticed is that my performance fluctuates a lot from session to session: on a laptop with 2 "real" cores and 2 threads, in some sessions I can get a nearly 2x improvement, while if I quit and restart julia I might get a 4x worsening. Quit & restart again, and perhaps I'm back to the 2x improvement. Weird.

@pkofod
Contributor

pkofod commented Jul 13, 2016

Are OSX and Linux (Ubuntu?) installed on the same machine?

@ranjanan
Contributor Author

@timholy Of course, it's best not to use globals when benchmarking. However, this regression is not supposed to happen regardless of whether you put everything in functions or not.

using Base.Threads
function driver()
    println("Number of threads = $(nthreads())")
    x = rand(10^6)
    y = zeros(10^6)
    println("Warmup!")
    warmup(x, y)
    t1 = test1(x, y)
    t2 = test2(x, y)
    println("Serial time = $t1")
    println("Parallel time = $t2")
end
function warmup(x::Vector{Float64}, y::Vector{Float64})
    for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
end
function test1(x::Vector{Float64}, y::Vector{Float64})
    t1 = @elapsed for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @assert sum(y) == 10^6
    t1
end
function test2(x::Vector{Float64}, y::Vector{Float64})
    t2 = @elapsed @threads for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @assert sum(y) == 10^6
    t2
end
driver()

gives

Number of threads = 4
Warmup!
Serial time = 0.039676539
Parallel time = 0.048867776

And yes, threading performance is flaky at times. AFAIK, this is because of unresolved GC issues.

@ranjanan
Contributor Author

@pkofod I first saw this on OSX and then verified it on a different Linux machine. The point I was trying to make was not the numbers themselves, but the fact that there seems to be no scaling.

@yuyichao
Contributor

#15874

@yuyichao
Contributor

This seems to be an openlibm issue. Using the system libm here gives good performance scaling.

@yuyichao
Contributor

Also note that the types of x and y in the loop are still an issue, but the dispatch cost seems to be much cheaper than the calculation.
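For instance (a hypothetical check, not from the original comment), inference for the testn! defined in the earlier comment can be inspected once y and x exist:

# Non-concrete types reported here hint at dynamic dispatch in the loop.
# Note that @threads compiles the loop body into a separate closure, so
# problems with the captured x and y may not all be visible at this level.
@code_warntype testn!(y, x)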

@yuyichao
Contributor

The script I use for benchmarking:

using Base.Threads

println("Number of threads = $(nthreads())")

# sin1(x::Float64) = ccall((:sin, Base.Math.libm), Float64, (Float64,), x)
# cos1(x::Float64) = ccall((:cos, Base.Math.libm), Float64, (Float64,), x)
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
cos1(x::Float64) = ccall(:cos, Float64, (Float64,), x)

function test1!(y, x)
    # @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

function testn!(y::Vector{Float64}, x::Vector{Float64})
    # @assert length(y) == length(x)
    Threads.@threads for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end
n = 10^7
x = rand(n)
y = zeros(n)
@time test1!(y, x)
@time testn!(y, x)
@time test1!(y, x)
@time testn!(y, x)

With openlibm, i.e. Base.Math.libm

Number of threads = 8
  1.521594 seconds (4.77 k allocations: 512.504 KB)
  0.338695 seconds (11.96 k allocations: 515.312 KB)
  0.377355 seconds (4 allocations: 160 bytes)
  0.392234 seconds (6 allocations: 224 bytes)

With glibc libm

Number of threads = 8
  0.421319 seconds (4.66 k allocations: 508.051 KB)
  0.246319 seconds (11.90 k allocations: 512.966 KB)
  0.562959 seconds (4 allocations: 160 bytes)
  0.098284 seconds (6 allocations: 224 bytes)

@yuyichao
Contributor

More typical timing (glibc libm and openlibm)

yyc2:~/projects/julia/master
yuyichao% JULIA_NUM_THREADS=4 ./julia --color=yes a.jl
Number of threads = 4
  0.515686 seconds (4.66 k allocations: 508.051 KB)
  0.159722 seconds (11.90 k allocations: 512.966 KB)
  0.429442 seconds (4 allocations: 160 bytes)
  0.145766 seconds (6 allocations: 224 bytes)
yyc2:~/projects/julia/master
yuyichao% JULIA_NUM_THREADS=4 ./julia --color=yes a.jl
Number of threads = 4
  1.530746 seconds (4.77 k allocations: 512.504 KB)
  0.445153 seconds (11.96 k allocations: 515.312 KB)
  0.282715 seconds (4 allocations: 160 bytes)
  0.418572 seconds (6 allocations: 224 bytes)

@yuyichao
Contributor

Interestingly, I can't reproduce this in C with either OpenMP or Cilk...

@ranjanan
Contributor Author

glibc seems slower than openlibm serially but scales with threading...

@yuyichao
Contributor

Hmm, interesting, on another machine I got:

yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libopenlibm"
  3.561010 seconds
  1.213387 seconds (20 allocations: 640 bytes)
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libm"
  2.449299 seconds
  0.853167 seconds (20 allocations: 640 bytes)

using this script:

using Base.Threads

println("Number of threads = $(nthreads())")

# const libm_name = "libopenlibm"
const libm_name = "libm"

@show libm_name

sin1(x::Float64) = ccall((:sin, libm_name), Float64, (Float64,), x)
cos1(x::Float64) = ccall((:cos, libm_name), Float64, (Float64,), x)

@noinline function test1!(y, x)
    # @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

@noinline function testn!(y::Vector{Float64}, x::Vector{Float64})
    # @assert length(y) == length(x)
    Threads.@threads for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

function run_tests()
    n = 10^7
    x = rand(n)
    y = zeros(n)
    test1!(y, x)
    testn!(y, x)
    @time for i in 1:10
        test1!(y, x)
    end
    @time for i in 1:10
        testn!(y, x)
    end
end

run_tests()

So both of them scale, and glibc is always faster.

@yuyichao
Contributor

For the record, I'm using LLVM 3.8.0 on both machines. The C code I use is:

/* #include <cilk/cilk.h> */
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

double (*sin1)(double) = NULL;
double (*cos1)(double) = NULL;

uint64_t get_time(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
}

static inline double kernel(double x)
{
    double s = sin1(x);
    double c = cos1(x);
    return s * s + c * c;
}

__attribute__((noinline)) void test1(double *y, double *x, size_t n)
{
    asm volatile("":::"memory");
    for (size_t i = 0;i < n;i++) {
        y[i] = kernel(x[i]);
    }
    asm volatile("":::"memory");
}

/* __attribute__((noinline)) void testn(double *y, double *x, size_t n) */
/* { */
/*     asm volatile("":::"memory"); */
/*     cilk_for (size_t i = 0;i < n;i++) { */
/*         y[i] = kernel(x[i]); */
/*     } */
/*     asm volatile("":::"memory"); */
/* } */

__attribute__((noinline)) void testn2(double *y, double *x, size_t n)
{
    asm volatile("":::"memory");
#pragma omp parallel for
    for (size_t i = 0;i < n;i++) {
        y[i] = kernel(x[i]);
    }
    asm volatile("":::"memory");
}

__attribute__((noinline)) void run_tests(double *y, double *x, size_t n)
{
    uint64_t t_start = get_time();
    test1(y, x, n);
    uint64_t t_end = get_time();
    printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
    /* t_start = get_time(); */
    /* testn(y, x, n); */
    /* t_end = get_time(); */
    /* printf("time: %.3f ms\n", (t_end - t_start) / 1e6); */
    t_start = get_time();
    testn2(y, x, n);
    t_end = get_time();
    printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
}

int main()
{
    void *hdl = dlopen("libm.so.6", RTLD_NOW);
    /* void *hdl = dlopen("libopenlibm.so", RTLD_NOW); */
    sin1 = (double(*)(double))dlsym(hdl, "sin");
    cos1 = (double(*)(double))dlsym(hdl, "cos");
    size_t n = 10000000;
    double *x = malloc(sizeof(double) * n);
    /* double *x = calloc(n, sizeof(double)); */
    double *y = calloc(n, sizeof(double));
    for (size_t i = 0;i < n;i++) {
        x[i] = 1;
    }
    run_tests(y, x, n);
    run_tests(y, x, n);
    return 0;
}

(the Cilk Plus version is commented out since I'm compiling with clang)

@vtjnash vtjnash changed the title No performance scaling in threading high overhead on @threads Jul 19, 2016
@vtjnash
Member

vtjnash commented Jul 19, 2016

I see good scaling once there is enough data to process to overcome the setup costs.

@yuyichao
Contributor

I don't think the setup cost is the issue. It doesn't explain the difference between openlibm and the system libm (the system libm being faster with multiple threads), and there are already 10^6 elements in the test.

@ranjanan
Contributor Author

ranjanan commented Jul 19, 2016

@vtjnash Can you post the problem size and performance vs. thread count? Because I do not see that, with openlibm at least. With glibc, I see some nominal scaling.

@vtjnash
Member

vtjnash commented Jul 19, 2016

Ah, OK. Profiling is showing wide variation in the number of hits in the sin and cos functions. Do you see scaling with hyperthreading disabled? From the assembly, it looks like this problem is pretty solidly avx-bound, so I wonder if this is some sort of resource-contention in the processor. But the question is: which resource?

@vtjnash vtjnash changed the title high overhead on @threads No performance scaling in threading Jul 19, 2016
@yuyichao
Contributor

So this issue seems to be really system dependent. Here are my observations so far, but I'm not sure if others see the same. The scripts I used can be found here if others want to reproduce.

TL;DR: the slowness I've seen on the two systems I've tested can be fully explained by a single-thread slowdown that can mysteriously disappear after certain operations on the thread.

Here are some of the main observations that lead to the conclusion above:

  1. Openlibm is slower (8x on Haswell, 2.3x on Skylake) before the first yield() after running openlibm functions.

  2. The multithreaded performance scales well (~2.5-3.3x on Haswell, ~3.8-3.9x on Skylake) for openlibm if the single-thread baseline uses the slow version (the timing before yield is called). Since it is not really possible to call yield on a worker thread, it's likely that the poor threading performance I see is just because the slowness is never "fixed" on the workers, only on the master thread.

  3. I reduced the yield() call to

    @noinline function yield2()
        current_task().state == :runnable || error()
    end

    which is not as reliable but can "fix" the performance on the master thread most of the time. This version is also thread safe (it doesn't use tasks or libuv), so it can be run on worker threads too. After running the function on each thread, the performance of the openlibm threading version improved by the expected factor.

The effect also goes away when the loop is moved to C or when the ccall is replaced by LLVM intrinsics, so it may be related to some detail of the execution history. However, perf stat suggests that instruction count, branch misprediction rate and cache miss rate are basically the same for the fast and slow versions, so it's unclear what CPU behavior changes. The code does not move, so it's hard to believe this is related to code alignment either.
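For reference, a hedged sketch of what "replacing the ccall by LLVM intrinsics" could look like (sin2/cos2 are made-up names, not the exact code used in the measurements above):

# These lower to LLVM intrinsics, so no external libm symbol (and no PLT entry)
# is involved in the loop body.
sin2(x::Float64) = ccall("llvm.sin.f64", llvmcall, Float64, (Float64,), x)
cos2(x::Float64) = ccall("llvm.cos.f64", llvmcall, Float64, (Float64,), x)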

@ChrisRackauckas
Member

I'm actually seeing threaded performance that scales worse than serial on a problem. Is this the same issue?

https://gist.github.com/ChrisRackauckas/6970aa6c3fa42c987b63dc9fe21c48fd

@yuyichao
Contributor

yuyichao commented Aug 8, 2016

It's unclear what this problem is right now.

@ViralBShah
Member

Perhaps this is the same issue showing up in #17751

@ViralBShah ViralBShah added this to the 0.5.x milestone Aug 8, 2016
@oxinabox
Contributor

oxinabox commented Aug 9, 2016

Running the code from #17395 (comment)

I get good scaling

ubuntu@tlaloc:~$ julia compare.jl
Number of threads = 8
libopenlibm
  3.518335 seconds
  0.617270 seconds (20 allocations: 800 bytes)
libm
  6.260513 seconds
  1.036404 seconds (20 allocations: 800 bytes)
libopenlibm
  3.487945 seconds
  0.569065 seconds (20 allocations: 800 bytes)

ubuntu@tlaloc:~$ julia compare2.jl
Number of threads = 8
libopenlibm
t1 = 0.515692918
t2 = 0.515126275

Running: #17395 (comment)

ubuntu@tlaloc:~$ julia op.jl
Number of threads = 8
Warmup!
Serial time = 0.413255659
Parallel time = 1.558858472

I get worse performance with parallel than without (as did the OP).


and running #17395 (comment)

ubuntu@tlaloc:~$ julia op_noglobal.jl
Number of threads = 8
Warmup!
Serial time = 0.042985187
Parallel time = 0.033117154

So roughly the same performance with both (as did the OP).

So all the results on my system seem more or less the same as the posted results.
So it is not an AMD vs Intel difference.

@ChrisRackauckas
Member

ChrisRackauckas commented Aug 23, 2016

I tracked the issue in my code down to eachindex(u) not being inferred to a concrete type when used inside Base.@threads. You can fix it by defining local uidx::Base.OneTo{Int64} = eachindex(u) and using that (you have to include the type annotation too). So yes, it is a different issue.
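A minimal sketch of that workaround (function and variable names are made up; u is assumed to be a Vector):

using Base.Threads

function scale!(du::Vector{Float64}, u::Vector{Float64})
    # Annotate the local so the closure generated by @threads sees a concrete type.
    local uidx::Base.OneTo{Int64} = eachindex(u)
    @threads for i in uidx
        du[i] = 2 * u[i]
    end
    du
end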

@yuyichao
Contributor

Note that one of the tricks I've been using to work around #15276, which works pretty well for threaded loops, is to put just the loop itself in a separate function, and use a Ref if you need to update a local variable.
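A hedged sketch of that pattern (all names are made up for illustration):

using Base.Threads

# The threaded loop lives in its own function, so the closure created by
# @threads only captures the function's arguments, not locals of the caller.
function kernel!(y, x, total::Ref{Float64})
    @threads for i in 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    total[] = sum(y)   # "update a local" of the caller through the Ref
    return y
end

function outer!(y, x)
    total = Ref(0.0)
    kernel!(y, x, total)
    return total[]
end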

@elanmart

elanmart commented Sep 16, 2016

Hi, is there any progress on this issue or any new workarounds that are not mentioned in this thread? I was porting some C++ code to Julia, but had to abandon the project since I got 1.5x scaling on 8 threads. I can also still reproduce the results mentioned earlier in this thread (where scaling close to 1.0x happens with 8 threads).

Edit: also, maybe it's just my ignorance, but @threads seems to slow down the code by a factor of 2 when calling rand()

julia> versioninfo()
Julia Version 0.6.0-dev.674
Commit 8cb72ee (2016-09-16 12:29 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> using Base.Threads

julia> function rfill!(A)
           for i in 1:length(A)
              A[i] = rand()
           end
       end

julia> function rfill_mt!(A)
           @threads for i in 1:length(A)
              A[i] = rand()
           end
       end

julia> function driver()
           A = zeros(64000000);
           println("num threads: $(nthreads())")
           rfill!(A);
           rfill_mt!(A);

           @time rfill!(A)
           @time rfill_mt!(A)
       end

julia> driver()
num threads: 8
  0.222452 seconds
  0.554084 seconds (2 allocations: 48 bytes)

@vtjnash
Member

vtjnash commented Sep 17, 2016

I think rand isn't thread-safe, so that's not meaningful. The original bug here appears to be a hardware issue that could hit you at any time (even single threaded) and in any code (even if you used asm instead of C++).

@elanmart

For me at least, scaling seems quite poor across a few benchmarks, not only the ones mentioned above. Also, it seems quite unfortunate that code using random numbers will run slower with multithreading.

@ChrisRackauckas
Member

That's just because what's in Base is not thread safe. But RNG.jl is trying to make some thread-safe ones I believe.
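(Not from this thread, but for illustration: a commonly suggested workaround is to give each thread its own RNG. A minimal sketch, with made-up names:)

using Base.Threads

const rngs = [MersenneTwister(i) for i in 1:nthreads()]   # one RNG per thread

function rfill_mt2!(A)
    @threads for i in 1:length(A)
        A[i] = rand(rngs[threadid()])
    end
    A
end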

@StefanKarpinski StefanKarpinski added and removed the help wanted label Oct 27, 2016
@yuyichao
Contributor

yuyichao commented Mar 1, 2017

So there appear to be multiple issues here, but at least we've made some progress on the issue I observed. To summarize, the issue I've seen is that there is a significant slowdown when calling libopenlibm functions on each thread unless some magic code is run on the thread after calling some openlibm functions. The issue does not appear with the system libm.

This issue is actually caused by the slowdown when mixing AVX and SSE instructions, and is made more confusing by a glibc bug that will be fixed in the next release. The issue was mainly tracked down by going through the performance counters and seeing which ones show a significant difference between the good and the bad versions; many thanks to @Keno and @kpamnany for advice on the right counters/processor manual to look at.

Explanation of each piece:

  1. At least on Skylake+, there's a significant and persistent cost when executing SSE instructions while the upper bits of the ymm registers are dirty. Here the SSE code is libopenlibm compiled for a generic x64 arch, and the cause of the dirty state is the confusing part.
  2. The "magic" code is actually vzeroupper or something else with the same effect. It puts the upper bits in a clean zero state and lets SSE instructions run fast.
  3. The per-thread state is the cleanness of the upper bits. It's not actually necessary to run libopenlibm functions on each thread before doing the magic, but the threads are likely started in a dirty state.
  4. The most confusing source of the dirty state is the glibc bug. libopenlibm uses PLT entries to call internal functions, and a PLT entry calls into the libc resolver callback the first time it is used in order to resolve the symbol address. This callback needs to save and restore all registers that can be used to pass arguments, so it uses AVX instructions to save and restore the ymm registers. This means the registers will be dirty upon returning from any PLT entry the first time it is used, and when this happens in a loop with SSE instructions, the whole loop will execute in the dirty state in the absence of a vzeroupper in the loop.

Fixing this bug will probably involve fixing openlibm to use VEX encoding/AVX instructions (dispatching at load time or runtime). We might also want to skip the PLT for openlibm-internal references.

@yuyichao yuyichao removed the help wanted Indicates that a maintainer wants help on an issue or pull request label Mar 1, 2017
@yuyichao
Contributor

yuyichao commented Mar 1, 2017

@ranjanan Just trying to see if there's anything left here that's not related to mixing AVX and SSE. I also can't tell (or find) whether you get good scaling using the system libm instead.

  1. Which Linux/glibc version are you using? Does it (or the hardware) support AVX?
  2. Does the performance issue go away if you put a Base.llvmcall("call void asm \"vzeroupper\", \"\"()\nret void", Void, Tuple{}) before each loop iteration?

If you see good scaling with vzeroupper and a new enough system libm, then we should rename this issue to be about solving the SSE/AVX mixing issue, in OpenLibm to start, though it might apply to other libs too.

There are two other issues in this thread, #17395 (comment) and #17395 (comment). Those should probably be filed as separate issues if they are still unsolved.

@ranjanan
Contributor Author

ranjanan commented Mar 1, 2017

@yuyichao:

Just to clarify things, I'm running the benchmark codes again right now. On running code from #17395 (comment), which is the original benchmark code, I get:

Number of threads = 4
Warmup!
Serial time = 0.020891164
Parallel time = 0.034003818

which shows no scaling.

On running code from #17395 (comment),
I get:

Number of threads = 4
libm_name = "libm"
  2.056310 seconds
  0.741611 seconds (20 allocations: 640 bytes)


Number of threads = 4
libm_name = "libopenlibm"
  2.193386 seconds
  0.930211 seconds (20 allocations: 640 bytes)

Interestingly, both of these show (~2.4x) scaling. But I suppose the difference between the two benchmarks above is that there is 100x more data parallelism in the second one (size 10^7, run 10 times).

Now addressing both your points:

  1. I wrote these tests primarily on my Mac, where I run macOS Sierra. I am on an Intel(R) Core(TM) i5-5257U CPU, which supports SSE4.1/4.2 and AVX 2.0 according to this. Also, by default, I'm using the glibc version that is shipped with Julia v0.5.0 on Mac.

  2. Now, I added that Base.llvmcall to every loop iteration, so the threads code becomes this:

    function test2(x::Vector{Float64}, y::Vector{Float64})                          
        t2 = @elapsed @threads for i = 1:10^6                                       
            Base.llvmcall("call void asm \"vzeroupper\", \"\"()\nret void", Void, Tuple{})
            y[i] = sin(x[i])^2 + cos(x[i])^2                                        
        end                                                                         
        @assert sum(y) == 10^6                                                      
        t2                                                                          
    end   

    Unfortunately, this does not make any difference to the benchmark results of #17395 (comment).

I hope this summary helps.

@yuyichao
Contributor

yuyichao commented Mar 1, 2017

On running code from #17395 (comment), which is the original benchmark code, I get:

Can you run it twice? I get:

julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.012560089
Parallel time = 0.019910309

julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.013349351
Parallel time = 0.004878786

There is additional warmup necessary for the threaded case (inference/compilation of the callback, I imagine).

I wrote these tests primarily on my Mac, where I run macOS Sierra. I am on an Intel(R) Core(TM) i5-5257U CPU, which supports SSE4.1/4.2 and AVX 2.0 according to this. Also, by default, I'm using the glibc version that is shipped with Julia v0.5.0 on Mac.

We don't ship libc (and it's not glibc on macOS); that glibc question is for the Linux machine. According to Jameson, the macOS libc PLT callback uses XSAVE and shouldn't have this problem (though other things can still put it in a dirty state). Given that you don't see a difference adding vzeroupper, you are probably not hitting the same issue as me.

@ranjanan
Contributor Author

ranjanan commented Mar 2, 2017

Ah, you are right:


julia> include("thread2.jl")
Number of threads = 4
Warmup!
Serial time = 0.024926309
Parallel time = 0.038902679

julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.023512779
Parallel time = 0.010175957

There is additional warmup necessary for the threaded case (inference/compilation of the callback, I imagine).

I modified my warmup code to include a threaded loop now.

function warmup(x::Vector{Float64}, y::Vector{Float64})                         
    for i = 1:10^6                                                              
        y[i] = sin(x[i])^2 + cos(x[i])^2                                        
    end                                                                         
    @threads for i = 1:10^6                                                     
        y[i] = sin(x[i])^2 + cos(x[i])^2                                        
    end                                                                         
end  

But this doesn't seem to help:

julia> include("thread2.jl") # `driver()` called inside
Number of threads = 4
Warmup!
Serial time = 0.034189366
Parallel time = 0.029591109

Of course, if I call driver() a second time, it shows the expected speedup. Do you know why this is happening?

@yuyichao
Contributor

yuyichao commented Mar 2, 2017

You need to run the exact same loop. The warm up time is the compilation of the threading callback.

@ranjanan
Contributor Author

ranjanan commented Mar 2, 2017

In this particular case, it is the exact same loop. Compare the threaded loop in the warmup code in the comment above with the following:

function test2(x::Vector{Float64}, y::Vector{Float64})                          
    t2 = @elapsed @threads for i = 1:10^6                                       
        y[i] = sin(x[i])^2 + cos(x[i])^2                                        
    end                                                                         
    @assert sum(y) == 10^6                                                      
    t2                                                                          
end

I guess it also has to do with the compilation time of test2, which contains the threaded loop. That is why, if you run test2 twice, or call it with smaller arrays first for warmup, you get the expected speedup on the second run.

In which case, I suppose my warmup function isn't really doing anything, because calling test1 and test2 the first time has compilation overhead too, maybe even more so in the case of test2 because of the compilation of the threading callback.

@yuyichao
Contributor

yuyichao commented Mar 2, 2017

In this particular case, it is the exact same loop

It needs to be the same code on the exact same line. What matters is the closure identity.
The warmup function shouldn't be needed anyway as long as the code is type stable. You only need to actually call something for warmup if that part is type unstable. In this case, the call to the closure isn't visible to the compiler, so that's what needs to be warmed up.
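Concretely (a hypothetical illustration of the point above, using the test2 from earlier in the thread):

test2(x, y)        # first call compiles test2 and the @threads closure inside it
t2 = test2(x, y)   # this call measures just the loop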

@ihnorton
Member

x-ref discourse thread where users report bad scaling with PROFILE_JL_THREADING = 1 (the default) due to their systems falling back from TSC to HPET:

https://discourse.julialang.org/t/thread-overhead-variability-across-machines/7320/5

@JeffBezanson JeffBezanson modified the milestones: 0.5.x, 1.x May 21, 2018
@SohamTamba

SohamTamba commented Jul 24, 2018

@yuyichao

My code is of the form

function F(g::AbstractGraph{T}) where T <: Integer
    n::Int64 = nthreads()
    cur::Vector{T} = zeros(T, n)
    next::Vector{T} = zeros(T, n)

    @threads for i in S
            DO SOMETHING with next and cur
    end

    next , cur = cur::Vector{T}, next::Vector{T}
end

Replacing the last line with:

    for t in 1:n
        next[t] , cur[t] = cur[t], next[t]
    end

Fixes the type instability.

Is there any way to swap the array pointers instead of the contents (Code 1 instead of Code 2) while maintaining type stability?

@SohamTamba

SohamTamba commented Jul 24, 2018

@yuyichao

My code is of the form

function F(g::AbstractGraph{T}) where T <: Integer
    level::T = one(T)
    @threads for i in S
            READ ONLY ACCESS TO level
    end

    level += one(T)
end

Removing the last line, level += one(T), fixes the type instability.

Using level = Ref(one(T)) also fixes this problem.

@vchuravy
Member

@SohamTamba these cases look more like #15276 and #24688 and can be worked around by using a let block.

function F(g::AbstractGraph{T}) where T <: Integer
    level = one(T)
    let level = level
        @threads for i in S
            READ ONLY ACCESS TO level
        end
    end

    level += one(T)
end

@KristofferC
Member

Is there anything left for us to do here?

@SohamTamba

Sorry, forgot to reply to this.

That worked @vchuravy
Thanks

@vchuravy vchuravy closed this as completed Oct 8, 2018