No performance scaling in threading #17395
@JeffBezanson Julia is aware of the types of |
Are you deliberately benchmarking what happens when you don't put it in a function? Note that the times inside a function are two orders of magnitude faster:

```julia
using Base.Threads, BenchmarkTools

function test1!(y, x)
    @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end

function testn!(y, x)
    @assert length(y) == length(x)
    @threads for i = 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end
```

I have little experience playing with the threads feature, but one thing I've noticed is that my performance fluctuates a lot from session to session: on a laptop with 2 "real" cores and 2 threads, in some sessions I can get a nearly 2x improvement, while if I quit and restart julia I might get a 4x worsening. Quit & restart again, and perhaps I'm back to the 2x improvement. Weird. |
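As a usage sketch (not part of the original comment), assuming the definitions and the `using` line above are already loaded, one way to time the two versions with BenchmarkTools is:

```julia
# Arrays sized arbitrarily for illustration; `$` interpolation keeps the
# globals out of the measured code.
x = rand(10^6)
y = similar(x)
@btime test1!($y, $x)   # serial loop
@btime testn!($y, $x)   # threaded loop
```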
Are OSX and Linux (Ubuntu?) installed on the same machine? |
@timholy Of course, it's best not to use globals when benchmarking. However, this regression is not supposed to happen regardless of whether you put everything in functions or not.

```julia
using Base.Threads

function driver()
    println("Number of threads = $(nthreads())")
    x = rand(10^6)
    y = zeros(10^6)
    println("Warmup!")
    warmup(x, y)
    t1 = test1(x, y)
    t2 = test2(x, y)
    println("Serial time = $t1")
    println("Parallel time = $t2")
end

function warmup(x::Vector{Float64}, y::Vector{Float64})
    for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
end

function test1(x::Vector{Float64}, y::Vector{Float64})
    t1 = @elapsed for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @assert sum(y) == 10^6
    t1
end

function test2(x::Vector{Float64}, y::Vector{Float64})
    t2 = @elapsed @threads for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @assert sum(y) == 10^6
    t2
end

driver()
```

gives

And yes, threading performance is sometimes flaky. AFAIK, this is because of unresolved GC issues. |
@pkofod I first saw this on OSX and then verified it on a different Linux machine. The point I was trying to make was not the numbers themselves, but the fact that there seems to be no scaling. |
This seems to be an openlibm issue. Using the system libm here gives good performance scaling. |
Also note that the type of |
Script I use for benchmarking:

```julia
using Base.Threads

println("Number of threads = $(nthreads())")

# sin1(x::Float64) = ccall((:sin, Base.Math.libm), Float64, (Float64,), x)
# cos1(x::Float64) = ccall((:cos, Base.Math.libm), Float64, (Float64,), x)
sin1(x::Float64) = ccall(:sin, Float64, (Float64,), x)
cos1(x::Float64) = ccall(:cos, Float64, (Float64,), x)

function test1!(y, x)
    # @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

function testn!(y::Vector{Float64}, x::Vector{Float64})
    # @assert length(y) == length(x)
    Threads.@threads for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

n = 10^7
x = rand(n)
y = zeros(n)
@time test1!(y, x)
@time testn!(y, x)
@time test1!(y, x)
@time testn!(y, x)
```

With

With glibc libm

|
More typical timing (glibc libm and openlibm)
|
Interestingly, I can't reproduce this in C with either OpenMP or Cilk... |
glibc seems slower than openlibm serially but scales with threading... |
Hmm, interesting, on another machine I got:

```
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libopenlibm"
  3.561010 seconds
  1.213387 seconds (20 allocations: 640 bytes)
yuyichao% JULIA_NUM_THREADS=4 ./julia ../a.jl
Number of threads = 4
libm_name = "libm"
  2.449299 seconds
  0.853167 seconds (20 allocations: 640 bytes)
```

using

```julia
using Base.Threads

println("Number of threads = $(nthreads())")

# const libm_name = "libopenlibm"
const libm_name = "libm"
@show libm_name

sin1(x::Float64) = ccall((:sin, libm_name), Float64, (Float64,), x)
cos1(x::Float64) = ccall((:cos, libm_name), Float64, (Float64,), x)

@noinline function test1!(y, x)
    # @assert length(y) == length(x)
    for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

@noinline function testn!(y::Vector{Float64}, x::Vector{Float64})
    # @assert length(y) == length(x)
    Threads.@threads for i = 1:length(x)
        y[i] = sin1(x[i])^2 + cos1(x[i])^2
    end
    y
end

function run_tests()
    n = 10^7
    x = rand(n)
    y = zeros(n)
    test1!(y, x)
    testn!(y, x)
    @time for i in 1:10
        test1!(y, x)
    end
    @time for i in 1:10
        testn!(y, x)
    end
end

run_tests()
```

So both of the scales and |
For the record, I'm using LLVM 3.8.0 on both machines. And the C code I use is

```c
/* #include <cilk/cilk.h> */
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

double (*sin1)(double) = NULL;
double (*cos1)(double) = NULL;

uint64_t get_time(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
}

static inline double kernel(double x)
{
    double s = sin1(x);
    double c = cos1(x);
    return s * s + c * c;
}

__attribute__((noinline)) void test1(double *y, double *x, size_t n)
{
    asm volatile("":::"memory");
    for (size_t i = 0;i < n;i++) {
        y[i] = kernel(x[i]);
    }
    asm volatile("":::"memory");
}

/* __attribute__((noinline)) void testn(double *y, double *x, size_t n) */
/* { */
/*     asm volatile("":::"memory"); */
/*     cilk_for (size_t i = 0;i < n;i++) { */
/*         y[i] = kernel(x[i]); */
/*     } */
/*     asm volatile("":::"memory"); */
/* } */

__attribute__((noinline)) void testn2(double *y, double *x, size_t n)
{
    asm volatile("":::"memory");
#pragma omp parallel for
    for (size_t i = 0;i < n;i++) {
        y[i] = kernel(x[i]);
    }
    asm volatile("":::"memory");
}

__attribute__((noinline)) void run_tests(double *y, double *x, size_t n)
{
    uint64_t t_start = get_time();
    test1(y, x, n);
    uint64_t t_end = get_time();
    printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
    /* t_start = get_time(); */
    /* testn(y, x, n); */
    /* t_end = get_time(); */
    /* printf("time: %.3f ms\n", (t_end - t_start) / 1e6); */
    t_start = get_time();
    testn2(y, x, n);
    t_end = get_time();
    printf("time: %.3f ms\n", (t_end - t_start) / 1e6);
}

int main()
{
    void *hdl = dlopen("libm.so.6", RTLD_NOW);
    /* void *hdl = dlopen("libopenlibm.so", RTLD_NOW); */
    sin1 = (double(*)(double))dlsym(hdl, "sin");
    cos1 = (double(*)(double))dlsym(hdl, "cos");
    size_t n = 10000000;
    double *x = malloc(sizeof(double) * n);
    /* double *x = calloc(n, sizeof(double)); */
    double *y = calloc(n, sizeof(double));
    for (size_t i = 0;i < n;i++) {
        x[i] = 1;
    }
    run_tests(y, x, n);
    run_tests(y, x, n);
    return 0;
}
```

(the cilkplus version is commented out for clang....) |
I see good scaling once there is enough data to process to overcome the setup costs. |
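To illustrate that point (a sketch I'm adding, not code from the thread), one can sweep the problem size and compare serial vs. threaded times for the same kernel used throughout this thread; the function names here are my own:

```julia
using Base.Threads

function serial!(y, x)
    for i in 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end

function threaded!(y, x)
    @threads for i in 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end

for n in (10^3, 10^4, 10^5, 10^6, 10^7)
    x = rand(n)
    y = zeros(n)
    serial!(y, x); threaded!(y, x)   # warm up / compile both loops first
    ts = @elapsed serial!(y, x)
    tt = @elapsed threaded!(y, x)
    println("n = $n: serial = $ts s, threaded = $tt s, speedup = $(ts / tt)")
end
```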
I don't think the setup is the issue? It doesn't explain the difference between openlibm and the system libm (the system libm being faster with multiple threads), and there are already 10^6 elements in the test. |
@vtjnash Can you post the problem size and perf vs. thread count? Because I do not see that, with openlibm at least. With glibc, I see some nominal scaling. |
Ah, OK. Profiling is showing wide variation in the number of hits in the |
So this issue seems to be really system dependent. Here are my observations so far, but I'm not sure if others see the same thing. The scripts I used can be found here if others want to reproduce. TL;DR: the slowness I've seen on the two systems I've tested can be fully explained by a single-thread slowdown that can mysteriously disappear due to certain operations on the thread. Here are some of the main observations that lead to the conclusion above:
The effect also goes away when the loop is moved to C or when the |
I'm actually seeing performance that scales worse than serial on a problem. Is this the same issue? https://gist.github.com/ChrisRackauckas/6970aa6c3fa42c987b63dc9fe21c48fd |
It's unclear what this problem is right now. |
Perhaps this is the same issue showing up in #17751 |
Running the code from #17395 (comment) I get good scaling
Running: #17395 (comment)
I get worse performance with parallel than without (as did the OP), and running #17395 (comment)
I get the same performance with both (as did the OP). So all the results on my system seem more or less the same as the posted results. |
I tracked down the issue in my code to be due to |
Note that one of the tricks I've been using to work around #15276, which works pretty well for threaded loops, is to put just the loop itself in a separate function, and use a Ref if you need to update a local variable. |
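For concreteness, here is a minimal sketch of that pattern (the function and variable names are made up for illustration): hoist the threaded loop into its own function and pass a `Ref` for any local the loop needs to update.

```julia
using Base.Threads

# The threaded loop lives in its own small function so its closure only
# captures the function arguments; `total` is a Ref so the caller's
# "local" can be updated from inside the kernel.
function threaded_sum!(total::Ref{Float64}, x::Vector{Float64})
    partial = zeros(nthreads())           # one accumulator per thread
    @threads for i in 1:length(x)
        partial[threadid()] += x[i]
    end
    total[] = sum(partial)
    return total
end

function caller(x)
    total = Ref(0.0)                      # stands in for a local the loop must update
    threaded_sum!(total, x)
    return total[]
end
```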
Hi, is there any progress on this issue or any new workarounds that are not mentioned in this thread? I was porting some C++ code to Julia, but had to abandon the project since I got 1.5x scaling with 8 threads. I can also still reproduce the results mentioned earlier in this thread (where scaling close to 1.0 happens with 8 threads). Edit: also, maybe it's just my ignorance, but

```julia
julia> versioninfo()
Julia Version 0.6.0-dev.674
Commit 8cb72ee (2016-09-16 12:29 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> using Base.Threads

julia> function rfill!(A)
           for i in 1:length(A)
               A[i] = rand()
           end
       end

julia> function rfill_mt!(A)
           @threads for i in 1:length(A)
               A[i] = rand()
           end
       end

julia> function driver()
           A = zeros(64000000);
           println("num threads: $(nthreads())")
           rfill!(A);
           rfill_mt!(A);
           @time rfill!(A)
           @time rfill_mt!(A)
       end

julia> driver()
num threads: 8
  0.222452 seconds
  0.554084 seconds (2 allocations: 48 bytes)
```
|
I think rand isn't thread-safe, so that's not meaningful. The original bug here appears to be a hardware issue that could hit you at any time (even single-threaded) and in any code (even if you used asm instead of C++). |
For me at least it seems that the scaling is quite poor across a few benchmarks, not only the ones mentioned above. Also, it seems quite unfortunate that code using random numbers will run slower when used with multithreading. |
That's just because what's in Base is not thread safe. But RNG.jl is trying to make some thread-safe ones I believe. |
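One common workaround sketch (my own illustration, not an API from Base or from this thread) is to keep one RNG per thread and index it with `threadid()`; on the Julia versions discussed here `MersenneTwister` is available from Base (on later versions it lives in the Random stdlib), and the names below are hypothetical:

```julia
using Base.Threads

# One independently created RNG per thread (seeds here are arbitrary).
const rngs = [MersenneTwister(i) for i in 1:nthreads()]

function rfill_mt_rng!(A)
    @threads for i in 1:length(A)
        A[i] = rand(rngs[threadid()])   # each thread draws from its own RNG
    end
end
```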
So there appear to be multiple issues here, but at least we've made some progress on the issue I observed. To summarize, the issue I've seen is that there is a significant slowdown when calling This issue is actually caused by the slowdown from mixing AVX and SSE instructions and is made more confusing by a glibc bug that'll be fixed in the next release. The issue was mainly solved by going through the performance counters and seeing which ones show a significant difference between the good and the bad versions; many thanks to @Keno and @kpamnany for advice on the right counters/processor manual to look at.... Explanation of each feature:
Fixing this bug will probably involve fixing openlibm to use |
@ranjanan Just trying to see if there's anything left here that's not related to mixing AVX and SSE. I also can't figure out (or find) whether you get good scaling using the system libm instead.
If you see good scaling with There are two other issues in this thread, #17395 (comment) and #17395 (comment). Those should probably be separate issues if they are still not solved. |
Just to clarify things, I'm running the benchmark codes again right now. On running code from #17395 (comment), which is the original benchmark code, I get:
which shows no scaling. On running code from #17395 (comment),
Interestingly, both of these show ~2.4x scaling. But I suppose the difference between the two benchmark codes above is that there is 100x more data parallelism in the second one (size 10^7, and running it 10 times). Now addressing both your points:
I hope this summary helps. |
Can you run it twice? I get:

```julia
julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.012560089
Parallel time = 0.019910309

julia> driver()
Number of threads = 4
Warmup!
Serial time = 0.013349351
Parallel time = 0.004878786
```

There is additional warmup necessary for the threaded case (inference/compilation of the callback, I imagine).
We don't ship a libc (and the macOS libc is not glibc); the glibc question is about the Linux machine. According to Jameson, the macOS libc PLT callback uses XSAVE and shouldn't have this problem (though other things can still put it in a dirty state). Given that you don't see a difference adding |
Ah, you are right:
I modified my warmup code to include a threaded loop now.

```julia
function warmup(x::Vector{Float64}, y::Vector{Float64})
    for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @threads for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
end
```

But this doesn't seem to help:
Of course, if I call |
You need to run the exact same loop. The warm up time is the compilation of the threading callback. |
In this particular case, it is the exact same loop. Compare the threaded loop in the warmup code in the comment above with the following:

```julia
function test2(x::Vector{Float64}, y::Vector{Float64})
    t2 = @elapsed @threads for i = 1:10^6
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    @assert sum(y) == 10^6
    t2
end
```

I guess it also has to do with the compilation time of In which case, I suppose my warmup function isn't quite doing anything because calling |
It needs to be the same code on exactly the same line. What matters is the closure identity. |
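In other words (a sketch of the point I'm adding, with names of my own choosing, not code from the thread): the warmup only pays off if it executes the very same `@threads` closure that is later timed, for example by calling the same function twice:

```julia
using Base.Threads

function threaded_kernel!(y, x)
    @threads for i in 1:length(x)
        y[i] = sin(x[i])^2 + cos(x[i])^2
    end
    y
end

x = rand(10^6)
y = zeros(10^6)
threaded_kernel!(y, x)          # first call compiles this loop's threading closure
@time threaded_kernel!(y, x)    # second call times just the work
```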
x-ref discourse thread where users report bad scaling with https://discourse.julialang.org/t/thread-overhead-variability-across-machines/7320/5 |
My code is of the form
Replacing the last line with:
Fixes the type instability. Is there any way to swap the array pointers instead of the contents (Code 1 instead of Code 2) while maintaining type stability. |
My code is of the form
Remove the last line Using |
@SohamTamba these cases look more like #15276 and #24688 and can be worked around by using a let block.
|
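Since the original snippets aren't shown above, here is a generic sketch of the `let`-block workaround (the array names and kernel are hypothetical, not from the thread): rebind the variables that get reassigned so the `@threads` closure captures locals that are never reassigned.

```julia
using Base.Threads

function iterate!(A, B, nsteps)
    for step in 1:nsteps
        let A = A, B = B                    # fresh bindings: the closure below
            @threads for i in 1:length(A)   # captures locals that are never reassigned
                B[i] = 2 * A[i]             # placeholder kernel
            end
        end
        A, B = B, A                         # swap the array bindings, not the contents
    end
    return A
end
```

Because only the bindings are swapped at the end of each step, no array contents are copied, and the `let` keeps the threaded loop body type stable.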
Is there anything left for us to do here? |
Sorry, forgot to reply to this. That worked @vchuravy |
Consider the following code:
The output recorded on OSX is:
The output recorded on Linux is: