Diminishing returns with increasing number of threads #200
How many CPUs do you have? It might be that once you are using all of your cores you start to lose performance from excess thread switching. |
@j-f1, the Ryzen 3700X has 8 cores and 16 threads. |
I know performance drops off quickly after 4 threads on the 8-core/16-thread machine I am using.
|
@RYucel, as you can see from the graph above, there is still a benefit of ~250 ms from increasing the number of threads from 4 to 6. Anything higher is indeed pointless. |
Yes, there must be some computational limitation, I guess. But anyway, it's very good in the end.
|
Yes, I observe the same behaviour on M1 Pro - 7 threads is the sweet spot. My explanation is that the computation becomes memory-bound at some point, so you stop gaining performance with more CPU power. It's the memory that limits us. |
I've been running some tests under Superluminal, and I believe I'm seeing some waste when running on multiple threads. The way ggml works, it spawns new threads for every graph computation. Each thread only lives for 2.7 ms (which is already worrying, as thousands of these threads are being spawned), but of that time only about 1 ms is spent on actual work; the rest is calls to synchronization routines. It looks like making these threads longer-lived and using a lighter synchronization mechanism should bring some nice perf gains here. |
Thanks for this analysis! Regarding the |
@savchenko Which model did you use and what was the duration of the audio segment used for testing? |
@debasish-mihup, |
@ggerganov I profiled it with FlameGraph on my Linux host. I'm not familiar with C++, but from the code I guess decreasing the thread count could help reduce the busy-waiting time. |
Does the length of the input affect the quality of the output? These are times from my CPU (AMD Ryzen 5 3600 with 6 cores / 12 threads) with different numbers of threads: Two parallel tasks: I suppose it should be possible to get much closer to the ideal time (779 815 ms / 12 threads = 64 984 ms). It would just require finding the right place to cut the original audio without splitting any word. Actually, skipping silent parts (audio gate) would also help. |
I tried to eliminate thread creation/joining in #343, but performance did not improve. My hypothesis is that mutex locks are actually very expensive - more expensive than creating and joining threads. But I'm not sure if I am correct. I agree that there is a lot of performance to be gained in the Decoder. The |
There's something very wrong with the multithreading support. |
Something like 80% of the total computation time is spent in ggml_graph_compute_thread calling atomic_load. |
I collected data on two of the many-core server systems I have in my lab, both aarch64. I used a Chinese audio file that is 73 seconds long, and tested the latest mainline build with a 5-bit quantized model:
The Huawei machine has 48 cores on an SoC, and the Ampere machine has 80 cores on an SoC. Neither has SMT. I ran a few trials and took the best time for each thread count. The best time on the Huawei was with 13 threads, and on the Ampere with 20 threads. The Ampere machine has large private L2 caches; when we bind the threads so the OS doesn't schedule them all over the place, we retain hot caches (for data and locks), which leads to better CPU usage - although that only matters in the region to the right, after we have already hit the minimum at 16 threads. Using 80 threads is twice as slow as using 16. Maybe there just isn't enough work to keep scaling past 16 threads? Are there knobs to partition the work at a coarser granularity per thread?
|
It seems like 7 threads is a sweet-spot after which performance starts decreasing:
Is this expected?