Diminishing returns with increasing number of threads #200
How many CPUs do you have? It might be that once you are using all of your cores you start to lose performance from excess thread switching. |
@j-f1, the Ryzen 3700X has 8 cores and 16 threads. |
I know performance drops off quickly after 4 threads on the 8-core/16-thread machine I am using.
|
@RYucel, as you can see from the graph above, there is still a benefit of ~250 ms from increasing the number of threads from 4 to 6. Anything higher is indeed pointless. |
Yes, there must be some computational limitation, I guess. But anyway, it's very good in the end.
|
Yes, I observe the same behaviour on M1 Pro - 7 threads is the sweet spot. My explanation is that the computation becomes memory-bound at some point, so you stop gaining performance with more CPU power. It's the memory that limits us. |
I've been running some tests under Superluminal, and I believe I'm seeing some waste when running on multiple threads. The way ggml works, it spawns new threads for every graph computation. Each thread only lives for 2.7 ms (which is already worrying, as thousands of these threads are being spawned), but of that time only about 1 ms is spent on actual work; the rest is calls to synchronization routines. It looks like making these threads longer-lived and using a lighter synchronization mechanism should bring some nice perf gains here. |
Thanks for this analysis! Regarding the |
@savchenko Which model did you use and what was the duration of the audio segment used for testing? |
@debasish-mihup, |
@ggerganov I profiled it with FlameGraph on my Linux host. I'm not familiar with C++, but from the code I guess decreasing the thread count could help reduce the busy-waiting time. |
Does the length of the input affect the quality of the output? These are times from my CPU (AMD Ryzen 5 3600 with 6 cores / 12 threads) with different numbers of threads: Two parallel tasks: I suppose it should be possible to get much closer to the ideal time (779 815 ms / 12 threads = 64 984 ms). It would just require finding the right place to cut the original audio without splitting any word. Actually, skipping silent parts (audio gate) would also help. |
I tried to eliminate thread creation/joining in #343, but performance did not improve. My hypothesis is that mutex locks are actually very expensive - more expensive than creating and joining threads. But I'm not sure if I am correct. I agree that there is a lot of performance to be gained in the Decoder. The |
There's something very wrong with the multithreading support. |
Something like 80% of the total computation time is spent in ggml_graph_compute_thread calling atomic_load. |
I collected data on two of the many-core server systems I have in my lab, both aarch64. I used a Chinese audio file that is 73 seconds long, and tested the latest mainline build with a 5-bit quantized model:
The Huawei machine has 48 cores on an SoC, and the Ampere machine has 80 cores on an SoC. Neither has SMT. I ran a few trials and took the best time for each thread count. The best time on the Huawei was with 13 threads, and on the Ampere with 20 threads. The Ampere machine has large private L2 caches; when we bind the threads so the OS doesn't schedule them all over the place, we retain hot caches (for data and locks), which leads to better CPU usage - although that only matters in the region to the right, after we have already hit the minimum at 16 threads. Using 80 threads is twice as slow as using 16. Maybe there just isn't enough work to keep scaling past 16 threads? Are there knobs to partition the work at a coarser granularity per thread?
|
It seems like 7 threads is a sweet-spot after which performance starts decreasing:
Is this expected?