Threadpool: take 2 #8672

fmz · 2024-07-24T15:13:16Z

ref: original PR #7526

Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating different threadpools for various parts of execution, e.g. batch and single-token processing. Each threadpool can be created, paused, resumed, and released independently of any other threadpool. This mitigates the overhead of starting/stopping threads for each decode call and helps the OS keep track of scheduling history so that it can make better scheduling decisions.

Each threadpool supports:

- Setting the number of threads (duh)
- Setting a CPU mask for threads to be placed on
- Support for strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Support for polling/interrupt-driven wait
- Setting thread priority
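
To make the strict/relaxed distinction concrete, here is a minimal sketch in plain C of how a pool-wide CPU mask could be turned into a per-thread mask. It is not the code from this PR: `pick_thread_mask`, `MAX_CPUS` and the round-robin `cursor` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CPUS 64 /* hypothetical limit for this sketch */

/*
 * Fill `out` with the CPUs one worker may run on.
 *   relaxed: the worker inherits the whole pool mask and the OS picks a core.
 *   strict : the worker gets exactly one core; *cursor remembers where the
 *            previous worker stopped, so cores are handed out round-robin.
 * Returns false if the pool mask is empty.
 */
static bool pick_thread_mask(const bool pool_mask[MAX_CPUS], bool out[MAX_CPUS],
                             bool strict, int *cursor) {
    assert(*cursor >= 0);
    if (!strict) {
        memcpy(out, pool_mask, sizeof(bool) * MAX_CPUS);
        for (int i = 0; i < MAX_CPUS; i++) {
            if (pool_mask[i]) return true;
        }
        return false;
    }
    memset(out, 0, sizeof(bool) * MAX_CPUS);
    for (int n = 0; n < MAX_CPUS; n++) {
        int cpu = (*cursor + n) % MAX_CPUS;
        if (pool_mask[cpu]) {
            out[cpu] = true;                 /* pin to this single core */
            *cursor = (cpu + 1) % MAX_CPUS;  /* next worker starts after it */
            return true;
        }
    }
    return false;
}
```

Applying the resulting mask to a thread is OS-specific (on Linux, for example, via `pthread_setaffinity_np`), which is why the sketch stops at computing the mask.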

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
Users who choose to use threadpools explicitly have to manage them manually. See the example in main.cpp.
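
The create / pause / resume / release lifecycle described above can be sketched with plain pthreads. This is an illustration under stated assumptions, not the API added by this PR: every `demo_*` name is hypothetical, error handling is omitted, and workers sleep on a condition variable rather than polling.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical job signature: ith = worker index, nth = worker count. */
typedef void (*demo_fn)(void *arg, int ith, int nth);

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    pthread_t      *threads;
    int             n_threads;
    int             next_id; /* hands out worker indices */
    unsigned        gen;     /* bumped once per submitted job */
    int             n_done;  /* workers finished with the current job */
    bool            paused;
    bool            stop;
    demo_fn         fn;
    void           *arg;
} demo_pool;

static void *demo_worker(void *p) {
    demo_pool *pool = p;
    unsigned seen = 0;
    pthread_mutex_lock(&pool->mutex);
    int ith = pool->next_id++;
    for (;;) {
        /* Sleep until an unseen job exists (and we are not paused) or we must exit. */
        while (!pool->stop && (pool->paused || pool->gen == seen))
            pthread_cond_wait(&pool->cond, &pool->mutex);
        if (pool->stop) break;
        seen = pool->gen;
        demo_fn fn = pool->fn;
        void *arg = pool->arg;
        pthread_mutex_unlock(&pool->mutex);
        fn(arg, ith, pool->n_threads); /* run the job outside the lock */
        pthread_mutex_lock(&pool->mutex);
        pool->n_done++;
        pthread_cond_broadcast(&pool->cond);
    }
    pthread_mutex_unlock(&pool->mutex);
    return NULL;
}

static demo_pool *demo_pool_create(int n_threads) {
    assert(n_threads > 0);
    demo_pool *pool = calloc(1, sizeof(*pool));
    if (!pool) return NULL;
    pool->threads = calloc((size_t)n_threads, sizeof(pthread_t));
    if (!pool->threads) { free(pool); return NULL; }
    pool->n_threads = n_threads;
    pthread_mutex_init(&pool->mutex, NULL);
    pthread_cond_init(&pool->cond, NULL);
    for (int i = 0; i < n_threads; i++)
        pthread_create(&pool->threads[i], NULL, demo_worker, pool);
    return pool;
}

/* Pause (true) or resume (false). Do not submit a job while paused. */
static void demo_pool_set_paused(demo_pool *pool, bool paused) {
    pthread_mutex_lock(&pool->mutex);
    pool->paused = paused;
    pthread_cond_broadcast(&pool->cond);
    pthread_mutex_unlock(&pool->mutex);
}

/* Run fn once on every worker and wait until all of them are done. */
static void demo_pool_run(demo_pool *pool, demo_fn fn, void *arg) {
    pthread_mutex_lock(&pool->mutex);
    pool->fn = fn;
    pool->arg = arg;
    pool->n_done = 0;
    pool->gen++;
    pthread_cond_broadcast(&pool->cond);
    while (pool->n_done < pool->n_threads)
        pthread_cond_wait(&pool->cond, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
}

static void demo_pool_release(demo_pool *pool) {
    pthread_mutex_lock(&pool->mutex);
    pool->stop = true;
    pthread_cond_broadcast(&pool->cond);
    pthread_mutex_unlock(&pool->mutex);
    for (int i = 0; i < pool->n_threads; i++)
        pthread_join(pool->threads[i], NULL);
    pthread_cond_destroy(&pool->cond);
    pthread_mutex_destroy(&pool->mutex);
    free(pool->threads);
    free(pool);
}

/* Example job: each worker bumps its own slot in an int array. */
static void demo_mark(void *arg, int ith, int nth) {
    (void)nth;
    ((int *)arg)[ith]++;
}
```

The point of the sketch is that the threads are created once and reused across calls to `demo_pool_run`, which is the cost the description says the explicit API mitigates. In this condvar-only version idle workers already sleep, so pausing only gates new work; pausing matters more when workers poll.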

With all the bells and whistles enabled, we generally see a minor improvement vs OMP. Without polling, threadpool runs on par with OMP.
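
For context on the polling vs. interrupt-driven trade-off, a hybrid wait might look like the sketch below: spin on an atomic counter for a bounded number of rounds, then fall back to a condition variable. This is an assumption about the general technique, not the implementation in this PR; `demo_waiter`, `poll_rounds` and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_uint     gen;         /* bumped whenever new work is published */
    unsigned        poll_rounds; /* 0 = interrupt-driven only */
} demo_waiter;

static void demo_waiter_init(demo_waiter *w, unsigned poll_rounds) {
    assert(w != NULL);
    pthread_mutex_init(&w->mutex, NULL);
    pthread_cond_init(&w->cond, NULL);
    atomic_init(&w->gen, 0u);
    w->poll_rounds = poll_rounds;
}

/* Block until gen differs from `seen`; return the new value. */
static unsigned demo_wait_for_work(demo_waiter *w, unsigned seen) {
    /* Phase 1: poll. Low wake-up latency, but it burns a core while spinning. */
    for (unsigned i = 0; i < w->poll_rounds; i++) {
        unsigned g = atomic_load_explicit(&w->gen, memory_order_acquire);
        if (g != seen) return g;
    }
    /* Phase 2: sleep. The predicate is re-checked under the mutex, so a
     * wake-up that happens between the last poll and the wait is not lost. */
    unsigned g;
    pthread_mutex_lock(&w->mutex);
    while ((g = atomic_load_explicit(&w->gen, memory_order_acquire)) == seen)
        pthread_cond_wait(&w->cond, &w->mutex);
    pthread_mutex_unlock(&w->mutex);
    return g;
}

/* Publish work: the counter is bumped while holding the mutex that the
 * sleeping side uses, then all sleepers are woken. */
static void demo_publish_work(demo_waiter *w) {
    pthread_mutex_lock(&w->mutex);
    atomic_fetch_add_explicit(&w->gen, 1u, memory_order_release);
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}
```

With `poll_rounds` set to 0 the waiter is purely interrupt-driven; larger values trade CPU time for lower wake-up latency.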

fmz · 2024-07-24T16:34:49Z

Here are some perf figures:

On a W-2225 Xeon machine (CPU backend):

CPU	Model	Test	t/s master	t/s threadpool	Speedup
Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz	llama 7B Q4_0	pp512	17.46	17.51	1.00
Intel(R) Xeon(R) W-2225 CPU @ 4.10GHz	llama 7B Q4_0	tg128	6.98	7.06	1.01

Intel 10th-gen CPU:
./scripts/compare-commits.sh master threadpool -t 1,2,4,6,8,10

CPU	Model	Threads	Test	t/s master	t/s threadpool-attempt-2	Speedup
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	1	pp512	3.93	3.94	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	1	tg128	2.43	2.44	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	2	pp512	7.13	7.06	0.99
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	2	tg128	4.37	4.36	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	4	pp512	11.96	11.99	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	4	tg128	6.79	6.77	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	6	pp512	14.96	14.98	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	6	tg128	7.51	7.53	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	8	pp512	13.06	13.09	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	8	tg128	6.88	6.83	0.99
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	10	pp512	14.08	14.06	1.00
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz	llama 7B Q4_0	10	tg128	7.49	7.52	1.00

Mobile NVIDIA 3060:
$ LLAMA_CUDA=1 ./scripts/compare-commits.sh master threadpool -nkvo 0,1

GPU	Model	NKVO	Test	t/s master	t/s threadpool-attempt-2	Speedup
RTX 3060 Laptop GPU	llama 7B Q4_0	No	pp512	1644.73	1642.34	1.00
RTX 3060 Laptop GPU	llama 7B Q4_0	No	tg128	65.94	65.89	1.00
RTX 3060 Laptop GPU	llama 7B Q4_0	Yes	pp512	287.28	286.44	1.00
RTX 3060 Laptop GPU	llama 7B Q4_0	Yes	tg128	54.56	54.32	1.00

fmz · 2024-07-26T16:01:42Z

@slaren Threadpool is back!
I updated it a bit to align with the latest graph-compute design. The current performance is largely on par with OpenMP.
Please lmk if you have any comments/suggestions.

slaren · 2024-07-26T18:04:45Z

I tried to test this on macOS, but it seems to deadlock.

WARNING: ThreadSanitizer: data race (pid=62377)
  Write of size 1 at 0x00010ab02a8e by main thread:
    #0 ggml_graph_compute ggml.c:19365 (llama-bench:arm64+0x10003fb54)
    #1 ggml_backend_cpu_graph_compute ggml-backend.c:822 (llama-bench:arm64+0x1000a5f1c)
    #2 ggml_backend_graph_compute_async ggml-backend.c:282 (llama-bench:arm64+0x10009bac0)
    #3 ggml_backend_sched_compute_splits ggml-backend.c:1795 (llama-bench:arm64+0x1000a3190)
    #4 ggml_backend_sched_graph_compute_async ggml-backend.c:1979 (llama-bench:arm64+0x1000a2d24)
    #5 llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_compute_threadpool*) llama.cpp:14412 (llama-bench:arm64+0x100292cac)
    #6 llama_decode_internal(llama_context&, llama_batch) llama.cpp:14666 (llama-bench:arm64+0x1000fda4c)
    #7 llama_decode llama.cpp:18489 (llama-bench:arm64+0x1000fc460)
    #8 test_prompt(llama_context*, int, int, int, int) llama-bench.cpp:1319 (llama-bench:arm64+0x10062bd5c)
    #9 main llama-bench.cpp:1454 (llama-bench:arm64+0x100627180)

  Previous read of size 1 at 0x00010ab02a8e by thread T12 (mutexes: write M0):
    #0 ggml_graph_compute_check_for_work ggml.c:19152 (llama-bench:arm64+0x100053a10)
    #1 ggml_graph_compute_secondary_thread ggml.c:19189 (llama-bench:arm64+0x1000537dc)

  Location is heap block of size 192 at 0x00010ab02a00 allocated by main thread:
    #0 posix_memalign <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x564c0)
    #1 ggml_aligned_malloc ggml.c:241 (llama-bench:arm64+0x10001ac88)
    #2 ggml_create_threadpool_impl ggml.c:19214 (llama-bench:arm64+0x10003f14c)
    #3 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #4 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

  Mutex M0 (0x00010ab02a00) created at:
    #0 pthread_mutex_init <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x31470)
    #1 ggml_create_threadpool_impl ggml.c:19238 (llama-bench:arm64+0x10003f404)
    #2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

  Thread T12 (tid=36579987, running) created by main thread at:
    #0 pthread_create <null>:96112260 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3062c)
    #1 ggml_create_threadpool_impl ggml.c:19277 (llama-bench:arm64+0x10003f638)
    #2 ggml_create_threadpool ggml.c:19292 (llama-bench:arm64+0x10003f0b8)
    #3 main llama-bench.cpp:1418 (llama-bench:arm64+0x100626cd0)

SUMMARY: ThreadSanitizer: data race ggml.c:19365 in ggml_graph_compute

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fc00) at ggml.c:19132:5
    frame #2: 0x0000000104ba17ec llama-bench`ggml_graph_compute(cgraph=0x00000001182901b8, cplan=0x000000016b28a730) at ggml.c:19373:5
    frame #3: 0x0000000104be8400 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:822:12
    frame #4: 0x0000000104be23c4 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000039680b0, cgraph=0x00000001182901b8) at ggml-backend.c:282:12
    frame #5: 0x0000000104be6834 llama-bench`ggml_backend_sched_compute_splits(sched=0x0000000115000000) at ggml-backend.c:1795:35
    frame #6: 0x0000000104be65a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x0000000115000000, graph=0x0000000118420020) at ggml-backend.c:1979:12
    frame #7: 0x0000000104d09f44 llama-bench`llama_graph_compute(lctx=0x0000000114813e00, gf=0x0000000118420020, n_threads=12, threadpool=0x0000600003b6c3c0) at llama.cpp:14412:5
    frame #8: 0x0000000104c2b148 llama-bench`llama_decode_internal(lctx=0x0000000114813e00, batch_all=llama_batch @ 0x000000016b28ac60) at llama.cpp:14666:9
    frame #9: 0x0000000104c2a15c llama-bench`llama_decode(ctx=0x0000000114813e00, batch=llama_batch @ 0x000000016b28ad08) at llama.cpp:18489:21
    frame #10: 0x0000000104f3ecbc llama-bench`test_prompt(ctx=0x0000000114813e00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x0000000104f3ae44 llama-bench`main(argc=9, argv=0x000000016b28b940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x000000012481fe20) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x000000012481fe20) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820040) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820040) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820260) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820260) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820480) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820480) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x00000001248206a0) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248206a0) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248208c0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248208c0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820ae0) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820ae0) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124820d00) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820d00) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124820f20) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124820f20) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x0000000104baf664 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3018:29
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821140) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821140) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x0000000104baf690 llama-bench`ggml_barrier(threadpool=0x0000600003b6c3c0) at ggml.c:3019:54
    frame #1: 0x0000000104ba1a9c llama-bench`ggml_graph_compute_thread(data=0x0000000124821360) at ggml.c:19132:5
    frame #2: 0x0000000104baeb18 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821360) at ggml.c:19191:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821580) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821580) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248217a0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248217a0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001248219c0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001248219c0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104baecf4 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000124821be0) at ggml.c:19154:17
    frame #3: 0x0000000104baeaf8 llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000124821be0) at ggml.c:19189:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

fmz · 2024-07-26T20:18:05Z

> I tried to test this on macOS, but it seems to deadlock.

Fixed!

fmz · 2024-07-26T20:38:01Z

On M2 Max: (GGML_NO_METAL=1 GGML_NO_ACCELERATE=1)

Model	Threads	Test	t/s master	t/s threadpool	Speedup
llama 7B Q4_0	4	pp512	32.97	34.87	1.06
llama 7B Q4_0	4	tg128	18.01	18.37	1.02
llama 7B Q4_0	6	pp512	47.43	48.99	1.03
llama 7B Q4_0	6	tg128	23.10	23.32	1.01
llama 7B Q4_0	8	pp512	49.90	55.17	1.11
llama 7B Q4_0	8	tg128	18.09	21.98	1.22
llama 7B Q4_0	10	pp512	52.50	56.69	1.08
llama 7B Q4_0	10	tg128	14.24	8.54	0.60
llama 7B Q4_0	12	pp512	56.37	56.93	1.01
llama 7B Q4_0	12	tg128	5.02	9.44	1.88
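
For reference, the Speedup column appears to be the branch throughput divided by the master throughput, rounded to two decimals. A tiny helper (hypothetical name `speedup_x100`, returning hundredths to avoid floating-point comparisons) reproduces the figures above:

```c
#include <assert.h>

/* Speedup in hundredths: 60 means 0.60x, 188 means 1.88x. */
static long speedup_x100(double master_tps, double branch_tps) {
    assert(master_tps > 0.0);
    return (long)(branch_tps / master_tps * 100.0 + 0.5);
}
```

For the 10-thread tg128 row, 8.54 / 14.24 gives 0.60, and for the 12-thread tg128 row, 9.44 / 5.02 gives 1.88.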

fmz · 2024-07-26T21:06:50Z

Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler, AppleClang 15, doesn't support INT8 matmul?)

Model	Threads	Test	t/s master	t/s threadpool	Speedup
llama 8B Q4_0_4_4	4	pp512	72.44	72.83	1.01
llama 8B Q4_0_4_4	4	tg128	22.29	23.50	1.05
llama 8B Q4_0_4_4	6	pp512	98.71	100.21	1.02
llama 8B Q4_0_4_4	6	tg128	24.63	24.44	0.99
llama 8B Q4_0_4_4	8	pp512	95.86	116.17	1.21
llama 8B Q4_0_4_4	8	tg128	21.19	26.28	1.24
llama 8B Q4_0_4_4	10	pp512	102.37	105.18	1.03
llama 8B Q4_0_4_4	10	tg128	18.63	16.98	0.91
llama 8B Q4_0_4_4	12	pp512	108.08	101.18	0.94
llama 8B Q4_0_4_4	12	tg128	6.22	11.39	1.83

fmz · 2024-07-29T13:20:03Z

@slaren lmk if it works for you this time

slaren · 2024-07-31T14:01:07Z

I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
    frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
    frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
    frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
    frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
    frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
    frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
    frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
    frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
    frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

Built with LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32.
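
One reading of these backtraces: some threads are spinning in ggml_barrier while the rest are still parked in pthread_cond_wait. If the barrier counts the parked threads as participants, it cannot complete. The sketch below shows that failure mode with a generic counting spin barrier. It is an assumption about the pattern, not the actual ggml_barrier code; the `demo_*` names and the `max_spins` escape hatch are hypothetical and exist only so the hang can be observed safely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int        n_threads; /* participants the barrier expects */
    atomic_int n_arrived; /* threads that reached the barrier this round */
    atomic_int n_passed;  /* completed rounds */
} demo_barrier;

static void demo_barrier_init(demo_barrier *b, int n_threads) {
    assert(n_threads >= 1);
    b->n_threads = n_threads;
    atomic_init(&b->n_arrived, 0);
    atomic_init(&b->n_passed, 0);
}

/*
 * Returns true once all participants have arrived.
 * max_spins < 0 spins forever (the real-world behaviour);
 * max_spins >= 0 gives up after that many polls and returns false.
 */
static bool demo_barrier_wait(demo_barrier *b, long max_spins) {
    if (b->n_threads == 1) return true;
    int round = atomic_load(&b->n_passed);
    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store(&b->n_arrived, 0);
        atomic_fetch_add(&b->n_passed, 1);
        return true;
    }
    for (long i = 0; max_spins < 0 || i < max_spins; i++) {
        if (atomic_load(&b->n_passed) != round) return true;
    }
    return false; /* gave up: some participant never arrived */
}
```

With `n_threads` set to 2 and only one caller, `demo_barrier_wait(&b, 1000)` returns false, which is the bounded equivalent of the hang seen in the threads stuck at ggml_barrier.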

fmz · 2024-07-31T15:56:40Z

> I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fc00) at ggml.c:19133:5
    frame #2: 0x0000000102383190 llama-bench`ggml_graph_compute(cgraph=0x00000001085f81c8, cplan=0x000000016daa2730) at ggml.c:19374:5
    frame #3: 0x00000001023c3394 llama-bench`ggml_backend_cpu_graph_compute(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:822:12
    frame #4: 0x00000001023bd840 llama-bench`ggml_backend_graph_compute_async(backend=0x00006000025ec0b0, cgraph=0x00000001085f81c8) at ggml-backend.c:282:12
    frame #5: 0x00000001023c1864 llama-bench`ggml_backend_sched_compute_splits(sched=0x000000010680c400) at ggml-backend.c:1800:35
    frame #6: 0x00000001023c15a0 llama-bench`ggml_backend_sched_graph_compute_async(sched=0x000000010680c400, graph=0x00000001081c0020) at ggml-backend.c:1987:12
    frame #7: 0x00000001024e0b58 llama-bench`llama_graph_compute(lctx=0x000000010680fa00, gf=0x00000001081c0020, n_threads=12, threadpool=0x00006000027e43c0) at llama.cpp:14425:5
    frame #8: 0x0000000102404938 llama-bench`llama_decode_internal(lctx=0x000000010680fa00, batch_all=llama_batch @ 0x000000016daa2c60) at llama.cpp:14679:9
    frame #9: 0x0000000102403a9c llama-bench`llama_decode(ctx=0x000000010680fa00, batch=llama_batch @ 0x000000016daa2d08) at llama.cpp:18499:21
    frame #10: 0x0000000102712eac llama-bench`test_prompt(ctx=0x000000010680fa00, n_prompt=32, n_past=0, n_batch=2048, n_threads=12) at llama-bench.cpp:1319:9
    frame #11: 0x000000010270f0b8 llama-bench`main(argc=9, argv=0x000000016daa3940) at llama-bench.cpp:1454:13
    frame #12: 0x000000018fbae0e0 dyld`start + 2360
  thread #2
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x000000013681fe20) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x000000013681fe20) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820040) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820040) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820260) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820260) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820480) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820480) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368206a0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368206a0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x00000001368208c0) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368208c0) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820ae0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820ae0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #9
    frame #0: 0x00000001023900f8 llama-bench`ggml_barrier(threadpool=0x00006000027e43c0) at ggml.c:3050:29
    frame #1: 0x0000000102383440 llama-bench`ggml_graph_compute_thread(data=0x0000000136820d00) at ggml.c:19133:5
    frame #2: 0x000000010238f67c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820d00) at ggml.c:19192:37
    frame #3: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #10
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136820f20) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136820f20) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #11
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821140) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821140) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #12
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821360) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821360) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #13
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821580) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821580) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #14
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368217a0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368217a0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #15
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x00000001368219c0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x00000001368219c0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #16
    frame #0: 0x000000018fef99ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018ff3755c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000010238f830 llama-bench`ggml_graph_compute_check_for_work(state=0x0000000136821be0) at ggml.c:19155:17
    frame #3: 0x000000010238f65c llama-bench`ggml_graph_compute_secondary_thread(data=0x0000000136821be0) at ggml.c:19190:25
    frame #4: 0x000000018ff36f94 libsystem_pthread.dylib`_pthread_start + 136
  thread #17
    frame #0: 0x000000018fef7ea4 libsystem_kernel.dylib`__workq_kernreturn + 8

Built with LLAMA_DEBUG=1 GGML_NO_METAL=1 make llama-bench && ./llama-bench -m models/llama-2-7b/ggml-model-Q4_0.gguf -n 0 -r 1 -p 32.

Bummer...
Thanks for the details!
Looks like we have some trouble in the "ACCELERATE" path.
I'll fix it ASAP.

fmz · 2024-07-31T16:47:54Z

@slaren It turns out there was a corner case: if a graph has only 1 node, ggml_barrier and wait_for_work deadlock on each other.
I added a check to handle that specific case.

slaren · 2024-08-01T17:37:59Z

Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A threadpool would be more important on macOS, since OpenMP is not available there, but for me it is also slower on the M3 Max.

M3 Max:
GGML_NO_METAL=1 scripts/compare-commits.sh master threadpool -m models/llama-2-7b/ggml-model-Q4_0.gguf

CPU	Model	Model Size [GiB]	Test	t/s master	t/s threadpool	Speedup
M3 Max	llama 7B Q4_0	3.56	pp512	151.21	149.88	0.99
M3 Max	llama 7B Q4_0	3.56	tg128	30.06	26.09	0.87

13900k + 3090Ti:
OpenMP (GGML_CUDA=1 make llama-bench && ./llama-bench -nkvo 0,1)

model	size	params	backend	ngl	nkvo	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5699.53 ± 19.73
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	150.75 ± 1.23
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	651.63 ± 32.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	63.85 ± 3.22

Threadpool (GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench && ./llama-bench -nkvo 0,1)

model	size	params	backend	ngl	nkvo	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5453.33 ± 216.72
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	144.45 ± 0.98
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	566.43 ± 27.64
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	29.54 ± 0.99

build: bebe99c (3500)

fmz · 2024-08-01T17:50:51Z

Ooof...
That is quite a bit slower. I'll try to replicate this locally

max-krasnyansky · 2024-08-03T23:31:04Z

@fmz @slaren
I fixed one of the issues that was causing regressions. We were setting the default number of threads in the threadpool using std::thread::hardware_concurrency(). I updated that to use cpu_get_num_math(), so that we exclude E-Cores and Hyperthreading siblings.
This is what was causing regressions with the default command-line args, where the number of threads is not explicitly specified.
We were starting 12 threads on M2 Max, where only 8 cores are really usable; the same happened on AMD EPYC (using siblings) and Intel 13th/14th Gen (using E-Cores).
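
To make the difference between the two counts concrete, here is a minimal sketch. It does not reproduce how cpu_get_num_math() works internally: the `LogicalCpu` struct, `count_math_cpus`, and the two example topologies are hypothetical, and the topology is written by hand rather than queried from the OS.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical description of one logical CPU; real code would query the OS.
struct LogicalCpu {
    int  core_id;        // physical core this logical CPU belongs to
    bool is_efficiency;  // true for E-cores
};

// Count one thread per physical performance core:
// E-cores are skipped and SMT siblings (same core_id) are counted once.
int count_math_cpus(const std::vector<LogicalCpu> & cpus) {
    std::set<int> perf_cores;
    for (const LogicalCpu & cpu : cpus) {
        assert(cpu.core_id >= 0);
        if (!cpu.is_efficiency) {
            perf_cores.insert(cpu.core_id);
        }
    }
    return static_cast<int>(perf_cores.size());
}

// Assumed layout for the M2 Max case: 8 performance + 4 efficiency cores, no SMT.
std::vector<LogicalCpu> example_m2_max() {
    std::vector<LogicalCpu> cpus;
    for (int i = 0; i < 8;  ++i) cpus.push_back({i, false});
    for (int i = 8; i < 12; ++i) cpus.push_back({i, true});
    return cpus;
}

// Assumed layout for an SMT machine: 4 physical cores, 2 logical CPUs each.
std::vector<LogicalCpu> example_smt_4c8t() {
    std::vector<LogicalCpu> cpus;
    for (int core = 0; core < 4; ++core) {
        cpus.push_back({core, false});
        cpus.push_back({core, false});
    }
    return cpus;
}
```

For the 12-entry M2 Max layout this counts 8, whereas a plain logical-CPU count would give 12; for the 8-entry SMT layout it counts 4.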

I'm also working on another fix which is specific to llama-bench. Currently (in the threadpool branch) we start a single threadpool with max-num-threads and reuse it for each test. For example, a test may use only 4 threads while we start 12 (on M2 Max or Snapdragon X-Elite).
This is suboptimal because the spinning threads interfere with core boosting, among other things. It's better to start a fresh threadpool for each test.

max-krasnyansky · 2024-08-04T01:52:47Z

@fmz @slaren
llama-bench has been updated as I described above.

Here are the numbers from M2 Max.
I'll share numbers for an AMD EPYC server, Snapdragon X-Elite and Gen-3 a bit later.

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" GGML_NO_METAL=1 GGML_NO_ACCELERATE=1 \
    ./scripts/compare-commits.sh master threadpool -m ../gguf/llama-v3.1.q4_0_4_8.gguf -ngl 0 -t 4,6,8
...
+ ./scripts/compare-llama-bench.py -b master -c threadpool
| CPU   | Model             |   Threads | Test   |   t/s master |   t/s threadpool |   Speedup |
|:------|:------------------|----------:|:-------|-------------:|-----------------:|----------:|
|       | llama 8B Q4_0_4_8 |         4 | pp512  |        64.43 |            64.52 |      1.00 |
|       | llama 8B Q4_0_4_8 |         4 | tg128  |        22.53 |            24.36 |      1.08 |
|       | llama 8B Q4_0_4_8 |         6 | pp512  |        89.79 |            91.04 |      1.01 |
|       | llama 8B Q4_0_4_8 |         6 | tg128  |        24.73 |            26.21 |      1.06 |
|       | llama 8B Q4_0_4_8 |         8 | pp512  |       117.14 |           118.67 |      1.01 |
|       | llama 8B Q4_0_4_8 |         8 | tg128  |        26.11 |            26.37 |      1.01 |

slaren · 2024-08-08T22:34:39Z

The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes the default build in macOS, which uses Accelerate). I suspect that this is because the threads are spinning in ggml_graph_compute_check_for_work while the BLAS backend is running. This will also cause the threads to spin while the GPU backend is running when partially offloading, which would be a regression. Rather than requiring the user to disable polling manually, I suggest implementing some kind of backoff and yielding the threads after spinning for a while. The BLAS backend (ggml-blas.cpp) may also benefit from using the threadpool, since it launches several threads to dequantize the weights, and it could also automatically pause the pool during the call to the BLAS library.

Results

GGML_CUDA=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	nkvo	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5689.32 ± 13.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	154.53 ± 1.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	643.28 ± 31.69
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	64.27 ± 2.21

build: 267bf57 (3554)

GGML_CUDA=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -nkvo 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	nkvo	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5674.51 ± 37.77
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	153.30 ± 0.48
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	646.42 ± 32.41
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	62.98 ± 2.94

build: 267bf57 (3554)

GGML_BLIS=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	BLAS	16	pp128	47.55 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	BLAS	16	tg32	20.79 ± 0.10

build: 267bf57 (3554)

GGML_BLIS=1 GGML_NO_OPENMP=1 make llama-bench > /dev/null && ./llama-bench -p 128 -n 32

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	BLAS	16	pp128	33.47 ± 0.48
llama 7B Q4_0	3.56 GiB	6.74 B	BLAS	16	tg32	20.58 ± 0.07

build: 267bf57 (3554)

CPU	Model	Threads	Test	t/s master	t/s threadpool	Speedup
M3 Max	llama 7B all F32	4	pp512	150.03	134.77	0.90
M3 Max	llama 7B all F32	4	tg128	4.76	4.20	0.88
M3 Max	llama 7B all F32	8	pp512	155.66	115.40	0.74
M3 Max	llama 7B all F32	8	tg128	4.76	4.35	0.91
M3 Max	llama 7B all F32	12	pp512	156.19	94.43	0.60
M3 Max	llama 7B all F32	12	tg128	4.66	4.33	0.93
M3 Max	llama 7B Q4_0	4	pp512	142.43	144.89	1.02
M3 Max	llama 7B Q4_0	4	tg128	21.04	20.74	0.99
M3 Max	llama 7B Q4_0	8	pp512	150.08	142.22	0.95
M3 Max	llama 7B Q4_0	8	tg128	28.22	28.14	1.00
M3 Max	llama 7B Q4_0	12	pp512	150.55	120.62	0.80
M3 Max	llama 7B Q4_0	12	tg128	30.10	30.26	1.01
M3 Max	stories260K	4	pp512	52491.62	65492.68	1.25
M3 Max	stories260K	4	tg128	8417.80	12262.68	1.46
M3 Max	stories260K	8	pp512	59893.07	94300.47	1.57
M3 Max	stories260K	8	tg128	3746.70	5639.87	1.51
M3 Max	stories260K	12	pp512	53756.90	115958.90	2.16
M3 Max	stories260K	12	tg128	2507.28	4333.34	1.73

max-krasnyansky · 2024-08-09T00:18:59Z

@slaren

Awesome! Thanks for checking out the latest. We've been doing lots of profiling and tuning.
Every time I'm about to send an updated perf report on Snapdragons and M2 I find yet another thing to improve :)
In my testing we're doing really well with the CPU backend (especially on the ARM64-based systems). With other backends, as you pointed out, the spinning threads get in the way at times and cause regressions.
I'll try your suggestions.

Btw, we might just flip the default back to non-polling. Technically, polling is only useful for llama-bench, to match OpenMP behavior/numbers in that case. When I looked at the original profiles, I saw that the threadpool was doing a lot more context switches than OpenMP during the token-gen test. Polling removes those context switches and we get even better numbers now.
It might make sense to make that a bit of a special case (i.e. default to polling for the CPU backend bench, otherwise default to non-polling) or to use some hybrid approach as you suggested.
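
One possible shape for such a hybrid wait is sketched below: poll for a bounded number of rounds, then block on a condition variable. This is only an illustration of the spin-then-sleep idea, not the code in this PR; `WorkSignal`, `post_work`, `hybrid_wait`, the spin budget and the timeout are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical per-pool signal; not the actual ggml threadpool struct.
struct WorkSignal {
    std::atomic<bool>       has_work{false};
    std::mutex              mutex;
    std::condition_variable cond;
};

// Producer side: publish work under the lock, then wake any sleeping waiter.
void post_work(WorkSignal & s) {
    {
        std::lock_guard<std::mutex> lock(s.mutex);
        s.has_work.store(true, std::memory_order_release);
    }
    s.cond.notify_all();
}

// Consumer side: poll up to spin_rounds times, yielding between checks,
// then stop burning CPU and block until work is posted or the timeout expires.
// Returns true if work arrived, false on timeout.
bool hybrid_wait(WorkSignal & s, int spin_rounds, std::chrono::milliseconds timeout) {
    assert(spin_rounds >= 0);
    for (int i = 0; i < spin_rounds; ++i) {
        if (s.has_work.load(std::memory_order_acquire)) {
            return true;            // fast path: no blocking call needed
        }
        std::this_thread::yield();  // simple backoff between polls
    }
    std::unique_lock<std::mutex> lock(s.mutex);
    return s.cond.wait_for(lock, timeout, [&s] {
        return s.has_work.load(std::memory_order_acquire);
    });
}
```

With a large spin budget this behaves like pure polling, and with a budget of 0 it is fully interrupt-driven, so a single knob would cover both defaults discussed above.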

ggml/src/ggml.c

This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directly.

All threadpool related functions and structs use ggml_threadpool prefix.

max-krasnyansky · 2024-08-28T05:21:15Z

@slaren
Most of your comments & suggestions have been addressed.
The GGML API has been further cleaned up and simplified. Threadpool switching is now transparent: we switch on ggml_backend_cpu_set_threadpool(). Threadpool params have nice defaults.

Process priority setting has been moved into a helper function in common/common.cpp and only called from the sample apps.

src/llama.cpp looks much simpler now: just a few lines of extra code that select the threadpool based on the number of tokens, same as selecting n_threads, and a couple of API calls to attach threadpools. Those are just passthrough (i.e. used to pass the threadpool to ggml_graph_compute()).
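
As a rough sketch of what that selection can look like: the types below are hypothetical stand-ins rather than the real llama.cpp structs, and the rule "exactly one token means single-token decode, anything else is a batch" is an assumption made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real types; only the selection logic matters.
struct Threadpool {
    int n_threads;
};

struct Context {
    int32_t      n_threads;         // threads for single-token decode
    int32_t      n_threads_batch;   // threads for batch (prompt) processing
    Threadpool * threadpool;        // pool attached for single-token decode
    Threadpool * threadpool_batch;  // pool attached for batch processing
};

struct Selection {
    int32_t      n_threads;
    Threadpool * threadpool;        // nullptr if no pool was attached
};

// Pick the thread count and the pool with the same rule, so they always match.
// Assumption: n_tokens == 1 is treated as single-token decode.
Selection select_for_tokens(const Context & ctx, uint32_t n_tokens) {
    assert(n_tokens > 0);
    const bool single = (n_tokens == 1);
    return Selection{
        single ? ctx.n_threads  : ctx.n_threads_batch,
        single ? ctx.threadpool : ctx.threadpool_batch,
    };
}
```

A null pool in the result would correspond to the "no threadpool attached" case from the PR description, where a disposable pool is created instead.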

As I mentioned above, llama-bench needs to explicitly manage threadpool creation because cpu-mask and related options are now vectors (i.e. test params), as you suggested earlier. I included some examples of the output above. It's really neat how it can be used to figure out the best CPU pinning. (Unfortunately, this breaks compare-commits.sh for now because I added extra fields and the SQL tables don't match between branches.)

Performance looks pretty good across the board, see the report below (llama-v2-115M on key platforms).
It looks like the partial offload case (-ngl 10) on M2 Max is doing OK now.

We can iterate further on the automatic threadpool creation and reuse. I suggest we do that in Threadpool-V3 though, after we factor out thread/cpu/numa stuff into ggml-thread.cpp.

M2 Max

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" make -j

(venv) ~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	pp512	9421.03 ± 97.37
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	tg128	817.88 ± 2.58
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	pp512	46534.56 ± 1708.85
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	tg128	1167.67 ± 8.06
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	pp512	46993.95 ± 1904.04
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	tg128	1169.23 ± 9.39

build: 3246fe8 (3637)

(venv) ~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	pp512	9543.05 ± 50.75
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	tg128	1003.64 ± 4.13
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	pp512	47665.38 ± 1765.27
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	tg128	1165.35 ± 8.50
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	pp512	46802.22 ± 2089.11
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	tg128	1162.56 ± 6.41

build: c6328bc (3677)

Ryzen 9 3950X + RTX 3080

GGML_CUDA=1 make -j
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	pp512	44574.85 ± 218.52
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	tg128	811.77 ± 4.67
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	pp512	144896.09 ± 446.62
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	tg128	1862.24 ± 56.18

build: 3246fe8 (3637)

GGML_CUDA=1 GGML_NO_OPENMP=1 make -j
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	pp512	44386.72 ± 184.30
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	tg128	816.19 ± 3.35
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	pp512	144243.73 ± 363.10
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	tg128	1904.55 ± 64.01

build: c6328bc (3677)

Snapdragon X-Elite

~/src/llama.cpp-master
$ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

model	size	params	backend	threads	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	pp512	5345.36 ± 32.45
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	tg128	743.98 ± 26.26

build: 3246fe8 (3637)

~/src/llama.cpp-threadpool
$ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

model	size	params	backend	threads	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	pp512	5457.88 ± 4.70
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	tg128	1006.58 ± 7.88

build: c6328bc (3677)

Snapdragon Gen 3

Default Android build: armv8.7-a + openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh master llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

model	size	params	backend	threads	mmap	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	pp512	3099.16 ± 2.12
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	tg128	614.70 ± 115.46

build: 3246fe8 (3637)

Performance counter statistics:

#            count  event_name                # count / runtime
    38,854,765,094  cpu-cycles                # 3.014607 GHz      
       357,578,565  stalled-cycles-frontend   # 27.743 M/sec      
    16,329,331,107  stalled-cycles-backend    # 1.267 G/sec       
    85,859,994,043  instructions              # 6.662 G/sec       
        11,675,349  branch-misses             # 905.850 K/sec     
  12888.605049(ms)  task-clock                # 5.478707 cpus used
               934  context-switches          # 72.467 /sec       
             8,267  page-faults               # 641.419 /sec

Default Android build: armv8.7-a + no-openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh threadpool llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

model	size	params	backend	threads	mmap	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	pp512	3108.87 ± 5.66
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	tg128	750.33 ± 11.71

build: c6328bc (3677)

Performance counter statistics:

#            count  event_name                # count / runtime
    34,793,228,120  cpu-cycles                # 3.012136 GHz      
       325,677,417  stalled-cycles-frontend   # 28.195 M/sec      
    12,701,547,441  stalled-cycles-backend    # 1.100 G/sec       
    80,139,234,122  instructions              # 6.938 G/sec       
         8,967,312  branch-misses             # 776.323 K/sec     
  11550.896083(ms)  task-clock                # 5.611216 cpus used
               226  context-switches          # 19.566 /sec       
             7,976  page-faults               # 690.509 /sec
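
A few derived ratios make the two simpleperf reports above easier to compare. The sketch below only does arithmetic on counter values copied from those reports; the `PerfRun` struct and the helper names are made up for illustration, and IPC (instructions per cycle) is a metric added here, not one reported by the tool output above.

```cpp
#include <cassert>

// Counter values copied from the two simpleperf reports (Snapdragon Gen 3, -t 6).
struct PerfRun {
    double cycles;
    double instructions;
    double context_switches;
    double tg128_tps;  // tokens per second for the tg128 test
};

const PerfRun master_run     = { 38854765094.0, 85859994043.0, 934.0, 614.70 };
const PerfRun threadpool_run = { 34793228120.0, 80139234122.0, 226.0, 750.33 };

// Instructions retired per CPU cycle.
double ipc(const PerfRun & r) {
    assert(r.cycles > 0.0);
    return r.instructions / r.cycles;
}

// How many times fewer context switches the second run has than the first.
double context_switch_reduction(const PerfRun & before, const PerfRun & after) {
    assert(after.context_switches > 0.0);
    return before.context_switches / after.context_switches;
}

// Token-generation speedup of the second run over the first.
double tg_speedup(const PerfRun & before, const PerfRun & after) {
    assert(before.tg128_tps > 0.0);
    return after.tg128_tps / before.tg128_tps;
}
```

With these inputs, IPC comes out at about 2.21 for master and about 2.30 for threadpool, context switches drop by roughly 4.1x, and tg128 improves by roughly 1.22x.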

examples/llama-bench/llama-bench.cpp

slaren · 2024-08-29T01:55:34Z

ggml/include/ggml.h

+    enum ggml_sched_priority {
+        GGML_SCHED_PRIO_NORMAL,
+        GGML_SCHED_PRIO_MEDIUM,
+        GGML_SCHED_PRIO_HIGH,
+        GGML_SCHED_PRIO_REALTIME
+    };


Doesn't need to be done now, but it would be useful to have priorities below normal. I don't expect that increasing the priority of compute threads will be very useful outside of benchmarking, virtually every other thread is more important.

Sounds good. The main use cases we wanted to enable are benchmarking and low-latency LLM response (i.e. using fewer cores but having the threads quickly get CPU cycles). Bumping the priority a bit also encourages the Windows scheduler to place threads on the perf cores.
Will add lower priorities in threadpool V3.

ggml/src/ggml-backend.c

ggml/src/ggml.c

include/llama.h

ggml/include/ggml.h

Co-authored-by: slaren <slarengh@gmail.com>

max-krasnyansky · 2024-08-29T15:12:25Z

@mofosyne @ggerganov @slaren
Should be good to go now.

slaren · 2024-08-29T23:20:59Z

Good job!

max-krasnyansky · 2024-08-30T01:52:05Z

Thank you thank you!
Super fun discussions. Thanks for your patience with reviews and testing.
I'm going to get started on the V3 :) std::thread, std::atomic will make the code even better.

ggerganov · 2024-08-30T08:00:55Z

Thank you for the great work and thorough review 👍

FranzKafkaYu · 2024-09-06T02:01:19Z

@slaren Most of your comments & suggestions have been addressed. GGML API has been further cleaned up and simplified. Threadpool switching is now transparent: we switch on ggml_backend_cpu_set_threadpool(). Threadpool params have nice defaults.

Process priority setting has been moved into a helper function in common/common.cpp and only called from the sample apps.

src/llama.cpp looks much simpler now. Just a few lines of extra code that select the threadpool based on the number of tokens, same as selecting n_threads. And a couple of API calls to attach threadpools; those are just passthrough (i.e. used to pass the threadpool to ggml_graph_compute()).

As I mentioned above, llama-bench needs to explicitly manage threadpool creation because cpu-mask and related settings are now vectors (i.e. test params), as you suggested earlier. I included some examples of the output above. It's really neat how it can be used to figure out the best CPU pinning. (Unfortunately, this breaks compare-commits.sh for now because I added extra fields and the SQL tables don't match between branches.)

Performance looks pretty good across the board, see the report below (llama-v2-115M on key platforms). It looks like partial offload case (-ngl 10) on M2 Max is doing OK now.

We can iterate further on the automatic threadpool creation and reuse. I suggest we do that in Threadpool-V3 though, after we factor out thread/cpu/numa stuff into ggml-thread.cpp.
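The selection described above (pick the threadpool by token count, the same way n_threads is picked) can be sketched roughly as follows. This is a hedged illustration, not the code in src/llama.cpp: the struct names, the field names, and the `n_tokens == 1` test for the single-token case are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for an opaque threadpool handle.
struct threadpool_sketch { int n_threads; };

// Hypothetical context: one pool for single-token decode, one for batches.
// A NULL pool stands for "no threadpool attached".
struct context_sketch {
    struct threadpool_sketch * threadpool;
    struct threadpool_sketch * threadpool_batch;
    int n_threads;
    int n_threads_batch;
};

struct compute_plan_sketch {
    struct threadpool_sketch * pool;
    int n_threads;
};

// Choose the pool and thread count together, based on the number of tokens.
static struct compute_plan_sketch select_plan(const struct context_sketch * ctx, int n_tokens) {
    assert(ctx != NULL);
    struct compute_plan_sketch plan;
    if (n_tokens == 1) {
        plan.pool      = ctx->threadpool;
        plan.n_threads = ctx->n_threads;
    } else {
        plan.pool      = ctx->threadpool_batch;
        plan.n_threads = ctx->n_threads_batch;
    }
    return plan;
}
```

Keeping the two choices in one place means the thread count and the pool can never disagree about which phase is running.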

M2 Max

CC=clang CXX=clang++ CFLAGS="-march=armv8.7-a" CXXFLAGS="-march=armv8.7-a" make -j

(venv) ~/src/llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	pp512	9421.03 ± 97.37
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	tg128	817.88 ± 2.58
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	pp512	46534.56 ± 1708.85
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	tg128	1167.67 ± 8.06
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	pp512	46993.95 ± 1904.04
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	tg128	1169.23 ± 9.39
build: 3246fe8 (3637)

(venv) ~/src/llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_0.gguf -t 4 -r 20 -ngl 0,10,99

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	pp512	9543.05 ± 50.75
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	0	4	tg128	1003.64 ± 4.13
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	pp512	47665.38 ± 1765.27
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	10	4	tg128	1165.35 ± 8.50
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	pp512	46802.22 ± 2089.11
llama ?B Q4_0	70.81 MiB	116.93 M	Metal	99	4	tg128	1162.56 ± 6.41
build: c6328bc (3677)

Ryzen 9 3950X + RTX 3080

GGML_CUDA=1 make -j
llama.cpp-master$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	pp512	44574.85 ± 218.52
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	tg128	811.77 ± 4.67
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	pp512	144896.09 ± 446.62
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	tg128	1862.24 ± 56.18
build: 3246fe8 (3637)

GGML_CUDA=1 GGML_NO_OPENMP=1 make -j
llama.cpp-threadpool$ ./llama-bench -m ../gguf/llama-v2-115m.q4_k_m.gguf -t 8 -ngl 0,99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	threads	test	t/s
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	pp512	44386.72 ± 184.30
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	0	8	tg128	816.19 ± 3.35
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	pp512	144243.73 ± 363.10
llama ?B Q4_K - Medium	72.74 MiB	116.93 M	CUDA	99	8	tg128	1904.55 ± 64.01
build: c6328bc (3677)

Snapdragon X-Elite

~/src/llama.cpp-master $ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

model	size	params	backend	threads	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	pp512	5345.36 ± 32.45
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	tg128	743.98 ± 26.26
build: 3246fe8 (3637)

~/src/llama.cpp-threadpool $ ./build-arm64-windows-llvm-release/bin/llama-bench.exe -m ../gguf/llama-v2-115m.q4_0_4_8.gguf -t 4

model	size	params	backend	threads	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	pp512	5457.88 ± 4.70
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	4	tg128	1006.58 ± 7.88
build: c6328bc (3677)

Snapdragon Gen 3

Default Android build: armv8.7-a + openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh master llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/master'
./master/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

model	size	params	backend	threads	mmap	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	pp512	3099.16 ± 2.12
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	tg128	614.70 ± 115.46
build: 3246fe8 (3637)
Performance counter statistics:

#            count  event_name                # count / runtime
    38,854,765,094  cpu-cycles                # 3.014607 GHz      
       357,578,565  stalled-cycles-frontend   # 27.743 M/sec      
    16,329,331,107  stalled-cycles-backend    # 1.267 G/sec       
    85,859,994,043  instructions              # 6.662 G/sec       
        11,675,349  branch-misses             # 905.850 K/sec     
  12888.605049(ms)  task-clock                # 5.478707 cpus used
               934  context-switches          # 72.467 /sec       
             8,267  page-faults               # 641.419 /sec

Default Android build: armv8.7-a + no-openmp
$ adb shell "cd /data/local/tmp/lmcp; simpleperf stat ./run-bench.sh threadpool llama-v2-115m.q4_0_4_8.gguf -t 6"
export 'LD_LIBRARY_PATH=/data/local/tmp/lmcp/threadpool'
./threadpool/llama-bench --mmap 0 --n-gpu-layers 0 -m llama-v2-115m.q4_0_4_8.gguf -t 6

model	size	params	backend	threads	mmap	test	t/s
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	pp512	3108.87 ± 5.66
llama ?B Q4_0_4_8	70.81 MiB	116.93 M	CPU	6	0	tg128	750.33 ± 11.71
build: c6328bc (3677)
Performance counter statistics:

#            count  event_name                # count / runtime
    34,793,228,120  cpu-cycles                # 3.012136 GHz      
       325,677,417  stalled-cycles-frontend   # 28.195 M/sec      
    12,701,547,441  stalled-cycles-backend    # 1.100 G/sec       
    80,139,234,122  instructions              # 6.938 G/sec       
         8,967,312  branch-misses             # 776.323 K/sec     
  11550.896083(ms)  task-clock                # 5.611216 cpus used
               226  context-switches          # 19.566 /sec       
             7,976  page-faults               # 690.509 /sec      

Excellent work! May I ask whether you have tried Android on an ARM board with the CPU backend? What is the performance?

fmz · 2024-09-06T14:45:25Z

Excellent work! May I ask whether you have tried Android on an ARM board with the CPU backend? What is the performance?

That's the Snapdragon 8 Gen 3 :)

* Introduce ggml_compute_threadpool
  - OpenMP functional: check
  - Vanilla ggml functional: Check
  - ggml w/threadpool functional: Check
  - OpenMP no regression: No glaring problems
  - Vanilla ggml no regression: No glaring problems
  - ggml w/threadpool no regression: No glaring problems
* Minor fixes
* fixed use after release bug
* fixed a harmless race condition
* Fix Android build issue
* fix more race conditions
* fix deadlock for cases where cgraph.n_nodes == 1 and fix --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
  This way we avoid using E-Cores and Hyperthreaded siblings.
* bench: create fresh threadpool for each test
  For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier
  This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match openmp behavior
  All command line args now allow for setting poll to 0 (false).
* threadpool: do not wakeup threads in already paused threadpool
* fix potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
  We now start threadpool in paused state only if we have two. The resume is now implicit (i.e. new work), which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
  poll params (--poll, ...) now specify "polling level", i.e. how aggressively we poll before waiting on cond.var. poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ... The default value of 50 (i.e. 50x128K rounds) seems like a decent default across modern platforms. We can tune this further as things evolve.
* threadpool: reduce the number of barriers required
  New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for an extra barrier for clearing the "new_work" and removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
  With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching.
  Include n_threads in debug print for disposable threadpool.
  Declare pause and stop flags as atomic_bool. This doesn't actually generate any memory barriers and simply informs the thread sanitizer that these flags can be written & read by different threads without locking.
* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)
  This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from barriers.
* threadpool: use relaxed order for chunk sync
  Full memory barrier is overkill for this since each thread works on a different chunk.
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask related functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
  Also removes the need for explicit mask_specified param. An all-zero cpumask means use the default (usually inherited) cpu affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix swift wrapper errors due to n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled
* threadpool: replace checks for compute_thread ret code with proper status check
* threadpool: simplify threadpool init logic and fix main thread affinity application
  Most of the init code is now exactly the same between threadpool and openmp.
* threadpool: update threadpool resume/pause function names
* threadpool: enable openmp by default for now
* threadpool: don't forget to free workers state when omp is enabled
* threadpool: avoid updating process priority on the platforms that do not require it
  On Windows we need to change overall process priority class in order to set thread priorities, but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update calling thread prio and affinity only at start/resume
  This avoids extra syscalls for each graph_compute()
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool off between tests --delay
  This helps for long running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces a sleep for N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
  This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directly.
* threadpool: move all pause/resume logic into ggml
* threadpool: further api cleanup and prep for future refactoring
  All threadpool related functions and structs use ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve setpriority error message
* Update examples/llama-bench/llama-bench.cpp
  Co-authored-by: slaren <slarengh@gmail.com>
* threadpool: fix indent in set_threadpool call
* use int32_t for n_thread type in public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Android

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
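The hybrid polling and "atomic counter for new work" items in the commit message can be illustrated with a small sketch. It is a simplified model under stated assumptions, not the ggml implementation: `pool_sketch`, `wait_for_work`, and `kick` are hypothetical names, and the polling load uses acquire ordering for simplicity, where the real code may use different memory orders.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

// Per the commit message: level 0 means no polling, level N means
// N * 128K polling rounds before falling back to a condition variable.
static uint64_t poll_rounds(uint32_t poll_level) {
    return (uint64_t) poll_level * 128 * 1024;
}

// Hypothetical pool state: a counter bumped once per new graph.
struct pool_sketch {
    atomic_int      n_graph;
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
};

static void pool_init(struct pool_sketch * p) {
    atomic_init(&p->n_graph, 0);
    int rc = pthread_mutex_init(&p->mutex, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&p->cond, NULL);
    assert(rc == 0);
    (void) rc;
}

// Worker side: spin on the counter for a bounded number of rounds, then
// block on the condition variable. Returns the new counter value.
static int wait_for_work(struct pool_sketch * p, int last_seen, uint32_t poll_level) {
    const uint64_t rounds = poll_rounds(poll_level);
    for (uint64_t i = 0; i < rounds; i++) {
        int cur = atomic_load_explicit(&p->n_graph, memory_order_acquire);
        if (cur != last_seen) {
            return cur; // new work arrived while polling
        }
    }
    pthread_mutex_lock(&p->mutex);
    while (atomic_load(&p->n_graph) == last_seen) {
        pthread_cond_wait(&p->cond, &p->mutex);
    }
    int cur = atomic_load(&p->n_graph);
    pthread_mutex_unlock(&p->mutex);
    return cur;
}

// Main-thread side: publish new work by bumping the counter, then wake
// any worker that has already stopped polling.
static void kick(struct pool_sketch * p) {
    pthread_mutex_lock(&p->mutex);
    atomic_fetch_add(&p->n_graph, 1);
    pthread_cond_broadcast(&p->cond);
    pthread_mutex_unlock(&p->mutex);
}
```

Because the counter only ever increases, a worker compares it against the last value it saw instead of clearing a flag, which is how the sketch avoids the extra "clear new_work" barrier the commit message mentions.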

github-actions bot added the testing, examples, server, and ggml labels Jul 24, 2024

fmz force-pushed the threadpool branch 3 times, most recently from 043b9df to ef1ff14 Compare July 25, 2024 14:50

fmz force-pushed the threadpool branch from a7f4a5e to 0390df2 Compare July 27, 2024 12:38

fmz force-pushed the threadpool branch from 0390df2 to ed131c9 Compare July 31, 2024 16:46

fmz force-pushed the threadpool branch from ed131c9 to bebe99c Compare July 31, 2024 20:17

mofosyne added the Review Complexity : Medium label Aug 6, 2024

fmz force-pushed the threadpool branch 2 times, most recently from 4aa7a72 to 8ecdd36 Compare August 7, 2024 14:38

slaren reviewed Aug 27, 2024

threadpool: move process priority setting into the apps (bench and cli)

5d4c0a1

This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directly.

max-krasnyansky force-pushed the threadpool branch from adcd24c to 0778628 Compare August 28, 2024 00:20

threadpool: move all pause/resume logic into ggml

e3c2202

max-krasnyansky force-pushed the threadpool branch from 0778628 to e3c2202 Compare August 28, 2024 00:20

threadpool: further api cleanup and prep for future refactoring

c6328bc

All threadpool related functions and structs use ggml_threadpool prefix.

max-krasnyansky added 2 commits August 27, 2024 22:33

threadpool: minor indent fixes

bead7d4

threadpool: improve setpriority error message

8e8f8ce

slaren approved these changes Aug 29, 2024

max-krasnyansky and others added 6 commits August 28, 2024 20:54

Update examples/llama-bench/llama-bench.cpp

c6c27b1

Co-authored-by: slaren <slarengh@gmail.com>

threadpool: fix indent in set_threadpool call

b97bd67

use int32_t for n_thread type in public llama.cpp API

cae35b9

threadpool: use _new and _free instead of _create and _release

c49d634

fix two more public APIs to use int32_t for n_threads

3b5f7c2

build: set _GNU_SOURCE for Android

52aa677

slaren merged commit 42c76d1 into ggerganov:master Aug 29, 2024
52 checks passed

ggerganov mentioned this pull request Sep 3, 2024

changelog : libllama API #9289

Open

akx mentioned this pull request Sep 6, 2024

llama-bench : log benchmark progress #9287

Merged

ngxson mentioned this pull request Sep 6, 2024

ggml : fix missing cpu_set_t on emscripten #9336

Merged

ggerganov mentioned this pull request Sep 13, 2024

threadpool: skip polling for unused threads #9461

Merged

yagil mentioned this pull request Sep 21, 2024

LM-Studio 0.3.2 "clamps" the numbers of CPU cores from 16 to 8. lmstudio-ai/lmstudio-bug-tracker#130

Closed

ggerganov mentioned this pull request Oct 16, 2024

Use thread pool ggerganov/ggml#400

Closed
