sync : ggml #2573

Merged
merged 41 commits into from
Nov 20, 2024


ggerganov
Owner

@ggerganov ggerganov commented Nov 19, 2024

TODO:

  • fix examples
  • start using backend registry
  • update Makefile

ggerganov and others added 30 commits November 19, 2024 18:59
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
…a/9921)

* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allows the compiler to optimize the
  operation with a native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692d61360b46954a1c7f780bd2e569b73.
* use 128 bit loads (I've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kept for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
* ggml : remove duplicated sources from the last sync

ggml-ci

* cont : remove FindSIMD.cmake [no ci]
* ggml: new optimization interface

remove test2.c, test3.c

store adamw params in tensor

move grads from tensor to graph

* avoid segfault upon API misuse

* add ggml-opt.h to public headers

* remove dependence of ggml-opt.cpp on ggml-cpu.h
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
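
The row-pairing idea in the commit message above can be illustrated on the CPU. This is only a rough sketch, not the Vulkan shader: the function name `matvec_two_rows`, the plain `float` storage (no Q4/Q5 dequantization) and the row-major layout are all assumptions made for illustration. It shows the inner loop loading each element of B once and using it for two rows, with a bounds check for the last iteration when the row count is odd.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU analogue of "two result elements per workgroup".
// Computes y = A * b, with A stored row-major as rows x cols floats.
std::vector<float> matvec_two_rows(const std::vector<float> & A,
                                   const std::vector<float> & b,
                                   size_t rows, size_t cols) {
    assert(A.size() == rows * cols && b.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (size_t r = 0; r < rows; r += 2) {
        // Bounds check for the last iteration when rows is odd.
        const bool has_second = r + 1 < rows;
        const float * a0 = A.data() + r * cols;
        // If there is no second row, alias the first so reads stay in bounds.
        const float * a1 = has_second ? a0 + cols : a0;
        float acc0 = 0.0f;
        float acc1 = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            const float bv = b[c]; // loaded once, reused for both rows
            acc0 += a0[c] * bv;
            acc1 += a1[c] * bv;
        }
        y[r] = acc0;
        if (has_second) {
            y[r + 1] = acc1; // only written when the row exists
        }
    }
    return y;
}
```

In this sketch the two rows are unrolled by hand as two accumulators instead of being left to the compiler, which mirrors the manual partial unrolling the commit message mentions.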
* metal : add kernel arg structs (wip)

* metal : fattn args

ggml-ci

* metal : cont + avoid potential int overflow [no ci]

* metal : mul mat struct (wip)

* cont : mul mat vec

* cont : pass by reference

* cont : args is first argument

* cont : use char ptr

* cont : shmem style

* cont : thread counters style

* cont : mul mm id

ggml-ci

* cont : int safety + register optimizations

ggml-ci

* metal : GGML_OP_CONCAT

ggml-ci

* metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV

* metal : GGML_OP_REPEAT

* metal : GGML_OP_CPY

* metal : GGML_OP_RMS_NORM

* metal : GGML_OP_NORM

* metal : add TODOs for rest of ops

* ggml : add ggml-metal-impl.h

ggml-ci
* Vulkan: Fix device info output format specifiers

* Vulkan: Use zu printf specifier for size_t instead of ld
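
A minimal sketch of the specifier fix, assuming a hypothetical helper `format_size` and a made-up message string (neither is taken from the Vulkan backend). `%zu` is the standard conversion for `size_t`, whereas `%ld` expects a `long`, which is narrower than `size_t` on 64-bit Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format a byte count for a log line.
std::string format_size(size_t n_bytes) {
    char buf[64];
    // %zu matches size_t on every platform. %ld expects long, which is
    // 32-bit on 64-bit Windows while size_t is 64-bit there.
    const int n = std::snprintf(buf, sizeof(buf), "size = %zu bytes", n_bytes);
    assert(n > 0 && (size_t) n < sizeof(buf));
    return std::string(buf, (size_t) n);
}
```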
-- While running StableDiffusion.cpp locally with Metal, some offsets overflow and result in incorrect calculations
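
To illustrate the kind of offset overflow described here, the following sketch compares 32-bit and 64-bit byte-offset arithmetic. The helper names and the row/stride framing are hypothetical and not taken from the Metal kernels. Unsigned types are used so that the wraparound is well defined in C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical byte-offset helpers; the names are illustrative only.

// 32-bit arithmetic: the product wraps modulo 2^32 once it exceeds UINT32_MAX.
uint32_t offset_32(uint32_t row, uint32_t row_bytes) {
    return row * row_bytes;
}

// 64-bit arithmetic: widen one operand before multiplying so the product is exact.
uint64_t offset_64(uint32_t row, uint32_t row_bytes) {
    return (uint64_t) row * row_bytes;
}

// Narrowing variant that asserts instead of silently wrapping.
uint32_t offset_32_checked(uint32_t row, uint32_t row_bytes) {
    const uint64_t wide = offset_64(row, row_bytes);
    assert(wide <= UINT32_MAX);
    return (uint32_t) wide;
}
```

For example, row 70000 with a 70000-byte stride needs the offset 4900000000, which does not fit in 32 bits; `offset_32` returns 605032704 instead.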
Seems like this isn't working for vulkan-over-metal when the array is sized
by a spec constant. Maybe a spirv-cross limitation?
Alcpz and others added 9 commits November 19, 2024 19:02
* vulkan: Optimize soft_max

Large soft_max could already saturate memory, but small/medium sizes were
pretty slow. The bulk of the gains for them comes from using a smaller
workgroup size, and making the workgroup size match the subgroup size also
makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp
out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more
than 512 rows and the dispatch is 512 x H.

* vulkan: Further soft_max optimizations

Restore the workgroup size of 512 case, use it for >1024.

Use unrollable loops for more iteration counts.
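
The out-of-bounds case mentioned above can be simulated on the CPU. This is a sketch under stated assumptions, not the shader itself: it assumes one workgroup per row and the mapping row = gy * 512 + gx for a 512 x H dispatch, and `simulate_dispatch` is a hypothetical name. With more than 512 rows, the last grid row generally contains workgroups whose row index is past the end, and those must exit before touching memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed mapping (not taken from the shader source): one workgroup per row,
// dispatched as a 512 x H grid, so row = gy * 512 + gx.
constexpr uint32_t GRID_X = 512;

// Returns how many workgroups did real work and marks each visited row in `seen`.
uint32_t simulate_dispatch(uint32_t n_rows, std::vector<uint8_t> & seen) {
    seen.assign(n_rows, 0);
    const uint32_t grid_y = (n_rows + GRID_X - 1) / GRID_X; // H = ceil(n_rows / 512)
    uint32_t active = 0;
    for (uint32_t gy = 0; gy < grid_y; ++gy) {
        for (uint32_t gx = 0; gx < GRID_X; ++gx) {
            const uint32_t row = gy * GRID_X + gx;
            if (row >= n_rows) {
                continue; // stands in for the shader's early return on OOB rows
            }
            assert(row < seen.size());
            seen[row] = 1;
            ++active;
        }
    }
    return active;
}
```

With 600 rows the simulated grid is 512 x 2, so 1024 workgroups are launched and 424 of them take the early exit.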
…/10266)

* Add option to set the SYCL architecture for all targets
* Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option
* Document that setting GGML_SYCL_ARCH can improve the performance
@ggerganov
Owner Author

@slaren I am getting the following assertion after the sync:

make -j && ./main -m models/ggml-base.bin -f samples/jfk.wav
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 14.01 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.92 MiB
whisper_init_state: compute buffer (conv)   =   17.22 MB
Assertion failed: (src_backend_id != -1), function ggml_backend_sched_split_graph, file ggml-backend.cpp, line 1165.
Process 66825 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x0000000100048950 main`ggml_backend_sched_split_graph(sched=0x000000013890f400, graph=0x0000000148648020) at ggml-backend.cpp:1165:17
   1162	
   1163	               size_t src_id = hash_id(src);
   1164	               const int src_backend_id = sched->hv_tensor_backend_ids[src_id];
-> 1165	               assert(src_backend_id != -1); // all inputs should be assigned by now
   1166	
   1167	               if (src->flags & GGML_TENSOR_FLAG_INPUT && sched->n_copies > 1) {
   1168	                   if (tensor_id_copy(src_id, src_backend_id, 0) == NULL) {

The problem seems to be that the "encode" scheduler does not know about the embd_conv tensor, which is the result of the previous "conv" graph. What would be the recommended way to fix this? I think I can copy the embd_conv data to host memory after the "conv" graph and then copy it back to device memory before calling the "encode" graph. But I wonder if this copy can be avoided.

@slaren
Collaborator

slaren commented Nov 20, 2024

Should be fixed now, sorry about that.

@ggerganov
Owner Author

Nice, thank you!

@ggerganov
Owner Author

@KitaitiMakoto With this PR, the ggml source tree has changed a bit and the Ruby bindings need to be adapted accordingly. I'll leave them in a broken state for now, but you can make a PR either to this branch or later to master to resolve the build. Thanks.

@KitaitiMakoto
Contributor

Okay, I will make a pull request to master after this pull request is merged. Thank you for mentioning me.

@ggerganov ggerganov marked this pull request as ready for review November 20, 2024 13:57
@ggerganov
Owner Author

I can't figure out why the whisper.objc CI is failing to use the correct include path. Locally, it runs successfully.

@ggerganov ggerganov merged commit 37c8802 into master Nov 20, 2024
85 of 89 checks passed
@ggerganov ggerganov deleted the sync branch November 20, 2024 19:00