llamafile : improve sgemm.cpp #6796
Conversation
- Re-enable by default
- Fix issue described in ggerganov#6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases
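As a rough illustration of the `m`/`n` edge-case handling the description refers to (a hypothetical sketch, not the PR's actual code): rather than dropping to a scalar fallback when the remaining rows or columns no longer fill the main register tile, the kernel can pick the largest tile that still fits.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical sketch of edge-case dispatch: choose the largest register tile
// (up to 4x4 here) that fits what remains of the output matrix, so leftover
// rows/columns from "weird" m and n still go through a blocked kernel.
// The tile menu below is illustrative, not taken from sgemm.cpp.
std::pair<int, int> pick_tile(int64_t m_left, int64_t n_left) {
    static const std::pair<int, int> tiles[] = {
        {4, 4}, {4, 2}, {2, 4}, {4, 1}, {1, 4}, {2, 2}, {2, 1}, {1, 2}, {1, 1},
    };
    for (const auto &t : tiles)
        if (t.first <= m_left && t.second <= n_left)
            return t;   // first entry that fits is the largest usable tile
    return {1, 1};      // degenerate case: a single output element
}
```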
The build fails on Mac:
$ make -j
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC: Apple clang version 15.0.0 (clang-1500.1.0.2.5)
I CXX: Apple clang version 15.0.0 (clang-1500.1.0.2.5)
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY -c sgemm.cpp -o sgemm.o
sgemm.cpp:515:13: error: unknown type name 'D'
D Cv[RN][RM] = {};
^
sgemm.cpp:516:41: error: use of undeclared identifier 'KN'
for (int l = 0; l < k; l += KN)
^
2 errors generated.
make: *** [sgemm.o] Error 1
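For context on the two errors: `D`, `KN`, `RN`, and `RM` are parameters of the templated tiling kernel in sgemm.cpp, so the failure means they are not in scope for this build path under Apple clang. A minimal sketch of the general pattern, containing the same two offending lines (only the names mirror the error message; the rest is illustrative, not the actual llamafile code):

```cpp
#include <cstdint>

// Illustrative register-blocked micro-kernel: D is the accumulator type,
// V the input element type, KN how many elements one "SIMD" step consumes
// (unrolled as scalars here), and RM x RN the register tile being computed.
template <typename D, typename V, int KN>
struct TinyGemm {
    const V *A;
    const V *B;
    D *C;
    int64_t lda, ldb, ldc;

    template <int RM, int RN>
    void gemm_tile(int64_t m0, int64_t n0, int64_t k) const {
        D Cv[RN][RM] = {};                  // the line clang rejects if D/RN/RM are missing
        for (int64_t l = 0; l < k; l += KN) // ditto for KN
            for (int j = 0; j < RN; ++j)
                for (int i = 0; i < RM; ++i)
                    for (int t = 0; t < KN; ++t)
                        Cv[j][i] += static_cast<D>(A[(m0 + i) * lda + l + t]) *
                                    static_cast<D>(B[(n0 + j) * ldb + l + t]);
        for (int j = 0; j < RN; ++j)
            for (int i = 0; i < RM; ++i)
                C[(n0 + j) * ldc + m0 + i] = Cv[j][i];
    }
};
```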
Also, apply the following patch to enable LLAMAFILE with CMake builds:
diff --git a/CMakeLists.txt b/CMakeLists.txt
index f134a153..58a1805b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -43,17 +43,11 @@ else()
set(LLAMA_METAL_DEFAULT OFF)
endif()
-# TODO: fix this for Android CI
-# https://github.com/ggerganov/llama.cpp/pull/6716#issuecomment-2061509191
-#if (CMAKE_SYSTEM_NAME MATCHES "ANDROID")
-# set(LLAMA_LLAMAFILE_DEFAULT OFF)
-#else()
-# set(LLAMA_LLAMAFILE_DEFAULT ON)
-#endif()
-
-# TODO: temporary disable until MoE is fixed
-# https://github.com/ggerganov/llama.cpp/pull/6716
-set(LLAMA_LLAMAFILE_DEFAULT OFF)
+if (CMAKE_SYSTEM_NAME MATCHES "ANDROID")
+ set(LLAMA_LLAMAFILE_DEFAULT OFF)
+else()
+ set(LLAMA_LLAMAFILE_DEFAULT ON)
+endif()
# general
option(BUILD_SHARED_LIBS "build shared libraries" OFF)
@ggerganov Fixed. PTAL. Do you know if there's a better way to fix the Android issue? IIUC it's due to an instruction not being available on 32-bit ARM. Is there a way we could solve that with an `#ifdef`?
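For illustration only, the kind of compile-time gate in question could look like the sketch below; the feature macros are an assumption, not necessarily what the code would end up using.

```cpp
// Illustrative guard, not the PR's code: only compile the ARM fp16 SGEMM path
// where the required vector instructions exist (64-bit ARM with fp16 vector
// arithmetic), so 32-bit Android ARM falls back to the generic ggml path.
#if defined(__aarch64__) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
#  define SGEMM_ARM_FP16_PATH 1
#else
#  define SGEMM_ARM_FP16_PATH 0
#endif
```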
@ggerganov Could you advise me on how I might bring the benefits of
On master:

make clean && LLAMA_NO_METAL=1 make -j && ./llama-bench -m models/mistral-7b-v0.2/ggml-model-fp16.gguf -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf -m models/mistral-7b-v0.2/ggml-model-q4_0.gguf -ngl 0 -n 0

build: 8960fe8 (2713)

With this PR, without Accelerate:

make clean && LLAMA_NO_ACCELERATE=1 LLAMA_NO_METAL=1 make -j && ./llama-bench -m models/mistral-7b-v0.2/ggml-model-fp16.gguf -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf -m models/mistral-7b-v0.2/ggml-model-q4_0.gguf -ngl 0 -n 0

build: 6b220dc (2704)

So for me, F16 is faster now, Q8_0 is the same, and Q4_0 is slower.

Btw, I've looked some more, and I think the proper call in ggml.c should be along these lines:

diff --git a/ggml.c b/ggml.c
index e3356bdb..086db96a 100644
--- a/ggml.c
+++ b/ggml.c
@@ -10878,15 +10878,13 @@ UseGgmlGemm1:;
const size_t row_size = ggml_row_size(vec_dot_type, ne10);
#if GGML_USE_LLAMAFILE
- if (src1_cont) {
+ if (src1->type != vec_dot_type) {
for (int64_t i13 = 0; i13 < ne13; i13++)
for (int64_t i12 = 0; i12 < ne12; i12++)
if (!llamafile_sgemm(ne01, ne11, ne00/ggml_blck_size(src0->type),
(const char *)src0->data + i12/r2*nb02 + i13/r3*nb03,
nb01/ggml_type_size(src0->type),
- (const char *)wdata + ggml_row_size(vec_dot_type,
- nb12/ggml_type_size(src1->type)*i12 +
- nb13/ggml_type_size(src1->type)*i13),
+ (const char *)wdata + (i12*ne11 + i13*ne12*ne11)*row_size,
row_size/ggml_type_size(vec_dot_type),
(char *)dst->data + i12*nb2 + i13*nb3,
nb1/ggml_type_size(dst->type),

Regarding Android: see lines 291 to 307 in e931888.
We should not repeat the same implementation twice - we have to see if we can reuse it. I also don't have Android builds set up, and it takes me some time to get a build running. So for now, let's focus on fixing the SGEMM, and later we can think about improving Android support. Let me think some time about the other question.
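As a small worked sketch of the offset arithmetic in the ggml.c patch above (assuming, as in that patch, that wdata holds one converted row of row_size bytes per src1 row, with ne11 rows per i12 slice and ne12*ne11 rows per i13 slice):

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors the expression `(i12*ne11 + i13*ne12*ne11)*row_size` from the patch:
// a given (i12, i13) batch slice starts after all the ne11-row slices that
// precede it in the converted wdata buffer.
static inline const char *wdata_slice(const char *wdata, size_t row_size,
                                      int64_t i12, int64_t i13,
                                      int64_t ne11, int64_t ne12) {
    return wdata + (i12 * ne11 + i13 * ne12 * ne11) * row_size;
}
```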
Review comments addressed. PTAL. Agree on Android 32-bit.
I studied the code for hours and managed to figure it out. I've got a
These changes cause failed assertions when running Cohere's Command R+ model:
When I reverted commit 192090b, the problem disappeared. Log files attached.
Here are instructions to trigger this assert:

# convert to GGUF
python3 convert-hf-to-gguf.py ~/Data/huggingface/c4ai-command-r-plus/ --outfile models/command-r-plus/ggml-model-f16.gguf --outtype f16
# quantize to Q8_0 + F16 token embeddings
make -j
./quantize --token-embedding-type f16 ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q8_0.gguf q8_0
# build in DEBUG and run
make clean
LLAMA_DEBUG=1 LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 make -j
./main -m ./models/command-r-plus/ggml-model-q8_0.gguf
I just tried a naive solution and replaced all `int`s in sgemm.cpp and sgemm.h with `int64_t`, and the resulting code works fine without any performance penalty (at least on my Epyc Genoa). There are also no more crashes due to `int` overflow in pointer calculations when using Command R+. By the way, @jart, thank you for these changes - they improved the prompt eval time on my system by 65% on llama-3 70B Q8!
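A minimal illustration of the overflow class being described (the shape below is made up, though it is in the same ballpark as Command R+'s largest tensors):

```cpp
#include <cstdint>
#include <cstdio>

// With int indices, row * leading-dimension products for very large weight
// matrices exceed INT_MAX, so the byte offset used for pointer arithmetic is
// wrong; widening everything to int64_t (as in the fix described above) keeps
// the full value.
int main() {
    int64_t lda = 256000;   // leading dimension, e.g. a vocab-sized axis
    int64_t row = 12000;    // row index reached during the matmul

    int64_t good = lda * row;                    // 3,072,000,000
    int32_t bad  = static_cast<int32_t>(good);   // what a 32-bit int would hold

    std::printf("64-bit offset: %lld\n", static_cast<long long>(good));
    std::printf("32-bit offset: %d (overflowed)\n", bad);
    return 0;
}
```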
Thanks for the inbox bump. Making this my top priority now. Expect a PR shortly.