sync : ggml #2444

ggerganov · 2024-10-02T12:15:17Z

No description provided.

A return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants when the graph is built, so that the local workgroup could be sized appropriately. But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels
* added fallback mechanism when the offline re-quantized model is not optimized for the underlying target.
* fix for build errors
* remove prints from the low-level code
* Rebase to the latest upstream

* ggml: Added run-time detection of neon, i8mm and sve

  Adds run-time detection of the Arm instruction set features neon, i8mm and sve for Linux and Apple build targets.
* ggml: Extend feature detection to include non aarch64 Arm arch
* ggml: Move definition of ggml_arm_arch_features to the global data section
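As an illustration of what run-time detection of these features can look like on Linux, here is a minimal sketch. It is not the code from the commit: the `arm_features` struct and `detect_arm_features` function are hypothetical names, the sketch assumes glibc's `getauxval` and the kernel's `HWCAP_*` macros, and it covers only 64-bit Arm Linux (the Apple and non-aarch64 paths mentioned above are left out). On any other target it simply reports every feature as absent.

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP, AT_HWCAP2
#include <asm/hwcap.h>  // HWCAP_* bit masks (some may be missing in old headers)
#endif

// Hypothetical container for the three features named in the commit message.
struct arm_features {
    bool neon = false;
    bool i8mm = false;
    bool sve  = false;
};

// Query the kernel-provided hardware capability bits once at run time.
// Every flag stays false when the target is not 64-bit Arm Linux, or when
// the build headers do not define the corresponding macro.
inline arm_features detect_arm_features() {
    arm_features f;
#if defined(__linux__) && defined(__aarch64__)
    const unsigned long hwcap  = getauxval(AT_HWCAP);
    const unsigned long hwcap2 = getauxval(AT_HWCAP2);
    (void) hwcap2; // silences an unused warning if HWCAP2_I8MM is undefined
#ifdef HWCAP_ASIMD
    f.neon = (hwcap & HWCAP_ASIMD) != 0;   // Advanced SIMD, i.e. NEON
#endif
#ifdef HWCAP_SVE
    f.sve = (hwcap & HWCAP_SVE) != 0;
#endif
#ifdef HWCAP2_I8MM
    f.i8mm = (hwcap2 & HWCAP2_I8MM) != 0;
#endif
#endif
    return f;
}
```

The `#ifdef` guards are a design choice for this sketch: older toolchain headers may lack some of the macros, in which case the feature is conservatively reported as unavailable rather than breaking the build.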

When the device's warp size is less than 16, loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) can be set to 0. This is because they are calculated as the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16, and the workgroup size is set to be the same as the warp/subgroup size.

The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication. When they are 0 they cause an infinite loop. But infinite loops without side effects are UB, and the values of loadstride_* are known at compile time, so the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0.

We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
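The arithmetic behind this fix can be checked on the host with a small sketch. This is not the shader or the backend code: `load_stride`, `clamp_workgroup_size` and `checked_load_stride` are hypothetical helpers that only mirror the formula as the commit message describes it (workgroup size times LOAD_VEC_*, divided by 16) and the clamp to a minimum of 16.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stride formula as described in the commit message:
// workgroup size * LOAD_VEC_* / 16, using integer division.
constexpr uint32_t load_stride(uint32_t workgroup_size, uint32_t load_vec) {
    return workgroup_size * load_vec / 16;
}

// Hypothetical helper mirroring the fix: never let the workgroup size drop
// below 16, even when the device reports a smaller warp/subgroup size.
constexpr uint32_t clamp_workgroup_size(uint32_t subgroup_size) {
    return std::max<uint32_t>(subgroup_size, 16);
}

// A warp size of 8 with LOAD_VEC_* == 1 truncates to a zero stride ...
static_assert(load_stride(8, 1) == 0, "8 * 1 / 16 truncates to 0");
// ... while the clamped workgroup size always yields a usable increment.
static_assert(load_stride(clamp_workgroup_size(8), 1) == 1, "16 * 1 / 16 == 1");
static_assert(load_stride(clamp_workgroup_size(32), 1) == 2, "32 * 1 / 16 == 2");

// Run-time variant that refuses to hand back a zero loop increment.
inline uint32_t checked_load_stride(uint32_t subgroup_size, uint32_t load_vec) {
    const uint32_t stride = load_stride(clamp_workgroup_size(subgroup_size), load_vec);
    assert(stride > 0 && "a zero stride would never advance the load loop");
    return stride;
}
```

Because the clamped workgroup size is at least 16 and LOAD_VEC_* is at least 1, the integer division can no longer truncate to 0, so the loops that fill the buffers always advance.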

ggerganov and others added 18 commits October 2, 2024 15:11

ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)

443678a

vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)

45c860f

vulkan : multithread pipeline creation (ggml/963)

ec8e919

CUDA: remove bad assert (ggml/972)

44e2d39

cann: fix crash when llama-bench is running on multiple cann devices (llama/9627)

d2eac9f

mtgpu: enable VMM (llama/9597)

0448551

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

Enable use to the rebar feature to upload buffers to the device. (llama/9251)

fd5cb2b

ggml : define missing HWCAP flags (llama/9684)

9d176ca

ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>

ggml: fix gradient allocation logic (ggml/966)

034ed81

* ggml: fix gradient allocation logic
* gradient allocation in ggml_build_backward_expand
* fixup
* fix test-backend-ops grad
* suggestions by slaren
* fix test1.c
* fix legacy opt API
* fix test-grad0
* remove keep arg

ggml : fix ggml_cast (ggml/973)

ee8e29c

test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)

2eda43a

sync : ggml

fce227e

metal : reduce command encoding overhead (llama/9698)

f083908

talk-llama : sync llama.cpp

b4c9631

ggerganov merged commit ccc2547 into master Oct 3, 2024
87 checks passed

ggerganov deleted the sync branch October 3, 2024 09:22
