merge upstream #44

l3utterfly · 2024-11-05T07:31:27Z

Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
  → 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Add granite template to llama.cpp
* Add granite template to test-chat-template.cpp
* Update src/llama.cpp
  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* Update tests/test-chat-template.cpp
  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* Added proper template and expected output
* Small change to \n
  Small change to \n
* Add code space &
  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* Fix spacing
* Apply suggestions from code review
* Update src/llama.cpp
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

… MobileVLM model. (ggerganov#9763)

* ggml: Add POOL2D OP for GPU ACC to the Vulkan.
  - The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
  - A GGML_OP_POOL_2D shader has been added. (Pooling)
  - The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Correct the incorrect order of the parameters. fix casting to int.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
---------
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* ggml : RISC-V vector gemv for q4_0_8x8
* ggml : Added WIP rvv q4_0_8x8 gemm
* ggml : Added initial implementation of rvv gemm
* ggml : optimize gemm to avoid register spillover
* ggml : Fix GCC rvv load alignment issue
* ggml : Format gemm rvv code
* ggml : Fix a typo in RVV q4_0_8_8 GEMM

* q6_k instruction reordering attempt
* better subtract method
* should be theoretically faster
  small improvement with shuffle lut, likely because all loads are already done at that stage
* optimize bit fiddling
* handle -32 offset separately. bsums exists for a reason!
* use shift
* Update ggml-quants.c
* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86

ggerganov and others added 30 commits October 27, 2024 20:59

llama : switch KQ multiplication to F32 precision by default (ggerganov#10015)

8841ce3

ggml-ci

server : don't overfill the batch during infill (ggerganov#10018)

8125e6c

ggml-ci

musa: workaround for Guilty Lockup in cleaning src0 (ggerganov#10042)

524afee

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

llama : remove Tail-Free sampling (ggerganov#10071)

8d8ff71

ggml-ci

llama : refactor model loader with backend registry (ggerganov#10026)

c5b0f4b

convert : more detailed convert lora usage docs (ggerganov#10065)

79a2bc0

readme : more lora detail in main example readme (ggerganov#10064)

6763f71

ggml : fix memory leaks when loading invalid gguf files (ggerganov#10094)

b9e02e8

* ggml : fix gguf string leak when reading kv pairs fails
* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type
* ggml : avoid crashing on failed memory allocations when loading a gguf file

kompute: add backend registry / device interfaces (ggerganov#10045)

61408e7

Get in line with the other backends by supporting the newer backend/device registry interfaces. Signed-off-by: Sergio Lopez <slp@redhat.com>

kompute: add mul_mat_q4_k shader (ggerganov#10097)

1329c0a

This is a more or less direct translation from the Metal implementation to GLSL. Signed-off-by: Sergio Lopez <slp@redhat.com>

ggml : check tensor name lengths in gguf files (ggerganov#10100)

dea5e86

server : include scheme when printing URL (ggerganov#10106)

0a683e8

loader: refactor tensor weights storage (ggerganov#9935)

ab3d71f

* loader: refactor tensor weights storage
* use sorted map, sort weights by layer
---------
Co-authored-by: slaren <slarengh@gmail.com>

llama : fix buffer checks for mamba and rwk (ggerganov#10111)

c02e5ab

* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE

quantize : fix --keep-split (ggerganov#10114)

1e9f949

llama : improve output buffer type selection (ggerganov#10098)

85679d3

build: fix build error in Windows env with OneAPI setup (ggerganov#10107)

e597e50

ggml : alloc ggml_contexts on the heap (whisper/2525)

f221d56

sync : ggml

815fe72

ggml : remove ggml_scratch (ggerganov#10121)

1804adb

ggml-ci

server : fix smart selection of available slot (ggerganov#10120)

d865d14

* Fix smart selection of available slot
* minor fix
* replace vectors of tokens with shorthands

readme : update hot topics

ba6f62e

vulkan : improve ggml_vk_create_buffer error handling (ggerganov#9898)

418f5ee

llama : use smart pointers for ggml resources (ggerganov#10117)

e991e31

llama : add simple-chat example (ggerganov#10124)

a6744e4

* llama : add simple-chat example
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

convert-lora : make --base optional (ggerganov#10110)

7554aa4

* convert-lora : make `--base` optional
* lint
* handle case where base_model_name_or_path is invalid
* do not include metadata from base model
* clarify unspecified --base
* add small comment [no ci]
* trigger ci

kohnech and others added 17 commits November 2, 2024 15:35

Add apple arm to presets (ggerganov#10134)

9830b69

* Add apple arm to presets
* Add final new line

flake.lock: Update (ggerganov#10146)

1839f69

metal : minor fixup in FA kernel (ggerganov#10143)

08828a6

* metal : minor fixup in FA kernel
  ggml-ci
* metal : use the unrolled loop variable
* metal : remove unused var

ggml : move CPU backend to a separate file (ggerganov#10144)

9f40989

metal : fix minor string leaks (ggml/1004)

e2292aa

cmake : make it possible linking ggml as external lib (ggml/1003)

284e5b0

sync : ggml

ce027ad

CANN: adjust backend registry refactor. (ggerganov#10158)

329ed91

remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.

metal : move dequantize templates to beginning of MSL source (#0)

f8e5813

metal : simplify f16 and f32 dequant kernels (#0)

05697f6

cuda : clear error after changing peer access (ggerganov#10153)

ea02c75

fix build break on arm64 linux (ggerganov#10166)

6a066b9

This fixes the build break from the recent changes to move the CPU backend to separate files ggerganov#10144

server : clarify /slots endpoint, add is_processing (ggerganov#10162)

9e0ecfb

* server : clarify /slots endpoint, add is_processing
* fix tests

ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (ggerganov#10167)

401558b

ggml : fix gelu tables initialization (ggerganov#10172)

d5a409e

ggml : fix arch check in bf16_to_fp32 (ggerganov#10164)

a9e8a9a

l3utterfly merged commit 7a19c33 into layla-build Nov 5, 2024
68 checks passed

github-actions bot added SYCL, Nvidia GPU, Vulkan, testing, build, examples, devops, python, server, ggml, Kompute, script labels Nov 5, 2024
