Releases: ggerganov/llama.cpp

b3825

25 Sep 14:36
1e43630
ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* add a fallback mechanism for when the offline re-quantized model is not optimized for the underlying target

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
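
The fallback can be pictured as a run-time dispatch. A minimal, self-contained sketch (hypothetical types and names, not ggml's actual code), assuming the decision is made by comparing the model's offline repacking against the layout the target's optimized kernels expect:

```c
#include <stdio.h>

// Hypothetical packing layouts, for illustration only.
enum pack_layout { PACK_GENERIC, PACK_AARCH64_OPT };

struct q4_weights {
    enum pack_layout layout;   // how the offline re-quantized model was packed
    // ... quantized blocks would live here ...
};

static void gemv_q4_generic(const struct q4_weights * w) { (void) w; puts("generic GEMV"); }
static void gemv_q4_aarch64(const struct q4_weights * w) { (void) w; puts("AArch64-optimized GEMV"); }

// Instead of asserting on a mismatch, use the optimized kernel only when the
// packing matches what this target expects; otherwise fall back.
static void gemv_q4(const struct q4_weights * w, enum pack_layout target) {
    if (w->layout == target) {
        gemv_q4_aarch64(w);
    } else {
        gemv_q4_generic(w);
    }
}

int main(void) {
    struct q4_weights w = { PACK_GENERIC };
    gemv_q4(&w, PACK_AARCH64_OPT);   // mismatch -> prints "generic GEMV"
    return 0;
}
```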

b3824

25 Sep 14:09
afbbfaa
server : add more env vars, improve gen-docs (#9635)

* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
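
A rough sketch of how such an environment variable can substitute for a CLI flag (the real mapping lives in the common argument parser; the "set means disable" interpretation here is an assumption):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    bool ctx_shift = true;   // default: context shifting enabled
    // If LLAMA_ARG_NO_CONTEXT_SHIFT is set in the environment, treat it like
    // passing the corresponding flag on the command line.
    if (getenv("LLAMA_ARG_NO_CONTEXT_SHIFT") != NULL) {
        ctx_shift = false;
    }
    printf("context shift: %s\n", ctx_shift ? "enabled" : "disabled");
    return 0;
}
```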

b3823

25 Sep 08:44
3d6bf69
llama : add IBM Granite MoE architecture (#9438)

* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: alex.brooks@ibm.com

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
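
A conceptual sketch of that split (plain C rather than the actual Python converter; the row ordering of gate vs. up inside input_linear is an assumption made for illustration):

```c
#include <string.h>

// "merged" holds 2*n_ff rows of n_embd floats: the first n_ff rows are taken
// as the gate (w1) weights and the last n_ff rows as the up (w3) weights, so
// the standard mixtral gate_exps/up_exps path can be reused downstream.
static void split_input_linear(const float * merged, int n_ff, int n_embd,
                               float * gate, float * up) {
    const size_t half = (size_t) n_ff * (size_t) n_embd;
    memcpy(gate, merged,        half * sizeof(float));
    memcpy(up,   merged + half, half * sizeof(float));
}
```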

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
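
A toy sketch of where one such multiplier slots in, assuming (as in the dense Granite support) that the four scales apply to the embeddings, the attention scores, the residual branches, and the logits; only the residual case is shown, and the names are hypothetical:

```c
// Residual connections become x = x + s * f(x) instead of x = x + f(x),
// where s is the per-model residual multiplier read from the model metadata.
static void residual_add_scaled(float * x, const float * branch, int n, float residual_scale) {
    for (int i = 0; i < n; i++) {
        x[i] += residual_scale * branch[i];
    }
}
```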

* Typo fix in docstring

Co-Authored-By: ggerganov@gmail.com

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this tensor (`lm_head.weights`) as part of the architecture.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

b3822

25 Sep 04:55
904837e
cann: fix crash when llama-bench is running on multiple cann devices (#9627)

b3821

24 Sep 10:20
70392f1
ggml : add AVX512DQ requirement for AVX512 builds (#9622)
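
Illustrative guard only (not llama.cpp's build scripts): requiring AVX512DQ for AVX512 builds means code like the following, which uses a DQ-only intrinsic, can safely sit on the AVX512 path:

```c
#include <immintrin.h>

#if defined(__AVX512F__) && defined(__AVX512DQ__)
// _mm512_cvtepi64_pd (VCVTQQ2PD) belongs to the DQ subset, so this path must
// only be compiled when both AVX512F and AVX512DQ are available.
static __m512d to_f64(__m512i v) {
    return _mm512_cvtepi64_pd(v);
}
#endif
```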

b3820

24 Sep 09:50
bb5f819
sync : ggml

b3818

24 Sep 09:50
31ac583
llama : keep track of all EOG tokens in the vocab (#9609)

ggml-ci
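
A small usage sketch against the public API (assuming the `llama_token_is_eog(model, token)` helper from llama.h): with the vocab tracking every end-of-generation token, a single check covers EOS, EOT, and any other EOG ids:

```c
#include <stdbool.h>
#include "llama.h"

// Stop a sampling loop on any end-of-generation token instead of comparing
// against a single hard-coded EOS id.
static bool should_stop(const struct llama_model * model, llama_token tok) {
    return llama_token_is_eog(model, tok);
}
```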

b3817

24 Sep 09:26
cea1486
log : add CONT level for continuing previous log entry (#9610)
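
The idea, in a self-contained toy (not llama.cpp's actual logging macros): a CONT-level entry is appended to the previous line instead of starting a new one with its own prefix:

```c
#include <stdio.h>

enum toy_log_level { TOY_LOG_INFO, TOY_LOG_CONT };

static void toy_log(enum toy_log_level lvl, const char * msg) {
    if (lvl == TOY_LOG_CONT) {
        fputs(msg, stderr);                 // continue the previous entry, no prefix
    } else {
        fprintf(stderr, "INFO: %s", msg);   // start a new, prefixed entry
    }
}

int main(void) {
    toy_log(TOY_LOG_INFO, "loading model ...");
    toy_log(TOY_LOG_CONT, " done\n");       // lands on the same log line
    return 0;
}
```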

b3816

24 Sep 07:05
0aa1501
server : add newline after chat example (#9616)

b3814

24 Sep 05:37
c087b6f
threads: fix msvc build without openmp (#9615)

We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
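
One possible shape of such a shim (a sketch, not the actual patch): when C11 <stdatomic.h> is unavailable under MSVC, a full memory barrier intrinsic can stand in for a sequentially consistent fence:

```c
#if defined(_MSC_VER)
    #include <windows.h>
    // MemoryBarrier() issues a full hardware + compiler fence on MSVC targets.
    #define thread_fence_seq_cst() MemoryBarrier()
#else
    #include <stdatomic.h>
    #define thread_fence_seq_cst() atomic_thread_fence(memory_order_seq_cst)
#endif
```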