merge from llama.cpp #33

Merged
merged 29 commits into layla-build on Aug 12, 2024

Conversation

l3utterfly
Owner

DrDub and others added 29 commits August 8, 2024 11:44
`ggml/src/llamafile/sgemm.o` was not deleted on `make clean`
* gguf-py : use classes for quants

* convert_hf : simplify internal quantization type selection

* gguf-py : fix flake8 lint

* gguf-py : fix BF16 numpy view type

* gguf-py : remove LlamaFileTypeMap

Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.

* gguf-py : add generic quantize and dequantize functions

The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.
* llama : avoid useless copies in dummy session writer

* llama : avoid double tensor copy when saving session to buffer

This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.

The motivation for adding this option is that currently, if the model
used does not specify a pooling type, the embedding example fails
with the following error message:
```console
main: error: pooling type NONE not supported
```

This commit also updates the name of the executable in the examples
section.
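For context, the `--pooling` flag has a programmatic counterpart in `llama_context_params`. Below is a minimal C++ sketch, assuming the llama.h API at the time of this merge and a placeholder model path, of forcing mean pooling so that embeddings still work when the GGUF metadata specifies no pooling type:

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    // load the model with default parameters ("model.gguf" is a placeholder path)
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // force mean pooling, mirroring `llama-embedding --pooling mean`,
    // so models without a pooling type in their metadata still produce embeddings
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    // ... tokenize, llama_decode(), then llama_get_embeddings_seq(ctx, 0) ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```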
* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

---------

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* update cmakelist

* update cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when the image needs to be resized smaller

* address review comments and modify

* address review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove the directory entry at line 33 of the top-level CMakeLists.txt (not the one in examples, the one in the main dir)

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* clip : style changes

* del common.h in clip

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix makefile error

* fix ubuntu-make error

* try fix clip

* try fix 1

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : better replace_all (cont)

ggml-ci

* code : deduplicate replace_all

ggml-ci
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Signed-off-by: tarilabs <matteo.mortari@gmail.com>
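A minimal sketch of an efficient `replace_all` along the lines the two commits above describe, building the result in a single pass instead of repeatedly erasing and inserting; the actual helper in llama.cpp may differ in detail:

```cpp
#include <string>

// replace every occurrence of `search` in `s` with `replace`,
// building the result in one pass to avoid quadratic behaviour
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return;
    }
    std::string result;
    result.reserve(s.size());
    size_t last = 0;
    size_t pos  = 0;
    while ((pos = s.find(search, last)) != std::string::npos) {
        result.append(s, last, pos - last); // copy the unchanged chunk
        result += replace;                  // then the replacement
        last = pos + search.size();
    }
    result.append(s, last, std::string::npos); // copy the tail
    s = std::move(result);
}
```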
* gguf-py : add T5ENCODER model architecture

* common : call llama_decode() during warmup only if the model has a decoder (see the sketch after this commit list)

* convert-hf : add T5EncoderModel

* llama : add llama_model_has_decoder() API function

* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()

* llama : add support for LLM_ARCH_T5ENCODER

* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE

* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
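A minimal C++ sketch of the warmup guard referenced above, assuming the llama.h API at the time of this merge (`llama_model_has_encoder()`, `llama_model_has_decoder()`, `llama_encode()`, `llama_decode()`); encoder-only models such as T5ENCODER skip the decoder call:

```cpp
#include "llama.h"
#include <vector>

// run a throwaway batch through the model so weights and buffers are
// paged in before timing or serving real requests
static void warmup(llama_context * ctx, const llama_model * model) {
    std::vector<llama_token> tmp = { llama_token_bos(model), llama_token_eos(model) };
    llama_batch batch = llama_batch_get_one(tmp.data(), (int32_t) tmp.size(), 0, 0);

    if (llama_model_has_encoder(model)) {
        llama_encode(ctx, batch);
    }
    if (llama_model_has_decoder(model)) {
        // encoder-only models (e.g. LLM_ARCH_T5ENCODER) have no decoder,
        // so calling llama_decode() here would fail
        llama_decode(ctx, batch);
    }
    llama_kv_cache_clear(ctx);
}
```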
* default n_swa for phi-3

* fix

* double check swa
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (ggerganov#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove (a generic illustration of the pattern follows below).
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding barriers only for shader reads/writes and transfers seems to be sufficient, judging from the code, which either launches compute kernels or copies tensors.
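The first bullet is a standard allocation-hoisting fix. Here is a generic C++ illustration of that pattern (not the actual ggml Vulkan code; the names are made up), replacing a per-call temporary with reused scratch storage:

```cpp
#include <cstddef>
#include <vector>

// before: a temporary vector is allocated and freed on every call,
// which shows up clearly in a sampling profiler on hot paths
void record_copies_allocating(size_t n) {
    std::vector<size_t> offsets(n); // heap allocation per call
    // ... fill offsets and record the copies ...
}

// after: the scratch buffer lives in a longer-lived context and is
// reused across calls, so steady-state calls do not allocate at all
struct record_ctx {
    std::vector<size_t> offsets; // reused scratch storage
};

void record_copies_reusing(record_ctx & ctx, size_t n) {
    ctx.offsets.clear();   // keeps the capacity from previous calls
    ctx.offsets.resize(n);
    // ... fill ctx.offsets and record the copies ...
}
```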

* Fix small typo

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Neo Zhang <>
* gguf-py : Numpy dequantization for most types

* gguf-py : Numpy dequantization for grid-based i-quants

l3utterfly merged commit 260527b into layla-build on Aug 12, 2024
60 of 75 checks passed