llama.vim : plugin for Neovim #9787

ggerganov · 2024-10-08T11:27:13Z

ref ggml-org/p1#1

The plugin is now developed here: https://github.com/ggml-org/llama.vim

Overview

Add a simple Neovim plugin for local LLM-assisted code/text completion.

Features

Auto-suggest on cursor movement in Insert mode
Toggle the suggestion manually by pressing Ctrl+F
Accept a suggestion with Tab
Accept the first line of a suggestion with Shift+Tab
Control max text generation time
Configure scope of context around the cursor
Ring context with chunks from open and edited files and yanked text
Supports very large contexts even on low-end hardware via smart context reuse
Display performance stats

Usage

Set up a llama-server instance with a FIM-compatible model (RoPE required). For example:

llama-server \
    --hf-repo ggerganov/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
    --hf-file qwen2.5-coder-1.5b-q8_0.gguf \
    --port 8012 -ngl 99 -fa -ub 1024 -b 1024 -dt 0.1 \
    --cache-reuse 256

Works best with Qwen2.5-Coder models (not the Instruct variants).

Copy or symlink examples/llama.vim to ~/.config/nvim/autoload/llama.vim
Start Neovim and run:
```
:call llama#init()
```

For more advanced options, check the parameters in g:llama_config in examples/llama.vim:

llama.cpp/examples/llama.vim

Lines 43 to 86 in acf6d19

" general parameters:
"
"   endpoint:         llama.cpp server endpoint
"   n_prefix:         number of lines before the cursor location to include in the prefix
"   n_suffix:         number of lines after  the cursor location to include in the suffix
"   n_predict:        max number of tokens to predict
"   t_max_prompt_ms:  max alloted time for the prompt generation (TODO: not yet supported)
"   t_max_predict_ms: max alloted time for the prediction
"   show_info:        show extra info about the inference (0 - disabled, 1 - statusline, 2 - inline)
"   auto_fim:         trigger FIM completion automatically on cursor movement
"   max_line_suffix:  do not auto-trigger FIM completion if there are more than this number of characters to the right of the cursor
"
" ring buffer of chunks, accumulated with time upon:
"
"  - completion request
"  - yank
"  - entering a buffer
"  - leaving a buffer
"  - writing a file
"
" parameters for the ring-buffer with extra context:
"
"   ring_n_chunks:    max number of chunks to pass as extra context to the server (0 to disable)
"   ring_chunk_size:  max size of the chunks (in number of lines)
"                     note: adjust these numbers so that you don't overrun your context
"                           at ring_n_chunks = 64 and ring_chunk_size = 64 you need ~32k context
"   ring_scope:       the range around the cursor position (in number of lines) for gathering chunks after FIM
"   ring_update_ms:   how often to process queued chunks in normal mode
"
let s:default_config = {
    \ 'endpoint':         'http://127.0.0.1:8012/infill',
    \ 'n_prefix':         256,
    \ 'n_suffix':         8,
    \ 'n_predict':        64,
    \ 't_max_prompt_ms':  500,
    \ 't_max_predict_ms': 200,
    \ 'show_info':        2,
    \ 'auto_fim':         v:true,
    \ 'max_line_suffix':  8,
    \ 'ring_n_chunks':    64,
    \ 'ring_chunk_size':  64,
    \ 'ring_scope':       1024,
    \ 'ring_update_ms':   1000,
    \ }
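The "~32k context" note in the comments above can be checked with a little arithmetic. The sketch below is only a rough estimate: the average of 8 tokens per line is an assumption chosen because it reproduces the figure in the note, and the real number depends on the language and the tokenizer. The function name is hypothetical.

```python
def ring_context_tokens(ring_n_chunks: int, ring_chunk_size: int,
                        tokens_per_line: float = 8.0) -> int:
    """Rough upper bound on the tokens used by a completely full ring buffer.

    tokens_per_line is an assumed average, not a value taken from the plugin.
    """
    return int(ring_n_chunks * ring_chunk_size * tokens_per_line)


# Defaults from the config above: 64 chunks of 64 lines each.
print(ring_context_tokens(64, 64))  # 64 * 64 * 8 = 32768
# A low-end setup from the sample configs: 4 chunks of 16 lines.
print(ring_context_tokens(4, 16))   # 4 * 16 * 8 = 512
```

The result should stay comfortably below the --ctx-size given to llama-server, since the local FIM context and the generated tokens need room as well.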

Sample configs based on hardware

High-end hardware with GPU

# llama-server: 7B LLM or above

--batch 2048
--flash-attn

# llama.vim:

g:llama_config.ring_n_chunks   = 64
g:llama_config.ring_chunk_size = 64

Mid-end hardware with GPU

# llama-server: 1.5B or 7B LLM

--batch [512, 1024]
--ctx-size [8192, 32768]
--flash-attn

# llama.vim:

g:llama_config.ring_n_chunks   = [32, 64]
g:llama_config.ring_chunk_size = [32, 64]

Low-end hardware with GPU

# llama-server: 1.5B LLM

--batch [512, 1024]
--ctx-size [2048, 8192]
--flash-attn

# llama.vim:

g:llama_config.ring_n_chunks   = [4, 16]
g:llama_config.ring_chunk_size = [16, 32]

Low-end hardware (CPU only)

# llama-server: 1.5B LLM

--batch [256, 512]
--ctx-size [1024, 4096]

# llama.vim:

g:llama_config.ring_n_chunks   = [0, 8]
g:llama_config.ring_chunk_size = [16, 32]

Backend changes

Debugging

Start llama-server .. -lv 1
Enable GGML_DEBUG_SAMPLER_INFILL in llama-sampling.cpp

Technical details

The plugin uses the /infill endpoint of the llama-server. It sends asynchronous FIM requests to the server via the curl tool:

let l:request = json_encode({
    \ 'input_prefix':     l:prefix,
    \ 'input_suffix':     l:suffix,
    \ 'input_extra':      l:extra_context,
    \ 'prompt':           l:prompt,
    \ 'n_predict':        g:llama_config.n_predict,
    \ 'n_indent':         l:indent,
    \ 'top_k':            40,
    \ 'top_p':            0.99,
    \ 'stream':           v:false,
    \ 'samplers':         ["top_k", "top_p", "infill"],
    \ 'cache_prompt':     v:true,
    \ 't_max_prompt_ms':  g:llama_config.t_max_prompt_ms,
    \ 't_max_predict_ms': g:llama_config.t_max_predict_ms
    \ })

let l:curl_command = printf(
    \ "curl --silent --no-buffer --request POST --url %s --header \"Content-Type: application/json\" --data %s",
    \ g:llama_config.endpoint, shellescape(l:request)
    \ )
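For experimenting with the endpoint outside of Neovim, the same request can be sketched in Python. The payload fields and default values mirror the Vim snippet above; the function names are hypothetical, and the call is blocking, unlike the asynchronous job used by the plugin.

```python
import json
import urllib.request


def build_infill_request(prefix, suffix, prompt, extra_context, n_indent,
                         n_predict=64, t_max_prompt_ms=500, t_max_predict_ms=200):
    """Build the JSON body with the same fields as the Vim snippet above."""
    return {
        "input_prefix": prefix,
        "input_suffix": suffix,
        "input_extra": extra_context,
        "prompt": prompt,
        "n_predict": n_predict,
        "n_indent": n_indent,
        "top_k": 40,
        "top_p": 0.99,
        "stream": False,
        "samplers": ["top_k", "top_p", "infill"],
        "cache_prompt": True,
        "t_max_prompt_ms": t_max_prompt_ms,
        "t_max_predict_ms": t_max_predict_ms,
    }


def send_infill(request, endpoint="http://127.0.0.1:8012/infill", timeout=5.0):
    """POST the request and return the parsed JSON response (blocking)."""
    data = json.dumps(request).encode("utf-8")
    req = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `send_infill(build_infill_request("def add(a, b):\n", "\n", "    return ", [], 4))` needs a llama-server instance listening on port 8012, as in the Usage section.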

The "input_prefix" and "input_suffix" are constructed by picking nearby lines around the cursor location:

let s:pos_x = col('.') - 1
let s:pos_y = line('.')

" the text of the line under the cursor, used below
let s:line_cur = getline('.')

let l:lines_prefix = getline(max([1, s:pos_y - g:llama_config.n_prefix]), s:pos_y - 1)
let l:lines_suffix = getline(s:pos_y + 1, min([line('$'), s:pos_y + g:llama_config.n_suffix]))

let l:prefix = ""
    \ . join(l:lines_prefix, "\n")

let s:line_cur_suffix = strpart(s:line_cur, s:pos_x)

let l:suffix = ""
    \ . s:line_cur_suffix
    \ . "\n"
    \ . join(l:lines_suffix, "\n")
    \ . "\n"

The "prompt" is set as the text to the left of the cursor on the current line:

let s:line_cur_prefix = strpart(s:line_cur, 0, s:pos_x)

let l:prompt = ""
    \ . s:line_cur_prefix
    \ . "\n"
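The line windowing in the snippets above can be restated compactly. This is a sketch of the selection step only: it uses 0-based row and column indices instead of Vim's 1-based line numbers, the function name is hypothetical, and it leaves the final joining with newlines to the Vim code above. Note that `col('.')` in Vim is a byte index while Python slices by character, so the two agree only for ASCII text.

```python
def local_context(lines, row, col, n_prefix=256, n_suffix=8):
    """Select the local FIM context around a 0-based cursor position.

    lines: the buffer as a list of strings (no trailing newlines)
    row:   0-based index of the cursor line
    col:   0-based character offset of the cursor within that line
    """
    line_cur = lines[row]
    return {
        # up to n_prefix lines before the cursor line
        "prefix_lines": lines[max(0, row - n_prefix):row],
        # text to the left of the cursor on the current line
        "cur_prefix": line_cur[:col],
        # text to the right of the cursor on the current line
        "cur_suffix": line_cur[col:],
        # up to n_suffix lines after the cursor line
        "suffix_lines": lines[row + 1:row + 1 + n_suffix],
    }


buf = ["import os", "", "def main():", "    print()", "", "main()"]
ctx = local_context(buf, row=3, col=10, n_prefix=2, n_suffix=1)
print(ctx)
```

Slicing clamps at the buffer boundaries, which plays the role of the `max([1, ...])` and `min([line('$'), ...])` guards in the Vim version.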

So far this is a very standard FIM completion using "local" context. Adding more context usually improves the quality of the completion, but it also increases the latency. As a data point, a 7B LLM running on a 76-core M2 Ultra GPU takes roughly 1 second to process 1000 tokens of context. Modern LLMs have training contexts of more than 32k tokens, so filling the entire context with local context and reprocessing it on each completion request is not feasible for local completion: it would be far too slow. For a good user experience, we aim for a latency of about 1 second or less per completion suggestion, while still utilizing the full context of the model. The sections below describe how we solve this problem.

Global context

In addition to the local context around the current cursor location, we can significantly improve the quality of the generated suggestions by including extra "global" context. This extra context can come either from other places in the same file that we are currently editing, or from other recently edited or opened files. There are a lot of different techniques for deciding which extra context specifically to include in the request that could be potentially relevant to the current completion task. In the llama.vim plugin, we use a simple approach:

We create a ring buffer of g:llama_config.ring_n_chunks chunks of g:llama_config.ring_chunk_size lines each
On every completion request we add 1 prefix and 1 suffix chunk, randomly picked relative to the cursor position but in a much larger scope (g:llama_config.ring_scope lines around the cursor)
Upon entering and leaving a Vim buffer, we pick a chunk around the last cursor position
Upon saving a file, we pick a chunk around the current cursor position
Upon yanking a text block, we add it as a chunk to the ring buffer
Upon trying to add a chunk, we evict old chunks that are very similar to the new one

" gather chunks upon yanking
autocmd TextYankPost * if v:event.operator ==# 'y' | call s:pick_chunk(v:event.regcontents, v:false, v:true) | endif

" gather chunks upon entering/leaving a buffer
autocmd BufEnter     * call timer_start(100, {-> s:pick_chunk(getline(max([1, line('.') - g:llama_config.ring_chunk_size/2]), min([line('.') + g:llama_config.ring_chunk_size/2, line('$')])), v:true, v:true)})
autocmd BufLeave     * call                      s:pick_chunk(getline(max([1, line('.') - g:llama_config.ring_chunk_size/2]), min([line('.') + g:llama_config.ring_chunk_size/2, line('$')])), v:true, v:true)

" gather chunk upon saving the file
autocmd BufWritePost * call s:pick_chunk(getline(max([1, line('.') - g:llama_config.ring_chunk_size/2]), min([line('.') + g:llama_config.ring_chunk_size/2, line('$')])), v:true, v:true)
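The ring buffer rules listed above can be sketched as a small class. This is an illustration under stated assumptions, not the plugin's code: the similarity measure (share of distinct lines in common) and the 0.9 threshold are assumptions, since the description only says that "very similar" chunks are evicted, and the class and method names are hypothetical.

```python
from collections import deque


def chunk_similarity(a, b):
    """Share of distinct lines that two chunks have in common (assumed metric)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


class ChunkRing:
    def __init__(self, n_chunks=64, chunk_size=64, sim_threshold=0.9):
        self.n_chunks = n_chunks            # like ring_n_chunks (0 disables)
        self.chunk_size = chunk_size        # like ring_chunk_size, in lines
        self.sim_threshold = sim_threshold  # assumed eviction threshold
        self.chunks = deque()

    def add(self, lines, filename):
        """Add a chunk, evicting near-duplicates first and the oldest chunk if full."""
        if self.n_chunks <= 0:
            return False
        lines = list(lines[:self.chunk_size])
        if not lines:
            return False
        # evict old chunks that are very similar to the new one
        self.chunks = deque(
            c for c in self.chunks
            if chunk_similarity(c["lines"], lines) <= self.sim_threshold
        )
        # when the ring is full, drop the oldest chunk
        if len(self.chunks) >= self.n_chunks:
            self.chunks.popleft()
        self.chunks.append({"filename": filename, "lines": lines})
        return True

    def as_input_extra(self):
        """Return the chunks as a list of filename/text records."""
        return [
            {"filename": c["filename"], "text": "\n".join(c["lines"])}
            for c in self.chunks
        ]


ring = ChunkRing(n_chunks=2, chunk_size=4)
ring.add(["a", "b", "c"], "one.txt")
ring.add(["d", "e"], "two.txt")
ring.add(["g"], "three.txt")  # ring is full, so the chunk from one.txt is dropped
print([c["filename"] for c in ring.chunks])
```

Appending new chunks at the end and dropping from the front keeps the order of the surviving chunks stable, which matters for the cache reuse described later.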

Upon each FIM completion request, we now send both the local and global contexts together. The latter is passed through the "input_extra" field of the /infill request in the following format:

[
    {
        "filename": string,
        "text": string
    },
    ... max of g:llama_config.ring_n_chunks ...
]

With this design, as we edit the files in our Neovim session, the overall context grows to a certain amount (determined by the ring buffer size) and usually contains up-to-date relevant information for the editing task at hand. The specific events and logic for gathering chunks can be easily modified and customized if needed.

Note that the entire state of the context is stored client-side and is sent to the server on each request.

Server-side processing

Upon receiving a request with N extra context chunks, the server constructs the following repo-level FIM prompt:

<|repo_name|>{repo_name}    " --\
<|file_sep|>{filename_0}    "   |
{text_0}                    "   |
<|file_sep|>{filename_1}    "   | extra (global) prompt
{text_1}                    "   |
...                         "   |
<|file_sep|>{filename_N-1}  "   |
{text_N-1}                  " --/
<|file_sep|>{filename}                                              " --\
<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}  " --/ local FIM prompt

This is based on the work in https://arxiv.org/pdf/2409.12186. Note that not all models are trained for this pattern, so it is recommended to use models that support it, such as Qwen2.5-Coder. This prompt format has important advantages that allow efficient context reuse, discussed in the following paragraphs.

In this FIM prompt, the components correspond to:

<|repo_name|>, <|file_sep|>, <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> - special tokens defined by the model
filename_i - the filename of the i'th chunk in the "input_extra" array
text_i - the text of the i'th chunk in the "input_extra" array
prefix, suffix, prompt - the input from the "input_prefix", "input_suffix", and "prompt" fields of the request
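To make the layout concrete, here is a string-level sketch of how such a prompt could be assembled. It is illustrative only: the real server works on token IDs and resolves the special tokens from the model vocabulary, the line breaks simply follow the diagram above, and the function name and the `repo_name` and `filename` arguments are placeholders.

```python
def build_fim_prompt(repo_name, filename, extra, prefix, suffix, prompt):
    """Assemble the repo-level FIM prompt as plain text, following the diagram.

    extra is a list of {"filename": ..., "text": ...} records ("input_extra").
    """
    parts = ["<|repo_name|>" + repo_name]
    # extra (global) context: one <|file_sep|> header per chunk, then its text
    for chunk in extra:
        parts.append("<|file_sep|>" + chunk["filename"])
        parts.append(chunk["text"])
    # local FIM prompt for the file being edited
    parts.append("<|file_sep|>" + filename)
    parts.append(
        "<|fim_prefix|>" + prefix
        + "<|fim_suffix|>" + suffix
        + "<|fim_middle|>" + prompt
    )
    return "\n".join(parts)


extra = [{"filename": "util.h", "text": "int add(int a, int b);"}]
print(build_fim_prompt("demo", "main.c", extra,
                       "int main() {\n", "\n}\n", "    return "))
```

Because the extra chunks come first and the local part comes last, a change near the cursor only affects the tail of the prompt.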

The server processes the constructed prompt and then generates a maximum number of tokens that represent the FIM completion. The generation can be terminated early by several different conditions:

An end-of-generation (EOG) token is sampled
A maximum time limit optionally specified by the client is exceeded
An indentation constraint optionally specified by the client is not satisfied

The generated text is sent back to the client for display as a suggestion via virtual text overlay.

KV cache reuse : global prefix

The first optimization technique for improving long-context performance is to simply reuse the computed KV cache common prefix from the previous request. This allows us to very efficiently append new chunks of extra context, in-between the <|fim_prefix|> token and the existing chunks in the extra context:

<|repo_name|>{repo_name}    " --\
<|file_sep|>{filename_0}    "   |
{text_0}                    "   |
<|file_sep|>{filename_1}    "   | extra context, cached and reused (no processing)
{text_1}                    "   |
...                         "   |
<|file_sep|>{filename_N-1}  "   |
{text_N-1}                  " --/
<|file_sep|>{filename_N}    " --\ new chunk,
{text_N}                    " --/ processed with the new request
<|file_sep|>{filename}
<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}

Reusing the KV cache prefix is generally supported by llama-server and simply requires setting the "cache_prompt": true flag in the completion requests. With this option, each new completion request will reuse the largest common prefix of tokens between the old and the new request. This saves a large part of the prompt processing in situations where the extra context does not change, or was extended by appending a new chunk at the end.
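The "largest common prefix" idea is easy to illustrate on plain token lists. This is a minimal sketch with a hypothetical function name; the token IDs are made up and the server's own bookkeeping is more involved.

```python
def common_prefix_len(old_tokens, new_tokens):
    """Number of leading tokens shared by the previous and the new request."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# Previous request: two extra chunks followed by the local FIM part.
old = [1, 2, 3, 4, 90, 91]
# New request: same two chunks, one appended chunk, then the local FIM part.
new = [1, 2, 3, 4, 5, 6, 90, 91]

n_reused = common_prefix_len(old, new)
print(n_reused, "tokens reused,", len(new) - n_reused, "tokens to process")
```

Only the tokens after the shared prefix have to be processed, which is why appending chunks at the end of the extra context is cheap.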

KV cache reuse : context shift

The previous optimization is only useful up to g:llama_config.ring_n_chunks chunks of extra context. When the ring buffer becomes full, the first chunk is evicted, which "shifts" all following chunks into a new position relative to the start of the prompt:

<|repo_name|>{repo_name}  " --\
<|file_sep|>{filename_1}  "   |
{text_1}                  "   |
<|file_sep|>{filename_2}  "   | chunk 0 has been evicted
{text_2}                  "   | the rest of the chunks have 'moved' one step towards the front
...                       "   |
<|file_sep|>{filename_N}  "   |
{text_N}                  " --/
<|file_sep|>{filename}
<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}

Because of this relative shift of D0 tokens (the length of the evicted chunk 0), it is no longer possible to directly reuse the KV cache of the extra context. This is because the position of each token is encoded inside the KV cache data (e.g. via the RoPE operator) and the tokens are no longer in those positions (for more info, see #71 (comment)).

However, quite early in the project (#2060), we realized that the cache in this case can actually be efficiently reused by "updating" the encoded positions in the K cache. This follows from the observation that the RoPE operator is "additive". Roughly speaking, applying a RoPE with position p1 = p0 + d is equivalent to applying:

RoPE at position p0
RoPE at position d on the already RoPE'd data in the previous step

This provides a very cheap way to "move" the remaining chunks in the ring buffer forward, towards the beginning of the context: simply apply RoPE with position -D0 to all tokens in the K cache that we want to reuse. Doing so, we can again save the computation of a large portion of the extra prompt.
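The additivity argument can be checked numerically on a single pair of features. The sketch below treats one 2-D feature pair as a complex number and rotates it by an angle proportional to the position. The single frequency `theta` and the function name are simplifying assumptions: actual RoPE uses a different frequency for each pair of dimensions, but the same identity holds pair by pair.

```python
import cmath


def rope(x: complex, pos: float, theta: float = 0.01) -> complex:
    """Rotate one 2-D feature pair (stored as a complex number) by pos * theta."""
    return x * cmath.exp(1j * pos * theta)


x = complex(0.3, -1.2)  # an un-rotated key feature pair
p0, d0 = 100, 37        # original position and the size of the shift

k_cached = rope(x, p0)           # what the K cache holds for position p0
k_shifted = rope(k_cached, -d0)  # "move" the cached entry forward by d0 tokens
k_direct = rope(x, p0 - d0)      # rotation applied directly at the new position

assert abs(k_shifted - k_direct) < 1e-12
print(k_shifted, k_direct)
```

The two results agree up to floating-point error, which is why shifting cached keys avoids recomputing them from scratch.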

Note that the described context shifting method is not mathematically identical to recomputing the entire prompt from scratch. It can be easily seen that the embeddings at each token position are "entangled" with all the embeddings before that position, so simply "shifting" the K cache positions will not produce the exact same numbers as full reprocessing. Regardless of this, the context shifting feature has been applied and used by the local llama.cpp community for more than a year now and empirical results indicate that it is very effective and does not seem to degrade the quality of the output in a significant way. The cache reuse techniques described here heavily rely on this "trick".

The described context shifting strategy can also be applied when the evicted chunk is somewhere in the middle of the ring buffer or even if there are multiple evicted chunks at a time. A detailed description of the implementation can be found in #5793 and #9866.

This context reuse strategy requires llama-server to be started with the --cache-reuse N command-line argument. The N argument is the minimum size of the chunks (in number of tokens) that we will accept and shift in the KV cache for reuse purposes. The logic is that we don't want to reuse very small bits (e.g. individual tokens) from random places of the old context; instead we are interested in reusing large contiguous blocks. Note that the implementation preserves the order of the reused chunks, so a shifted chunk will never move over another chunk (i.e. reused chunks always appear in the same relative order as when they were originally computed).
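The matching logic can be sketched on plain token lists. This is a simplified illustration of the idea, not the server implementation: it only plans which blocks of the old cache could be kept, and all names are hypothetical. After the common prefix, it scans the old cache for runs of at least `n_cache_reuse` tokens that match the new prompt at its current position, which also keeps reused blocks in their original order.

```python
def plan_cache_reuse(cache_tokens, prompt_tokens, n_cache_reuse):
    """Return (n_prefix, segments) describing what can be reused.

    Each segment is (old_start, new_start, length); the implied position
    shift for that block is new_start - old_start.
    """
    if n_cache_reuse < 1:
        raise ValueError("n_cache_reuse must be at least 1")

    # 1) plain common prefix, reused without any shift
    n_prefix = 0
    while (n_prefix < len(cache_tokens) and n_prefix < len(prompt_tokens)
           and cache_tokens[n_prefix] == prompt_tokens[n_prefix]):
        n_prefix += 1

    # 2) scan the rest of the old cache for long enough matching runs
    segments = []
    head_c = n_prefix  # read position in the old cache
    head_p = n_prefix  # write position in the new prompt
    while head_c < len(cache_tokens) and head_p < len(prompt_tokens):
        n_match = 0
        while (head_c + n_match < len(cache_tokens)
               and head_p + n_match < len(prompt_tokens)
               and cache_tokens[head_c + n_match] == prompt_tokens[head_p + n_match]):
            n_match += 1
        if n_match >= n_cache_reuse:
            segments.append((head_c, head_p, n_match))
            head_c += n_match
            head_p += n_match
        else:
            head_c += 1  # run too short: skip this cached token
    return n_prefix, segments


# Chunk 0 (tokens 0..3) was evicted; chunks 1 and 2 moved to the front.
old = [0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23]
new = [10, 11, 12, 13, 20, 21, 22, 23, 30, 31]
print(plan_cache_reuse(old, new, n_cache_reuse=3))  # (0, [(4, 0, 8)])
```

In the example, eight cached tokens are kept with a shift of -4 and only the last two tokens of the new prompt need processing. If the new prompt starts with a token that was never cached, the scan finds nothing to reuse from that point on.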

Applying these two techniques, we can now efficiently update the extra context of our FIM requests by adding and evicting chunks any way the client decides. Existing chunks will not be recomputed and the server will process only new chunks that were not present in the previous request. The llama.vim plugin periodically updates the extra context ring buffer on the client side and sends the information to the server whenever it detects inactivity (i.e. the cursor hasn't moved for a certain period of time or we are currently in Normal mode). This makes the processing of the extra global context almost entirely seamless for the user, mitigating a huge portion of the latency in the naive approach.

KV cache reuse : local prefix

Let's now focus again on the local context part of the request and explain one additional cache reuse strategy that helps to further reduce the completion latency in some typical cases. All of the following examples will assume the PSM (Prefix-Suffix-Middle) FIM pattern. Similar analysis can be made for the SPM (Suffix-Prefix-Middle) pattern which is supported via the --spm-infill command line argument of llama-server.

" the PSM FIM pattern
<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}

Assume also that we are in the middle of editing a line of text and the client has already received a suggestion from the server:

" this is the text file that we are editing:

{prefix lines outside the scope of the local FIM context}
{prefix_line_1}
{prefix_line_2}
{prefix_line_3}
...
{prefix_line_P}
{cur_line_prefix}█{cur_line_suffix}  " --> currently have a completion suggestion for this position
{suffix_line_1}
{suffix_line_2}
...
{suffix_line_S}
{suffix lines outside the scope of the local FIM context}

Here is what the local FIM prompt looks like in more detail:

<|fim_prefix|>{prefix_line_1}
{prefix_line_2}
{prefix_line_3}
...
{prefix_line_P}
<|fim_suffix|>{cur_line_suffix}
{suffix_line_1}
{suffix_line_2}
...
{suffix_line_S}
<|fim_middle|>{cur_line_prefix}{generated_suggestion}

From here, there are 3 typical follow-up completion requests that occur in most situations:

Same line FIM: the cursor moves left or right on the same line
Next line FIM: the cursor moves to the next line
Prev line FIM: the cursor moves to the previous line

Same line FIM

For clarity, assume the cursor moved {dx} tokens to the right (moving to the left follows the same logic). The new FIM prompt would look like this:

<|fim_prefix|>{prefix_line_1}         " --\
{prefix_line_2}                       "   |
{prefix_line_3}                       "   | the cache is reused from the previous request
...                                   "   |
{prefix_line_P}                       " --/
<|fim_suffix|>{cur_line_suffix - dx}  " --\
{suffix_line_1}                       "   |
{suffix_line_2}                       "   |
...                                   "   | computed in the new request
{suffix_line_S}                       "   |
<|fim_middle|>{cur_line_prefix + dx}  " --/

In this case the entire local prefix will be reused since its contents and position are the same as in the previous request. This means that attempting FIM anywhere on the same line will be quite cheap and will involve recomputing only the suffix tokens.

Next line FIM

In this case, the new FIM prompt after moving to the next line looks like this:

<|fim_prefix|>{prefix_line_2}    " --\
{prefix_line_3}                  "   |
{prefix_line_4}                  "   | the cache is reused from previous request via context shift
...                              "   |
{prefix_line_P}                  " --/
{prefix_line_P+1}                " --> this is a new line added to the FIM prefix
<|fim_suffix|>{new_line_suffix}  " --\
{suffix_line_2}                  "   |
{suffix_line_3}                  "   |
...                              "   | computed in the new request
{suffix_line_S+1}                "   |
<|fim_middle|>{new_line_prefix}  " --/

The old {prefix_line_1} line is now out of the FIM prefix scope and a new {prefix_line_P+1} line is within the FIM prefix scope. We can reuse the cache for lines [2, P] via context shifting, as explained earlier. So in this case, we compute only the new prefix line {prefix_line_P+1}, together with the new FIM suffix.

Prev line FIM

This case is the most cache unfriendly one. Moving a line up, the new FIM prompt will look like this:

<|fim_prefix|>{prefix_line_0}    " --> this line is completely new, so it breaks the cache reuse sequence very early
{prefix_line_1}
{prefix_line_2}
...
{prefix_line_P-1}
<|fim_suffix|>{new_line_suffix}
{suffix_line_0}
{suffix_line_1}
...
{suffix_line_S-1}
<|fim_middle|>{new_line_prefix}

Because we haven't computed the {prefix_line_0} line in the previous request, the cache reuse logic has to stop at the very start of the local FIM prompt. Therefore in this case we don't reuse any of the previous local FIM cache and we need to compute the entire local FIM prompt.

Expected performance

On each FIM request, the server takes a maximum of 1 full batch of tokens from the provided local context. The prefix and suffix tokens are split in a ratio of 3:1:

llama.cpp/examples/server/server.cpp

Lines 2055 to 2062 in 32927e6

// for now pick FIM context to fit in a batch (ratio prefix:suffix = 3:1, TODO: configurable?)
const int n_suffix_take = std::min<int>(tokens_suffix.size(),   (n_batch/4));
const int n_prefix_take = std::min<int>(tokens_prefix.size(), 3*(n_batch/4) - 3);

// fill the rest of the context with extra chunks
const int n_extra_take = std::min<int>(std::max<int>(0, slot.n_ctx - (n_batch) - 2*slot.n_predict), slot.extra_tokens.size());
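The same arithmetic can be restated in Python to try out concrete numbers. This mirrors the three expressions in the C++ excerpt above; the function and argument names are hypothetical.

```python
def fim_token_budget(n_batch, n_ctx, n_predict,
                     n_prefix_tokens, n_suffix_tokens, n_extra_tokens):
    """Return (prefix, suffix, extra) token counts taken for one FIM request."""
    # local context fits in one batch, split prefix:suffix = 3:1
    n_suffix_take = min(n_suffix_tokens, n_batch // 4)
    n_prefix_take = min(n_prefix_tokens, 3 * (n_batch // 4) - 3)
    # the rest of the context is filled with extra chunks
    n_extra_take = min(max(0, n_ctx - n_batch - 2 * n_predict), n_extra_tokens)
    return n_prefix_take, n_suffix_take, n_extra_take


# -b 1024, 32k context, n_predict = 64, with plenty of tokens available
print(fim_token_budget(1024, 32768, 64, 5000, 5000, 100000))
```

With these inputs the result is 765 prefix tokens, 256 suffix tokens and 31616 tokens of extra context.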

This means that for new FIM requests, there will be at most --batch tokens to process, while in most cases the number of processed tokens will be much lower due to the cache reuse optimizations described above. Knowing this, we can estimate the typical performance of FIM requests using the llama-batched-bench tool. Here is some analysis on M1 Pro and M2 Ultra using Qwen2.5-Coder 1.5B and 7B models:

M1 Pro

./llama-batched-bench -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -c 32768 -b 1024 -npp 1024,2048,15360,16384,30720,31744 -ntg 32 -npl 1 -fa

main: n_kv_max = 32768, n_batch = 1024, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

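| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 32 | 1 | 1056 | 0.864 | 1184.90 | 0.446 | 71.68 | 1.311 | 805.71 |
| 2048 | 32 | 1 | 2080 | 1.757 | 1165.38 | 0.461 | 69.35 | 2.219 | 937.45 |
| 15360 | 32 | 1 | 15392 | 22.336 | 687.68 | 0.675 | 47.42 | 23.011 | 668.90 |
| 16384 | 32 | 1 | 16416 | 24.443 | 670.31 | 0.691 | 46.30 | 25.134 | 653.15 |
| 30720 | 32 | 1 | 30752 | 62.397 | 492.34 | 0.931 | 34.37 | 63.328 | 485.60 |
| 31744 | 32 | 1 | 31776 | 65.603 | 483.88 | 0.948 | 33.75 | 66.551 | 477.47 |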

M2 Ultra

./llama-batched-bench -m models/qwen2.5-7b-coder/ggml-model-q8_0.gguf -c 32768 -b 1024 -npp 1024,2048,15360,16384,30720,31744 -ntg 32 -npl 1 -fa

main: n_kv_max = 32768, n_batch = 1024, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

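| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 32 | 1 | 1056 | 0.815 | 1257.06 | 0.487 | 65.73 | 1.301 | 811.40 |
| 2048 | 32 | 1 | 2080 | 1.597 | 1282.41 | 0.505 | 63.33 | 2.102 | 989.40 |
| 15360 | 32 | 1 | 15392 | 16.988 | 904.18 | 0.746 | 42.89 | 17.734 | 867.94 |
| 16384 | 32 | 1 | 16416 | 18.456 | 887.76 | 0.774 | 41.35 | 19.229 | 853.69 |
| 30720 | 32 | 1 | 30752 | 43.314 | 709.24 | 1.026 | 31.20 | 44.340 | 693.56 |
| 31744 | 32 | 1 | 31776 | 45.426 | 698.80 | 1.045 | 30.63 | 46.471 | 683.78 |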

From these numbers we can estimate the prompt processing and text generation speeds, as well as the expected FIM time at different levels of context occupation. Here we assume that the FIM request requires processing 1/4 of --batch tokens as prompt and generating 32 tokens as a suggestion:

M1 Pro, LLM 1.5B, Q8_0:

- empty context: p: ~1150 t/s | g: ~70 t/s
- half  context: p:  ~480 t/s | g: ~47 t/s
- full  context: p:  ~320 t/s | g: ~34 t/s

expected FIM time in ms:

| batch | empty | half | full |
|---|---|---|---|
| 256 | 512.80 | 814.18 | 1141.18 |
| 512 | 568.45 | 947.52 | 1341.18 |
| 1024 | 679.75 | 1214.18 | 1741.18 |

M2 Ultra, LLM 7B, Q8_0:

- empty context: p: ~1300 t/s | g: ~64 t/s
- half  context: p:  ~700 t/s | g: ~42 t/s
- full  context: p:  ~480 t/s | g: ~31 t/s

expected FIM time in ms:

| batch | empty | half | full |
|---|---|---|---|
| 256 | 549.23 | 853.33 | 1165.59 |
| 512 | 598.46 | 944.76 | 1298.92 |
| 1024 | 696.92 | 1127.62 | 1565.59 |
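The expected FIM times above follow from the stated assumption (1/4 of --batch tokens processed as prompt, 32 tokens generated) and the estimated speeds. A short sketch of that calculation, with a hypothetical function name:

```python
def expected_fim_ms(n_batch, pp_tps, tg_tps, n_gen=32):
    """Estimated time of one FIM request in milliseconds.

    pp_tps: prompt processing speed in tokens per second
    tg_tps: text generation speed in tokens per second
    """
    n_prompt = n_batch / 4  # assumed share of the batch that is newly processed
    return 1000.0 * (n_prompt / pp_tps + n_gen / tg_tps)


# M1 Pro, 1.5B Q8_0: (prompt t/s, generation t/s) at each context level
speeds = {"empty": (1150, 70), "half": (480, 47), "full": (320, 34)}
for n_batch in (256, 512, 1024):
    row = [f"{expected_fim_ms(n_batch, p, g):8.2f}" for p, g in speeds.values()]
    print(n_batch, *row)
```

For example, batch 256 with an empty context gives 64 / 1150 + 32 / 70 ≈ 0.5128 s, matching the 512.80 ms entry. The generation term dominates at small batch sizes, so lowering n_predict has the largest effect on latency there.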

Examples

Using `llama.vim` on M1 Pro (2021) with `Qwen2.5-Coder 1.5B Q8_0`:

The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is 15186 tokens and the maximum is 32768. There are 30 chunks in the ring buffer with extra context (out of 64). So far, 1 chunk has been evicted in the current session and there are 0 chunks in queue. The newly computed prompt tokens for this request were 260 and the generated tokens were 25. It took 1245 ms to generate this suggestion after entering the letter c on the current line.

Using `llama.vim` on M2 Ultra with `Qwen2.5-Coder 7B Q8_0`:

llama.vim-0-lq.mp4

Demonstrates that the global context is accumulated and maintained across different files and showcases the overall latency when working in a large codebase.

TODO

Future ideas

Cleverer chunk collection (reranking?)
Display (and edit?) the extra context in Vim
https://github.com/junegunn/vim-plug support
Cache chunk tokenizations server-side and reuse
Cache completions client-side and reuse on auto FIM
Add Vim support
VS Code plugin


ggerganov · 2024-10-15T14:48:45Z

This plugin (or script?) was quite fun to implement! Will be merging after a few days of testing. If anyone gives this a try, would be happy to hear any feedback. This is running pretty smooth on M2 Ultra with Qwen2.5 7B Q8, though I think it should work reasonably well even on lower end hardware.

ggml-ci

Green-Sky · 2024-10-21T17:29:52Z

You explicitly state neovim, but is there anything you use that prevents the use of vim?

ggerganov · 2024-10-21T17:33:38Z

As far as I know, the async job and virtual text APIs are a bit different in Vim. Though it's probably quite easy to adapt the script to work both with Vim and Neovim.


m18coppola · 2024-10-22T07:31:28Z

@Green-Sky I have a treat for you in #9995

ggerganov · 2024-10-28T12:59:37Z

The llama.vim plugin is now available as a standalone repo at https://github.com/ggml-org/llama.vim. This makes it possible to install the plugin through popular plugin managers.

Further development will continue in the https://github.com/ggml-org/llama.vim repo.


ggerganov force-pushed the llama.vim branch from 949c928 to 25f3b4d Compare October 8, 2024 11:30

github-actions bot added examples server labels Oct 8, 2024

ggerganov force-pushed the llama.vim branch 2 times, most recently from 73fa77d to 391ea30 Compare October 9, 2024 07:28

ggerganov mentioned this pull request Oct 9, 2024

llama : improve infill support and special token detection #9798

Merged

4 tasks

ggerganov added a commit that referenced this pull request Oct 9, 2024

examples : remove llama.vim

3dc48fe

An updated version will be added in #9787

ggerganov force-pushed the llama.vim branch from 391ea30 to 84a5061 Compare October 9, 2024 08:02

ggerganov changed the base branch from master to gg/infill-0 October 9, 2024 08:05

ggerganov force-pushed the llama.vim branch from 776c885 to 59777c6 Compare October 10, 2024 10:38

ggerganov force-pushed the gg/infill-0 branch from 32da4a2 to 3681540 Compare October 11, 2024 07:07

ggerganov force-pushed the llama.vim branch 2 times, most recently from 76e5d87 to cefd4ac Compare October 11, 2024 09:59

ggerganov changed the title ~~llama : improve infill support + neovim plugin~~ llama.vim : plugin for Neovim Oct 11, 2024

ggerganov marked this pull request as ready for review October 11, 2024 10:32

Base automatically changed from gg/infill-0 to master October 12, 2024 05:21

ggerganov force-pushed the llama.vim branch 3 times, most recently from c7d8904 to d2c559a Compare October 13, 2024 10:43

ggerganov changed the base branch from master to gg/server-reuse-context October 13, 2024 10:45

Base automatically changed from gg/server-reuse-context to master October 13, 2024 15:52

ggerganov force-pushed the llama.vim branch 3 times, most recently from 5155b68 to acf6d19 Compare October 15, 2024 14:19

ggerganov force-pushed the llama.vim branch from a48830d to 6e26fe5 Compare October 18, 2024 12:54

ggerganov added 4 commits October 21, 2024 11:00

llama : add infill sampler

5aaf247

llama.vim : neovim plugin

0566c69

llama.vim : fix suffix construction + fix virt text offset

0c649c8

llama.vim : handle space

07e7dd4

ggerganov added 6 commits October 21, 2024 11:00

llama.vim : final touches

4583aef

ggml-ci

llama.vim : fix repetitions of existing text

d1b8b21

llama.vim : complete only whithin the local scope [no ci]

1600d84

llama.vim : display ring capacity [no ci]

6bb6e6d

llama.vim : fix large chunk accept + comments [no ci]

fe78c39

llama.vim : minor [no ci]

b8efb07

ggerganov force-pushed the llama.vim branch from 01f3980 to b8efb07 Compare October 21, 2024 08:02

ggerganov added 2 commits October 21, 2024 12:32

llama.vim : remove on-hold code + fixes [no ci]

32927e6

llama.vim : minor [no ci]

8fb5154

ggerganov merged commit dbd5f2f into master Oct 21, 2024
1 check passed

ggerganov deleted the llama.vim branch October 21, 2024 17:25

ggerganov added a commit that referenced this pull request Oct 21, 2024

llama.vim : move info to the right of screen [no ci] (#9787)

e01c67a

'eol' messes up the rendering with nvim v0.10.2 for some reason

ggerganov added a commit that referenced this pull request Oct 21, 2024

llama.vim : fix info text display [no ci] (#9787)

e94a138

ggerganov mentioned this pull request Oct 22, 2024

Expand AI Code Completion beyond Copilot and Supermaven zed-industries/zed#18490

Open

1 task



	" general parameters:
	"
	" endpoint: llama.cpp server endpoint
	" n_prefix: number of lines before the cursor location to include in the prefix
	" n_suffix: number of lines after the cursor location to include in the suffix
	" n_predict: max number of tokens to predict
	" t_max_prompt_ms: max allotted time for the prompt generation (TODO: not yet supported)
	" t_max_predict_ms: max allotted time for the prediction
	" show_info: show extra info about the inference (0 - disabled, 1 - statusline, 2 - inline)
	" auto_fim: trigger FIM completion automatically on cursor movement
	" max_line_suffix: do not auto-trigger FIM completion if there are more than this number of characters to the right of the cursor
	"
	" ring buffer of chunks, accumulated with time upon:
	"
	" - completion request
	" - yank
	" - entering a buffer
	" - leaving a buffer
	" - writing a file
	"
	" parameters for the ring-buffer with extra context:
	"
	" ring_n_chunks: max number of chunks to pass as extra context to the server (0 to disable)
	" ring_chunk_size: max size of the chunks (in number of lines)
	" note: adjust these numbers so that you don't overrun your context
	" at ring_n_chunks = 64 and ring_chunk_size = 64 you need ~32k context
	" ring_scope: the range around the cursor position (in number of lines) for gathering chunks after FIM
	" ring_update_ms: how often to process queued chunks in normal mode
	"
	let s:default_config = {
	\ 'endpoint': 'http://127.0.0.1:8012/infill',
	\ 'n_prefix': 256,
	\ 'n_suffix': 8,
	\ 'n_predict': 64,
	\ 't_max_prompt_ms': 500,
	\ 't_max_predict_ms': 200,
	\ 'show_info': 2,
	\ 'auto_fim': v:true,
	\ 'max_line_suffix': 8,
	\ 'ring_n_chunks': 64,
	\ 'ring_chunk_size': 64,
	\ 'ring_scope': 1024,
	\ 'ring_update_ms': 1000,
	\ }
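
The note on `ring_n_chunks` and `ring_chunk_size` can be checked with a little arithmetic. The sketch below is not part of the plugin (which is Vimscript); C++ is used only to match the server-side snippets in this PR. The function name `ring_context_tokens` and the constant `k_assumed_tokens_per_line` are hypothetical, and the value of 8 tokens per line is an assumption back-derived from the note itself (64 chunks of 64 lines needing roughly 32k tokens), not a measured figure.

```cpp
#include <cassert>

// Assumed average number of tokens per line of text. This is back-derived from
// the note above (64 * 64 lines ~ 32k tokens), not measured; real files vary.
constexpr int k_assumed_tokens_per_line = 8;

// Rough upper bound on the tokens the ring buffer can contribute when every
// chunk is full. Hypothetical helper for illustration only.
inline long ring_context_tokens(int ring_n_chunks, int ring_chunk_size,
                                int tokens_per_line = k_assumed_tokens_per_line) {
    assert(ring_n_chunks >= 0 && ring_chunk_size >= 0 && tokens_per_line >= 0);
    return static_cast<long>(ring_n_chunks) * ring_chunk_size * tokens_per_line;
}
```

Under that assumption, the defaults give `ring_context_tokens(64, 64) == 32768`, which matches the "~32k context" in the note. The local prefix, suffix, and predicted tokens need room on top of this, so halving either ring parameter is a reasonable first step when the server runs with a smaller context.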


	// for now pick FIM context to fit in a batch (ratio prefix:suffix = 3:1, TODO: configurable?)
	const int n_suffix_take = std::min<int>(tokens_suffix.size(), (n_batch/4));
	const int n_prefix_take = std::min<int>(tokens_prefix.size(), 3*(n_batch/4) - 3);

	// fill the rest of the context with extra chunks
	const int n_extra_take = std::min<int>(std::max<int>(0, slot.n_ctx - (n_batch) - 2*slot.n_predict), slot.extra_tokens.size());
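
To see how these three expressions divide the context, here is a minimal standalone sketch that replaces the token vectors and `slot` fields with plain integers. The struct `fim_budget` and the function `compute_fim_budget` are hypothetical names introduced for illustration; the server computes these values inline as shown above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the three values the server keeps as locals.
struct fim_budget {
    int n_suffix_take; // suffix tokens kept (at most 1/4 of the batch)
    int n_prefix_take; // prefix tokens kept (at most 3/4 of the batch, minus 3)
    int n_extra_take;  // extra-chunk tokens that fit in the remaining context
};

// Same arithmetic as the snippet above, with sizes passed in as integers.
inline fim_budget compute_fim_budget(int n_prefix, int n_suffix, int n_extra,
                                     int n_batch, int n_ctx, int n_predict) {
    assert(n_batch >= 4); // keeps the prefix budget non-negative
    fim_budget b;
    b.n_suffix_take = std::min(n_suffix, n_batch/4);
    b.n_prefix_take = std::min(n_prefix, 3*(n_batch/4) - 3);
    b.n_extra_take  = std::min(std::max(0, n_ctx - n_batch - 2*n_predict), n_extra);
    return b;
}
```

For example, with `n_batch = 1024`, `n_ctx = 32768` and `n_predict = 64`, a 5000-token prefix is cut to 765 tokens, a 100-token suffix is kept whole, and up to 31616 tokens remain for extra chunks. When `n_ctx` equals `n_batch`, nothing is left for extra chunks.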

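| batch | empty | half | full |
| ---: | ---: | ---: | ---: |
| 256 | 512.80 | 814.18 | 1141.18 |
| 512 | 568.45 | 947.52 | 1341.18 |
| 1024 | 679.75 | 1214.18 | 1741.18 |

| batch | empty | half | full |
| ---: | ---: | ---: | ---: |
| 256 | 549.23 | 853.33 | 1165.59 |
| 512 | 598.46 | 944.76 | 1298.92 |
| 1024 | 696.92 | 1127.62 | 1565.59 |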


