
feat: implement input truncation for llama-cpp-bindings #416

Merged
wsxiaoys merged 5 commits into main from fix-max-input-length on Sep 8, 2023

Conversation

wsxiaoys (Member) commented on Sep 8, 2023

No description provided.

wsxiaoys enabled auto-merge (squash) on September 8, 2023 at 10:03
wsxiaoys (Member, Author) commented on Sep 8, 2023

I have implemented input truncation by limiting the prompt to a maximum of 1024 tokens. However, I encountered the following error when requesting a large input (~2000 input tokens, which should be clipped to 1024 tokens):

2023-09-08T10:19:17.879429Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:156: Listening at 0.0.0.0:8080
llama_tokenize_with_model: too many tokens
ggml_allocr_alloc: not enough space in the buffer (needed 301989888, largest block available 75497472)
GGML_ASSERT: /Users/meng/Projects/tabby/crates/llama-cpp-bindings/llama.cpp/ggml-alloc.c:144: !"not enough space in the buffer"
[1]    36775 abort      cargo run serve --model /Users/meng/Projects/models/CodeLlama-7B --device

cc @ggerganov

Is this behavior expected? Is there a way to adjust the buffer size? (Upon a quick investigation into ggml-alloc.c, I didn't find an obvious workaround.)
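For reference, a minimal sketch of the truncation described above, assuming the prompt has already been tokenized. The 1024-token limit mirrors this comment; the function name and everything else are illustrative, not the actual llama-cpp-bindings code. (For scale, the allocator error above asks for 301989888 bytes ≈ 288 MiB while the largest free block is 75497472 bytes ≈ 72 MiB.)

```rust
/// Keep at most `max_input_length` tokens of an already-tokenized prompt.
/// For completion-style prompts the most recent context matters most, so the
/// suffix is kept and the oldest tokens are dropped.
/// (Illustrative sketch only, not the actual code in this PR.)
fn truncate_tokens(tokens: &[u32], max_input_length: usize) -> &[u32] {
    if tokens.len() <= max_input_length {
        tokens
    } else {
        &tokens[tokens.len() - max_input_length..]
    }
}

fn main() {
    // ~2000 input tokens clipped to the last 1024, as described above.
    let tokens: Vec<u32> = (0..2000).collect();
    let clipped = truncate_tokens(&tokens, 1024);
    assert_eq!(clipped.len(), 1024);
    assert_eq!(clipped[0], 2000 - 1024);
}
```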

ggerganov commented on Sep 8, 2023

There are 2 solutions:

wsxiaoys (Member, Author) commented on Sep 8, 2023

Thanks for the information. I implemented it with the second approach, and it works perfectly:

2023-09-08T11:03:54.956700Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:156: Listening at 0.0.0.0:8080
llama_tokenize_with_model: too many tokens

llama_print_timings:        load time =  1299.84 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  5759.81 ms /  1918 tokens (    3.00 ms per token,   333.00 tokens per second)
llama_print_timings:        eval time =  3922.20 ms /   127 runs   (   30.88 ms per token,    32.38 tokens per second)
llama_print_timings:       total time =  9723.36 ms
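The two solutions ggerganov listed aren't quoted in this thread, so purely as a generic illustration of one common way to stay within the allocator's buffer, here is a hedged sketch of evaluating a long prompt in fixed-size chunks rather than as one large batch. The `n_batch` value and the `eval_batch` callback are hypothetical placeholders, not the llama.cpp or llama-cpp-bindings API, and this is not necessarily the exact fix merged here.

```rust
/// Feed a long prompt to the model in chunks of at most `n_batch` tokens, so
/// no single evaluation needs scratch space sized for the whole prompt.
/// `eval_batch` stands in for whatever the backend exposes; it receives each
/// chunk plus the number of tokens already evaluated (`n_past`).
/// (Generic illustration, not necessarily the fix merged in this PR.)
fn eval_prompt_in_chunks<F>(tokens: &[u32], n_batch: usize, mut eval_batch: F)
where
    F: FnMut(&[u32], usize),
{
    let mut n_past = 0;
    for chunk in tokens.chunks(n_batch) {
        eval_batch(chunk, n_past);
        n_past += chunk.len();
    }
}

fn main() {
    // 1918 prompt tokens (as in the timings above), fed in chunks of 512.
    let tokens: Vec<u32> = (0..1918).collect();
    eval_prompt_in_chunks(&tokens, 512, |chunk, n_past| {
        println!("evaluating {} tokens at n_past = {}", chunk.len(), n_past);
    });
}
```

With chunking, each evaluation's working buffers scale with the chunk size rather than with the full prompt length, which is generally how llama.cpp-style backends keep memory bounded for long prompts.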

wsxiaoys merged commit ad3b974 into main on Sep 8, 2023
4 checks passed
wsxiaoys deleted the fix-max-input-length branch on September 8, 2023 at 16:20