
feat: implement input truncation for llama-cpp-bindings #416

Merged
wsxiaoys merged 5 commits into main from fix-max-input-length on Sep 8, 2023

Conversation

wsxiaoys (Member) commented on Sep 8, 2023

No description provided.

wsxiaoys enabled auto-merge (squash) on September 8, 2023 at 10:03
wsxiaoys (Member, Author) commented on Sep 8, 2023

I have implemented input truncation by limiting the prompt to a maximum of 1024 tokens. However, I encountered the following error when requesting a large input (~2000 input tokens, which should be clipped to 1024 tokens):

2023-09-08T10:19:17.879429Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:156: Listening at 0.0.0.0:8080
llama_tokenize_with_model: too many tokens
ggml_allocr_alloc: not enough space in the buffer (needed 301989888, largest block available 75497472)
GGML_ASSERT: /Users/meng/Projects/tabby/crates/llama-cpp-bindings/llama.cpp/ggml-alloc.c:144: !"not enough space in the buffer"
[1]    36775 abort      cargo run serve --model /Users/meng/Projects/models/CodeLlama-7B --device

cc @ggerganov

Is this behavior expected? Is there a way to adjust the buffer size? (Upon a quick investigation into ggml-alloc.c, I didn't find an obvious workaround.)
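For reference, a minimal sketch of the truncation described above, assuming the prompt has already been tokenized. The 1024-token limit mirrors this comment; the function name and everything else are illustrative, not the actual llama-cpp-bindings code. (For scale, the allocator error above asks for 301989888 bytes ≈ 288 MiB while the largest free block is 75497472 bytes ≈ 72 MiB.)

```rust
/// Keep at most `max_input_length` tokens of an already-tokenized prompt.
/// For completion-style prompts the most recent context matters most, so the
/// suffix is kept and the oldest tokens are dropped.
/// (Illustrative sketch only, not the actual code in this PR.)
fn truncate_tokens(tokens: &[u32], max_input_length: usize) -> &[u32] {
    if tokens.len() <= max_input_length {
        tokens
    } else {
        &tokens[tokens.len() - max_input_length..]
    }
}

fn main() {
    // ~2000 input tokens clipped to the last 1024, as described above.
    let tokens: Vec<u32> = (0..2000).collect();
    let clipped = truncate_tokens(&tokens, 1024);
    assert_eq!(clipped.len(), 1024);
    assert_eq!(clipped[0], 2000 - 1024);
}
```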

ggerganov commented on Sep 8, 2023

There are 2 solutions:

wsxiaoys (Member, Author) commented on Sep 8, 2023

Thanks for the information. I implemented it with the second approach, and it works perfectly:

2023-09-08T11:03:54.956700Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:156: Listening at 0.0.0.0:8080
llama_tokenize_with_model: too many tokens

llama_print_timings:        load time =  1299.84 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  5759.81 ms /  1918 tokens (    3.00 ms per token,   333.00 tokens per second)
llama_print_timings:        eval time =  3922.20 ms /   127 runs   (   30.88 ms per token,    32.38 tokens per second)
llama_print_timings:       total time =  9723.36 ms
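The two solutions ggerganov listed aren't quoted in this thread, so purely as a generic illustration of one common way to stay within the allocator's buffer, here is a hedged sketch of evaluating a long prompt in fixed-size chunks rather than as one large batch. The `n_batch` value and the `eval_batch` callback are hypothetical placeholders, not the llama.cpp or llama-cpp-bindings API, and this is not necessarily the exact fix merged here.

```rust
/// Feed a long prompt to the model in chunks of at most `n_batch` tokens, so
/// no single evaluation needs scratch space sized for the whole prompt.
/// `eval_batch` stands in for whatever the backend exposes; it receives each
/// chunk plus the number of tokens already evaluated (`n_past`).
/// (Generic illustration, not necessarily the fix merged in this PR.)
fn eval_prompt_in_chunks<F>(tokens: &[u32], n_batch: usize, mut eval_batch: F)
where
    F: FnMut(&[u32], usize),
{
    let mut n_past = 0;
    for chunk in tokens.chunks(n_batch) {
        eval_batch(chunk, n_past);
        n_past += chunk.len();
    }
}

fn main() {
    // 1918 prompt tokens (as in the timings above), fed in chunks of 512.
    let tokens: Vec<u32> = (0..1918).collect();
    eval_prompt_in_chunks(&tokens, 512, |chunk, n_past| {
        println!("evaluating {} tokens at n_past = {}", chunk.len(), n_past);
    });
}
```

With chunking, each evaluation's working buffers scale with the chunk size rather than with the full prompt length, which is generally how llama.cpp-style backends keep memory bounded for long prompts.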

wsxiaoys merged commit ad3b974 into main on Sep 8, 2023
4 checks passed
wsxiaoys deleted the fix-max-input-length branch on September 8, 2023 at 16:20