server : check that the prompt fits in the slot's context #10030

Merged

ggerganov merged 2 commits into master from gg/server-check-ctx on Oct 25, 2024

Conversation

ggerganov (Owner)

fix #9978

In embedding and reranking mode, a prompt can fit in the batch yet still exceed the slot's context size. This PR adds a check so that such cases are handled gracefully with an error instead of a crash.
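
For illustration, here is a minimal sketch of where the new guard sits in the server's slot-processing loop; the surrounding loop and the server_slot type are simplified assumptions on my part, and the exact lines from the diff are quoted in the review thread below.

    // sketch only: simplified view of the new check inside the slot-processing loop
    for (server_slot & slot : slots) {
        // the prompt may fit in the batch (-b / -ub) and still not fit in this slot's context window
        if (slot.n_prompt_tokens > slot.n_ctx) {
            slot.release();
            send_error(slot, "input is larger than the max context size. skipping", ERROR_TYPE_SERVER);
            continue; // skip this request cleanly instead of crashing later
        }
        // ... normal prompt processing follows ...
    }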

Testing

./llama-server \
    -m ./models/bge-large-zh-v1.5/ggml-model-f16.gguf \
    --port 8012 -a emb@bge-large-zh-v1.5 -ngl 100 \
    --embeddings -ub 8192 -b 8192 --pooling cls

curl \
    http://localhost:8012/v1/embeddings -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{"input": ["'"$(printf 'hello %.0s' $(seq 1 550))"'"], "encoding_format": "float"}'
{
  "error": {
    "code": 500,
    "message": "input is larger than the max context size. skipping",
    "type": "server_error"
  }
}
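
(For context: bge-large-zh-v1.5 is a BERT-style embedding model with a 512-token context, so the ~550-repetition prompt above fits easily in the 8192-token batch set by -b/-ub but exceeds the slot's context. This is exactly the case the new check now turns into a clean 500 error instead of a crash.)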


if (slot.n_prompt_tokens > slot.n_ctx) {
    slot.release();
    send_error(slot, "input is larger than the max context size. skipping", ERROR_TYPE_SERVER);
Collaborator

I'd suggest changing the wording to match the message shown when the number of tokens is larger than n_ubatch:

Suggested change:
-    send_error(slot, "input is larger than the max context size. skipping", ERROR_TYPE_SERVER);
+    send_error(slot, "input is too large to process. increase the context size", ERROR_TYPE_SERVER);

ggerganov (Owner, Author)

Suggesting to increase the context size could lead to the same issue the user hit in #9978, which was caused by setting a context size larger than what the model supports. So I'll keep the version without that suggestion.
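
For reference, the existing check that the wording suggestion refers to looks roughly like the sketch below; it is recalled from the surrounding server code, so the exact message and placement may differ slightly. The new n_ctx check keeps its own, more specific message.

    // existing guard for non-causal (embedding/reranking) slots:
    // the whole prompt must fit in a single physical batch
    if (slot.n_prompt_tokens > n_ubatch) {
        slot.release();
        send_error(slot, "input is too large to process. increase the physical batch size", ERROR_TYPE_SERVER);
        continue;
    }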

ggerganov merged commit bc5ba00 into master on Oct 25, 2024. 56 checks passed.
ggerganov deleted the gg/server-check-ctx branch on October 25, 2024 at 07:13.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 15, 2024.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Nov 18, 2024.
Labels: examples, python, server

Successfully merging this pull request may close these issues.

Bug: llama-server crash with --embeddings (#9978)
2 participants