server : refactor slot input data, move tokenizer to HTTP thread #10023
Conversation
@ggerganov Could you please share some curl commands that you used for testing?
Here is a simple test that verifies that `input_extra` is used during infill:

```sh
curl \
    --silent --no-buffer --request POST \
    --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_extra": [{"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n"}], "input_suffix": "}\n", "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = ", "prompt": "", "top_k": 1, "stop": ["\n"]}' | jq
```

```json
{
  "content": "llama_n_threads(NULL);",
  ...
}
```

Not sure what would be the smallest FIM model that this would work with. I've tested with Qwen2.5 1.5B, but it might be too big for the server tests script. If you can figure out a way to do it, it would be very useful to test the infill endpoint there. In any case, I'm planning to add similar tests.
I ended up adding FIM tokens to the existing stories260K model to make it compatible with the infill endpoint. I ran the same test on both models. One thing that I noticed while testing: the formatted infill prompt does not look right.

To reproduce, start the server and send the request mentioned in your last message (in my case, with `llama_` appended to the end of `input_prefix`):

```json
{
"input_extra": [
{
"filename": "llama.h",
"text": "LLAMA_API int32_t llama_n_threads();\n"
}
],
"input_suffix": "}\n",
"input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_",
"prompt": "",
"temperature": 0,
"seed": 42,
"n_predict": 2
}
```

Then observe the formatted prompt in the response (note the `prompt` field):

```json
{
"content": "get_num",
"id_slot": 0,
"stop": true,
"model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
"tokens_predicted": 2,
"tokens_evaluated": 27,
"generation_settings": {
"n_ctx": 2048,
"n_predict": -1,
"model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
"seed": 42,
"seed_cur": 42,
"temperature": 0.0,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"tfs_z": 1.0,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"penalize_nl": false,
"stop": [],
"max_tokens": 2,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"samplers": [
"top_k",
"tfs_z",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
]
},
"prompt": "filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>",
"has_new_line": false,
"truncated": false,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": true,
"stopping_word": "",
"tokens_cached": 28,
"timings": {
...
},
"index": 0
}
```

I suspect that it may have something to do with `format_infill`.
Yes, this logic seems to have issues - thank you for noticing this. I will fix it in a follow-up PR.
Looks really well done 👍
@ngxson Just wanted to add that I really appreciate you integrating such a robust way to deal with the different kinds of prompts that are possible. I am not sure what you may have already been planning around this, but if my comments in the other PR about how important versatility was to me in my own situation helped inspire any of your ideas here, then I am honored that you would so rapidly incorporate that. Either way, the effort is very appreciated. Thank you!
I suspect this may have broken #7728. Previously, the slot-selection logic compared the [new] task's prompt to each [old] slot's prompt.
@chrisstankevitz I would appreciate having logs and reproducible steps.
@ngxson You are correct that I do not have logs or reproducible steps. The idea behind the logic in #7728 is to find an [old] slot whose prompt is similar to the [new] task's prompt.
Original Pseudo-Code
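A C++-style sketch of the pre-refactor comparison, based on the diff quoted further down in this thread (surrounding bookkeeping omitted):

```cpp
// [old] slot prompt vs [new] task prompt, both as strings
std::string slot_prompt = slot.prompt.get<std::string>();
// 'prompt' here is the incoming task's prompt
int lcp_len = longest_common_prefix(slot_prompt, prompt);
// ... the slot with the longest common prefix is preferred ...
```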
New Pseudo-Code (possibly wrong)
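A sketch of the logic as merged in this PR (again based on the diff quoted below):

```cpp
// [old] slot's cached tokens vs the SAME slot's previous prompt tokens;
// the incoming task's prompt is never consulted
int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
```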
You can see that the original code compared (1) the [old] slot prompt to (2) the [new] task prompt. After the change, the code compares (1) the [old] slot's prompt tokens to (2) the same slot's "cache_tokens". This is incorrect: it should not be comparing a slot's prompt tokens to that same slot's "cache_tokens". Instead, the logic should compare the [old] slot's prompt tokens to the [new] task's prompt tokens.

New Pseudo-Code (probably correct)
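A sketch of the comparison described above (the suggestion later in this thread uses `slot.cache_tokens` on the left-hand side instead):

```cpp
// [old] slot's prompt tokens vs the [new] task's prompt tokens
int lcp_len = longest_common_prefix(slot.prompt_tokens, task.prompt_tokens);
```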
Another clue that something is amiss: the slot-selection function still receives the task's prompt but no longer uses it. I do not use this logic, so I do not have an example... I was just reviewing recent changes to server.cpp, and it didn't look right to me.
OK I see, you are correct that my change makes it compare the slot's `cache_tokens` with the slot's own `prompt_tokens`. But in fact, that was my intent. The whole reason why we want to check `cache_tokens` is that some tokens from the past prompt may no longer be in the cache.
I might be misunderstanding, but it sounds like you are saying it is your intent to ignore the [new] task's prompt tokens. If so, I'm afraid that you are misunderstanding the point of #7728, which is to find an [old] slot that is similar to the [new] task. But if you insist on ignoring the task's prompt, then you should at least remove the task's prompt from the function's argument list.
@sasha0552 Would you please comment on this?
```diff
     // length of the Longest Common Prefix between the current slot's prompt and the input prompt
-    int lcp_len = longest_common_prefix(slot_prompt, prompt);
+    int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
```
OK sorry, I see now. Indeed, I think the old code is quite confusing (and that's why my new version is incorrect).

The old code is:

```cpp
std::string slot_prompt = slot.prompt.get<std::string>();
int lcp_len = longest_common_prefix(slot_prompt, prompt);
```

So when refactoring, I thought that `slot_prompt` is indeed `task.prompt`. The fix would be (as you said) to change the second argument to `task.prompt_tokens`. Do you want to make a PR, @chrisstankevitz?
```diff
-    int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
+    int lcp_len = longest_common_prefix(slot.cache_tokens, task.prompt_tokens);
```
Also please note that here we compare `slot.cache_tokens` with `task.prompt_tokens`, but not `slot.prompt_tokens` with `task.prompt_tokens`, because some tokens stored in the past `prompt_tokens` may not be in the cache.
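A tiny self-contained illustration of that point, with hypothetical token values (`llama_tokens` is modelled here as a plain vector of ints):

```cpp
#include <vector>

using llama_tokens = std::vector<int>; // simplified stand-in for the server's token list type

int main() {
    // hypothetical example: the previous request's prompt had 8 tokens, but a context
    // shift/truncation left only the last 5 of them in the KV cache
    llama_tokens prompt_tokens = {1, 2, 3, 4, 5, 6, 7, 8}; // what the slot was asked to process
    llama_tokens cache_tokens  = {4, 5, 6, 7, 8};          // what actually remains cached

    // when picking a slot to reuse, the new task's prompt is therefore matched against
    // cache_tokens (what can really be reused), not against the old prompt_tokens
    return 0;
}
```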
As I see it, after #9866 the longest common prefix algorithm used for this comparison should be replaced by a longest common substring algorithm at least (or perhaps something more complex). In #7728 I initially intended to use LCS precisely because of the possibility of reusing computed tokens in the middle in the future.
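For reference, a minimal sketch of a longest-common-prefix count over two token sequences (illustrative only; the server has its own helper for this):

```cpp
#include <cstddef>
#include <vector>

// count how many tokens two sequences share at the start; the slot-selection heuristic
// uses this to estimate how much of the KV cache can be reused
static size_t longest_common_prefix(const std::vector<int> & a, const std::vector<int> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```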
server : refactor slot input data, move tokenizer to HTTP thread (…rganov#10023)

* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere
Motivation
Ref discussion: #9702 (comment)
The main motivation of this PR is to get rid of having a `json` prompt as slot input data. The `json` data format is quite dangerous and messy to work with, as we now have to support many input shapes.

In addition, we're currently doing some post-processing (i.e. formatting the chat template) at the HTTP level, but some other post-processing (i.e. formatting the prompt for rerank & infill) is done in the inference thread.
In this PR

I tried moving things around and defining a pattern.

The HTTP thread now parses and tokenizes the input data, then dispatches the task to a slot (via `launch_slot_with_task`); the tokenized prompt is stored in `task.prompt_tokens`. The `slot` will always take an array of tokens as input, saved into `slot.prompt_tokens`.
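A minimal sketch of that pattern, using simplified stand-in types (the real `server_task`, `server_slot`, and tokenizer calls in `server.cpp` are more involved):

```cpp
#include <string>
#include <utility>
#include <vector>

using llama_tokens = std::vector<int>; // simplified stand-in for the real token type

struct server_task {
    llama_tokens prompt_tokens; // filled by the HTTP thread
};

struct server_slot {
    llama_tokens prompt_tokens; // the slot only ever works with tokens
};

// stand-in tokenizer: the real code calls into llama.cpp's tokenizer
static llama_tokens tokenize_prompt(const std::string & prompt) {
    return llama_tokens(prompt.begin(), prompt.end());
}

// HTTP thread: parse and tokenize the request, then build the task
static server_task make_task(const std::string & prompt) {
    server_task task;
    task.prompt_tokens = tokenize_prompt(prompt); // tokenization now happens here
    return task;
}

// inference side: the slot always receives an array of tokens
static void launch_slot_with_task(server_slot & slot, server_task && task) {
    slot.prompt_tokens = std::move(task.prompt_tokens);
    // ... run inference on slot.prompt_tokens ...
}

int main() {
    server_slot slot;
    launch_slot_with_task(slot, make_task("Hello"));
    return 0;
}
```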
TODO

- remove the `ctx_server.tokenize` function
- rename `SERVER_TASK_TYPE_COMPLETION` to `SERVER_TASK_TYPE_INFERENCE` to better reflect what it does