
server : refactor slot input data, move tokenizer to HTTP thread #10023

Merged
13 commits merged on Oct 24, 2024

Conversation

@ngxson (Collaborator) commented Oct 23, 2024

Motivation

Ref discussion: #9702 (comment)

The main motivation of this PR is to get rid of the JSON prompt as slot input data. The JSON format is error-prone and messy to work with, because we currently have to support many input shapes (see the sketch after the list below):

this supports these cases:
- "prompt": "string"
- "prompt": [12, 34, 56]
- "prompt": [12, 34, "string", 56, 78]
and multiple prompts (multi-tasks):
- "prompt": ["string1", "string2"]
- "prompt": ["string1", [12, 34, 56]]
- "prompt": [[12, 34, "string", 56, 78], [12, 34, 56]]

In addition, we currently do some of the prompt processing (e.g. applying the chat template) at the HTTP level, while other parts (e.g. formatting the prompt for rerank & infill) are done in the inference thread.

In this PR

I tried moving things around and defining a pattern:

What the HTTP thread does:

  • Validate input types (somewhat, because some fields are still being validated by launch_slot_with_task)
  • Apply format (chat template, rerank template, infill format)
  • Tokenize the prompt --> tokens are put into task.prompt_tokens

The slot will always take an array of tokens as input, saved into slot.prompt_tokens
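As a rough sketch of that split (the struct fields and function names below are illustrative stand-ins, not the exact server.cpp symbols, and the tokenizer is a stub):

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using llama_token  = int32_t;
using llama_tokens = std::vector<llama_token>;

struct server_task {
    llama_tokens prompt_tokens;   // always a flat token vector, never raw JSON
};

struct server_slot {
    llama_tokens prompt_tokens;   // filled from the task when it is launched
};

// stub tokenizer, only to keep the sketch self-contained
static llama_tokens tokenize(const std::string & text) {
    return llama_tokens(text.begin(), text.end());
}

// HTTP thread: after validating the request and applying the chat/rerank/
// infill format, the (already formatted) prompt is tokenized here
static server_task make_task(const std::string & formatted_prompt) {
    server_task task;
    task.prompt_tokens = tokenize(formatted_prompt);
    return task;
}

// inference thread: the slot only ever consumes tokens
static void launch_slot_with_task(server_slot & slot, server_task && task) {
    slot.prompt_tokens = std::move(task.prompt_tokens);
}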

TODO

  • Add test case for infill --> use stories260K with added FIM tokens
  • Remove redundant ctx_server.tokenize function
  • Do not throw error on empty prompt
  • Update readme
  • (maybe) rename SERVER_TASK_TYPE_COMPLETION to _INFERENCE to better reflect what it does

@ngxson (Collaborator, Author) commented Oct 23, 2024

@ggerganov Could you please share some curl commands that you used for testing the /infill API? I would like to add them as test cases. Thank you.

@ggerganov (Owner) commented:

Here is a simple test that verifies that "input_extra" is used during /infill:

curl \
    --silent --no-buffer --request POST \
    --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_extra": [{"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n"}], "input_suffix": "}\n", "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = ", "prompt": "", "top_k": 1, "stop": ["\n"]}' | jq
{
  "content": "llama_n_threads(NULL);",
  ...
}

I'm not sure what the smallest FIM model is that this would work with. I've tested with Qwen2.5 1.5B, but it might be too big for the server tests script. If you can figure out a way to do it, it would be very useful for testing the /infill functionality.

In any case, I'm planning to add similar tests to ci/run.sh where we can afford to run a bigger model in the range of 1-2B.

The github-actions bot added the python (python script changes) label on Oct 24, 2024
@ngxson (Collaborator, Author) commented Oct 24, 2024

I ended up adding FIM tokens to the existing stories260K to make it compatible with the /infill endpoint, so that it can be used in tests. In the future, we could also fine-tune SmolLM 360M so that it can infill code (the base model is already trained on coding problems).

I ran the same test on both master and this PR, and obtained the same result.

One thing I noticed while testing: it seems that extra_tokens is not correctly added to prompt_tokens in some cases. For example:

./llama-server -m ../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf -c 4096 -np 2 --verbose

Then send the request mentioned in your last message (in my case, with n_predict = 2):

{
    "input_extra": [
        {
            "filename": "llama.h",
            "text": "LLAMA_API int32_t llama_n_threads();\n"
        }
    ],
    "input_suffix": "}\n",
    "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_",
    "prompt": "",
    "temperature": 0,
    "seed": 42,
    "n_predict": 2
}

Then observe the formatted prompt (note that slot.prompt is removed in this PR; the prompt in the response below is detokenize(slot.prompt_tokens), which should be more useful for debugging):

{
    "content": "get_num",
    "id_slot": 0,
    "stop": true,
    "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
    "tokens_predicted": 2,
    "tokens_evaluated": 27,
    "generation_settings": {
        "n_ctx": 2048,
        "n_predict": -1,
        "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
        "seed": 42,
        "seed_cur": 42,
        "temperature": 0.0,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "min_p": 0.05000000074505806,
        "xtc_probability": 0.0,
        "xtc_threshold": 0.10000000149011612,
        "tfs_z": 1.0,
        "typical_p": 1.0,
        "repeat_last_n": 64,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.10000000149011612,
        "penalize_nl": false,
        "stop": [],
        "max_tokens": 2,
        "n_keep": 0,
        "n_discard": 0,
        "ignore_eos": false,
        "stream": false,
        "n_probs": 0,
        "min_keep": 0,
        "grammar": "",
        "samplers": [
            "top_k",
            "tfs_z",
            "typ_p",
            "top_p",
            "min_p",
            "xtc",
            "temperature"
        ]
    },
    "prompt": "filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>",
    "has_new_line": false,
    "truncated": false,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": true,
    "stopping_word": "",
    "tokens_cached": 28,
    "timings": {
        ...
    },
    "index": 0
}

I suspect it may have something to do with n_extra_take, but it's out of scope for the current PR. Maybe you can fix it in another PR @ggerganov

@ngxson ngxson marked this pull request as ready for review October 24, 2024 14:43
@ngxson ngxson requested a review from ggerganov October 24, 2024 14:43
@ggerganov (Owner) commented:

I suspect it may have something to do with n_extra_take, but it's out of scope for the current PR

Yes, this logic seems to have issues - thank you for noticing this. I will fix it in a follow-up PR.

@ggerganov (Owner) left a review comment:

Looks really well done 👍

@wwoodsTM (Contributor) commented:

@ngxson Just wanted to add that I really appreciate you integrating such a robust way to deal with the different kinds of prompts that are possible. I am not sure what you may have already been planning around this, but if my comments in the other PR about how important versatility was to me in my own situation helped inspire any of your ideas here, then I am honored that you would incorporate that so rapidly. Either way, the effort is very much appreciated. Thank you!

@ngxson ngxson merged commit 958367b into ggerganov:master Oct 24, 2024
54 checks passed
@chrisstankevitz commented Oct 31, 2024

I suspect/wonder if this broke #7728. Previously, server_context::get_available_slot attempted to find a slot whose prompt is similar to the task prompt, via common_part(slot_prompt, task_prompt). Now it compares the slot's prompt to itself and doesn't even use the task prompt: longest_common_prefix(slot.cache_tokens, slot.prompt_tokens).

@ngxson (Collaborator, Author) commented Oct 31, 2024

Now it compares the slot's prompt to itself and doesn't even use the task prompt

@chrisstankevitz slot.prompt_tokens contains tokens from the input prompt, so I don't understand what the problem is here. Can you elaborate?

(I would appreciate having logs and reproducible steps)

@chrisstankevitz commented Oct 31, 2024

@ngxson you are correct that slot.prompt_tokens contains the tokens from the input prompt -- but those are the tokens from the "prior" input prompt, i.e. the prompt that was most recently used on that slot. There is another set of tokens that needs to be compared: the tokens from the new/current "task prompt".

The idea behind the logic is to:

  1. Take the "task prompt" (the prompt from the new request)
  2. Compare the "task prompt" to each "slot prompt".
  3. Identify the slot whose [prior] "slot prompt" is most similar to the [new/current] "task prompt"
  4. Prefer to use the slot identified in (3)

Original Pseudo-Code

for (slot : slots)
{
  ...
  common_part(slot.prompt, task.prompt)
  ...
}

New Pseudo-Code (possibly wrong)

for (slot : slots)
{
  ...
  longest_common_prefix(slot.prompt_tokens, slot.cache_tokens)
  ...
}

You can see that the original code was comparing 1) the [old] slot prompt to 2) the [new] task prompt.

But after the change, the code is comparing 1) the [old] slot prompt tokens to 2) the slot's "cache_tokens". This is incorrect: it should not be comparing the slot's prompt tokens to the same slot's "cache_tokens". Instead, the logic should compare the [old] slot's prompt tokens to the [new] task's prompt tokens.

New Pseudo-Code (probably correct)

for (slot : slots)
{
  ...
  longest_common_prefix(slot.prompt_tokens, task.prompt_tokens)
  ...
}

Another clue that something is amiss: the function server_context::get_available_slot no longer uses the task prompt (which is passed as an input). The "task prompt" is critical to this "common prefix" logic and it should be used.

I do not use this logic, so I do not have an example... I was just reviewing recent changes to server.cpp, and this didn't look right to me.

@ngxson (Collaborator, Author) commented Oct 31, 2024

OK I see, you are correct that my change makes it compare the input prompt_tokens with cached_tokens, rather than the input prompt_tokens with the prior prompt_tokens from the last run.

But in fact, that was my intent.

The whole reason we want to check longest_common_prefix(prompt, cached_tokens) is that we want to reuse cached tokens. Comparing the new vs. the prior prompt_tokens doesn't make sense, because some tokens in the old prompt may no longer be in the cache (e.g. due to context shifting or prompt truncation); indeed, in some cases this would negatively impact performance.

I also want to note that longest_common_prefix may be outdated because of #9866. Before, we could only reuse the cache for prompts sharing the same prefix; now we can reuse different chunks within the prompt, so we may need something like longest_common_chunk in the future.
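(For reference, a minimal sketch of what a prefix-length helper over token vectors computes; not necessarily the exact implementation in the repo:)

#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token  = int32_t;
using llama_tokens = std::vector<llama_token>;

// number of leading tokens shared by the two sequences
static size_t longest_common_prefix(const llama_tokens & a, const llama_tokens & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}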

@chrisstankevitz commented Oct 31, 2024

I might be misunderstanding, but it sounds like you are saying it is your intent to ignore the [new] task's prompt tokens. If so, I'm afraid you are misunderstanding the point of #7728, which is to find an [old] slot that is similar to the [new] task. But if you insist on ignoring the task's prompt, then you should at least remove the task's prompt from the argument list of server_context::get_available_slot, because it is not used anymore.

@chrisstankevitz commented:

@sasha0552 would you please comment on this?

// length of the Longest Common Prefix between the current slot's prompt and the input prompt
-   int lcp_len = longest_common_prefix(slot_prompt, prompt);
+   int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
@ngxson (Collaborator, Author) commented Oct 31, 2024

OK, sorry, I see it now. Indeed, I think the old code is quite confusing (and that's why my new version is incorrect).

The old code is:

std::string slot_prompt = slot.prompt.get<std::string>();
int lcp_len = longest_common_prefix(slot_prompt, prompt);

So when refactoring, I thought that slot_prompt was indeed task.prompt.

The fix would be (as you said) to change the second argument to task.prompt_tokens. Do you want to make a PR, @chrisstankevitz?

Suggested change:
-   int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
+   int lcp_len = longest_common_prefix(slot.cache_tokens, task.prompt_tokens);

@ngxson (Collaborator, Author) commented Oct 31, 2024

Also please note that here we compare:

slot.cache_tokens, task.prompt_tokens

But not

slot.prompt_tokens, task.prompt_tokens

Because some tokens stored in the past prompt_tokens may no longer be in the cache.
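A minimal sketch of how the slot-selection loop could look with that comparison (the server_slot fields and the pick_slot helper are illustrative, not the actual get_available_slot code; longest_common_prefix is the same helper as sketched above):

#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token  = int32_t;
using llama_tokens = std::vector<llama_token>;

struct server_slot {
    llama_tokens cache_tokens;   // tokens currently held in this slot's KV cache
    bool         available = true;
};

static size_t longest_common_prefix(const llama_tokens & a, const llama_tokens & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}

// prefer the free slot whose cached tokens share the longest prefix with the
// new task's prompt tokens, so that shared prefix does not need re-evaluation
static server_slot * pick_slot(std::vector<server_slot> & slots,
                               const llama_tokens & task_prompt_tokens) {
    server_slot * best     = nullptr;
    size_t        best_lcp = 0;
    for (auto & slot : slots) {
        if (!slot.available) {
            continue;
        }
        const size_t lcp = longest_common_prefix(slot.cache_tokens, task_prompt_tokens);
        if (best == nullptr || lcp > best_lcp) {
            best     = &slot;
            best_lcp = lcp;
        }
    }
    return best;   // nullptr if no slot is free
}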

@sasha0552 (Contributor) commented Nov 1, 2024

slot.cache_tokens, task.prompt_tokens

As I see it, task.prompt_tokens is just the prompt tokenized directly from the request, so it should be fine.

Comparing slot.cache_tokens, slot.prompt_tokens is indeed pointless.

I also want to note that longest_common_prefix may be outdated because of #9866. Before, we could only reuse the cache for prompts sharing the same prefix; now we can reuse different chunks within the prompt, so we may need something like longest_common_chunk in the future.

After #9866, the longest-common-prefix algorithm should be replaced with at least a longest-common-substring algorithm (or perhaps something more complex). In #7728 I initially intended to use LCS precisely because of the possibility of (in the future) reusing computed tokens in the middle.
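A minimal sketch of a longest-common-substring length over token sequences, as one possible replacement (classic O(n*m) dynamic programming; not tied to any existing function in the repo):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using llama_token  = int32_t;
using llama_tokens = std::vector<llama_token>;

// length of the longest contiguous run of tokens that appears in both a and b
static size_t longest_common_substring(const llama_tokens & a, const llama_tokens & b) {
    if (a.empty() || b.empty()) {
        return 0;
    }
    std::vector<size_t> prev(b.size() + 1, 0);
    std::vector<size_t> cur (b.size() + 1, 0);
    size_t best = 0;
    for (size_t i = 1; i <= a.size(); i++) {
        for (size_t j = 1; j <= b.size(); j++) {
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            best   = std::max(best, cur[j]);
        }
        std::swap(prev, cur);
    }
    return best;
}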

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
server : refactor slot input data, move tokenizer to HTTP thread (ggerganov#10023)

* server : refactor slot input data, move tokenizer to HTTP thread

* move prompt_tokens.empty() check

* fix incorrect if branch

* fix infinite generation loop

* bring back infill validation

* add infill test

* try fixing format_infill

* fix test

* remove redundant code

* rename completion to inference

* update docs

* use llama_tokens everywhere
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: examples, python (python script changes), server