server : refactor slot input data, move tokenizer to HTTP thread #10023
Conversation
@ggerganov Could you please share some curl commands that you used for testing?
Here is a simple test that verifies that `input_extra` is used during infill:

```sh
curl \
    --silent --no-buffer --request POST \
    --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_extra": [{"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n"}], "input_suffix": "}\n", "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = ", "prompt": "", "top_k": 1, "stop": ["\n"]}' | jq
```

```json
{
  "content": "llama_n_threads(NULL);",
  ...
}
```

Not sure what would be the smallest FIM model that this would work with. I've tested with Qwen2.5 1.5B, but it might be too big for the server tests script. If you can figure out a way to do it, it would be very useful to test the infill endpoint there. In any case, I'm planning to add similar tests.
I ended up adding FIM tokens to the existing stories260K model to make it compatible with the infill endpoint. I ran the same test on both models. One thing that I noticed while testing: the formatted infill prompt does not look right.

To reproduce, start the server and send the request mentioned in your last message (in my case, with `llama_` appended to the end of `input_prefix`):

```json
{
"input_extra": [
{
"filename": "llama.h",
"text": "LLAMA_API int32_t llama_n_threads();\n"
}
],
"input_suffix": "}\n",
"input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_",
"prompt": "",
"temperature": 0,
"seed": 42,
"n_predict": 2
}
```

Then observe the formatted prompt in the response (note the `prompt` field):

```json
{
"content": "get_num",
"id_slot": 0,
"stop": true,
"model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
"tokens_predicted": 2,
"tokens_evaluated": 27,
"generation_settings": {
"n_ctx": 2048,
"n_predict": -1,
"model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
"seed": 42,
"seed_cur": 42,
"temperature": 0.0,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"tfs_z": 1.0,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"penalize_nl": false,
"stop": [],
"max_tokens": 2,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"samplers": [
"top_k",
"tfs_z",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
]
},
"prompt": "filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>",
"has_new_line": false,
"truncated": false,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": true,
"stopping_word": "",
"tokens_cached": 28,
"timings": {
...
},
"index": 0
}
```

I suspect that it may have something to do with `format_infill`.
Yes, this logic seems to have issues - thank you for noticing this. I will fix it in a follow-up PR.
Looks really well done 👍
@ngxson Just wanted to add that I really appreciate you integrating such a robust way to deal with the different kinds of prompts that are possible. I am not sure what you may have already been planning around this, but if my comments in the other PR about how important versatility was to me in my own situation helped inspire any of your ideas here, then I am honored that you would so rapidly incorporate that. Either way, the effort is very appreciated. Thank you!
I suspect this may have broken #7728. Previously, the slot-selection logic compared the [new] task's prompt to each [old] slot's prompt.
@chrisstankevitz I would appreciate having logs and reproducible steps.
@ngxson You are correct that I do not have logs or reproducible steps. The idea behind the logic in #7728 is to find an [old] slot whose prompt is similar to the [new] task's prompt.
Original Pseudo-Code
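A C++-style sketch of the pre-refactor comparison, based on the diff quoted further down in this thread (surrounding bookkeeping omitted):

```cpp
// [old] slot prompt vs [new] task prompt, both as strings
std::string slot_prompt = slot.prompt.get<std::string>();
// 'prompt' here is the incoming task's prompt
int lcp_len = longest_common_prefix(slot_prompt, prompt);
// ... the slot with the longest common prefix is preferred ...
```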
New Pseudo-Code (possibly wrong)
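A sketch of the logic as merged in this PR (again based on the diff quoted below):

```cpp
// [old] slot's cached tokens vs the SAME slot's previous prompt tokens;
// the incoming task's prompt is never consulted
int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
```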
You can see that the original code compared (1) the [old] slot prompt to (2) the [new] task prompt. After the change, the code compares (1) the [old] slot's prompt tokens to (2) the same slot's "cache_tokens". This is incorrect: it should not be comparing a slot's prompt tokens to that same slot's "cache_tokens". Instead, the logic should compare the [old] slot's prompt tokens to the [new] task's prompt tokens.

New Pseudo-Code (probably correct)
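A sketch of the comparison described above (the suggestion later in this thread uses `slot.cache_tokens` on the left-hand side instead):

```cpp
// [old] slot's prompt tokens vs the [new] task's prompt tokens
int lcp_len = longest_common_prefix(slot.prompt_tokens, task.prompt_tokens);
```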
Another clue that something is amiss: the slot-selection function still receives the task's prompt but no longer uses it. I do not use this logic, so I do not have an example... I was just reviewing recent changes to server.cpp, and it didn't look right to me.
OK I see, you are correct that my change makes it compare the slot's `cache_tokens` with the slot's own `prompt_tokens`. But in fact, that was my intent. The whole reason why we want to check `cache_tokens` is that some tokens from the past prompt may no longer be in the cache.
I might be misunderstanding, but it sounds like you are saying it is your intent to ignore the [new] task's prompt tokens. If so, I'm afraid that you are misunderstanding the point of #7728, which is to find an [old] slot that is similar to the [new] task. But if you insist on ignoring the task's prompt, then you should at least remove the task's prompt from the function's argument list.
@sasha0552 Would you please comment on this?
```diff
     // length of the Longest Common Prefix between the current slot's prompt and the input prompt
-    int lcp_len = longest_common_prefix(slot_prompt, prompt);
+    int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
```
OK sorry, I see now. Indeed, I think the old code is quite confusing (and that's why my new version is incorrect).

The old code is:

```cpp
std::string slot_prompt = slot.prompt.get<std::string>();
int lcp_len = longest_common_prefix(slot_prompt, prompt);
```

So when refactoring, I thought that `slot_prompt` is indeed `task.prompt`. The fix would be (as you said) to change the second argument to `task.prompt_tokens`. Do you want to make a PR, @chrisstankevitz?
```diff
-    int lcp_len = longest_common_prefix(slot.cache_tokens, slot.prompt_tokens);
+    int lcp_len = longest_common_prefix(slot.cache_tokens, task.prompt_tokens);
```
Also please note that here we compare `slot.cache_tokens` with `task.prompt_tokens`, but not `slot.prompt_tokens` with `task.prompt_tokens`, because some tokens stored in the past `prompt_tokens` may not be in the cache.
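A tiny self-contained illustration of that point, with hypothetical token values (`llama_tokens` is modelled here as a plain vector of ints):

```cpp
#include <vector>

using llama_tokens = std::vector<int>; // simplified stand-in for the server's token list type

int main() {
    // hypothetical example: the previous request's prompt had 8 tokens, but a context
    // shift/truncation left only the last 5 of them in the KV cache
    llama_tokens prompt_tokens = {1, 2, 3, 4, 5, 6, 7, 8}; // what the slot was asked to process
    llama_tokens cache_tokens  = {4, 5, 6, 7, 8};          // what actually remains cached

    // when picking a slot to reuse, the new task's prompt is therefore matched against
    // cache_tokens (what can really be reused), not against the old prompt_tokens
    return 0;
}
```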
As I see it, after #9866 the longest common prefix algorithm used for this comparison should be replaced by a longest common substring algorithm at least (or perhaps something more complex). In #7728 I initially intended to use LCS precisely because of the possibility of reusing computed tokens in the middle in the future.
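For reference, a minimal sketch of a longest-common-prefix count over two token sequences (illustrative only; the server has its own helper for this):

```cpp
#include <cstddef>
#include <vector>

// count how many tokens two sequences share at the start; the slot-selection heuristic
// uses this to estimate how much of the KV cache can be reused
static size_t longest_common_prefix(const std::vector<int> & a, const std::vector<int> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```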
server : refactor slot input data, move tokenizer to HTTP thread (…rganov#10023)

* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere
Motivation
Ref discussion: #9702 (comment)
The main motivation of this PR is to get rid of having a `json` prompt as slot input data. The `json` data format is quite dangerous and messy to work with, as we now have to support many input shapes.

In addition, we're currently doing some post-processing (i.e. formatting the chat template) at the HTTP level, but some other post-processing (i.e. formatting the prompt for rerank & infill) is done in the inference thread.
In this PR

I tried moving things around and defining a pattern.

The HTTP thread now parses and tokenizes the input data, then dispatches the task to a slot (via `launch_slot_with_task`); the tokenized prompt is stored in `task.prompt_tokens`. The `slot` will always take an array of tokens as input, saved into `slot.prompt_tokens`.
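A minimal sketch of that pattern, using simplified stand-in types (the real `server_task`, `server_slot`, and tokenizer calls in `server.cpp` are more involved):

```cpp
#include <string>
#include <utility>
#include <vector>

using llama_tokens = std::vector<int>; // simplified stand-in for the real token type

struct server_task {
    llama_tokens prompt_tokens; // filled by the HTTP thread
};

struct server_slot {
    llama_tokens prompt_tokens; // the slot only ever works with tokens
};

// stand-in tokenizer: the real code calls into llama.cpp's tokenizer
static llama_tokens tokenize_prompt(const std::string & prompt) {
    return llama_tokens(prompt.begin(), prompt.end());
}

// HTTP thread: parse and tokenize the request, then build the task
static server_task make_task(const std::string & prompt) {
    server_task task;
    task.prompt_tokens = tokenize_prompt(prompt); // tokenization now happens here
    return task;
}

// inference side: the slot always receives an array of tokens
static void launch_slot_with_task(server_slot & slot, server_task && task) {
    slot.prompt_tokens = std::move(task.prompt_tokens);
    // ... run inference on slot.prompt_tokens ...
}

int main() {
    server_slot slot;
    launch_slot_with_task(slot, make_task("Hello"));
    return 0;
}
```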
TODO

- remove the `ctx_server.tokenize` function
- rename `SERVER_TASK_TYPE_COMPLETION` to `SERVER_TASK_TYPE_INFERENCE` to better reflect what it does