Feature Request: Prefix assistant answer #11536
Comments
Right now the workaround is to use the new `/apply-template` endpoint.
Great! With this new endpoint […]. Is there an equivalent […]?

```python
import requests


def apply_template():
    # First, render the chat messages into a model-specific prompt string
    # using the server's /apply-template endpoint.
    url = "http://localhost:8080/apply-template"
    prefix = "```go\nfunc quacksort"
    data = {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
        ],
    }
    with requests.post(url, json=data) as response:
        prompt = response.json()["prompt"]

    # Then append the desired prefix and generate via the raw /completion endpoint.
    data = {
        "prompt": prompt + prefix,
        "seed": 0,
    }
    url = "http://localhost:8080/completion"
    with requests.post(url, json=data) as response:
        content = prefix + response.json()["content"]

    print(content)


if __name__ == "__main__":
    apply_template()
```
The templating system used by the models doesn't support parsing. It's not llama.cpp's fault.
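To illustrate the point: chat templates only render a list of messages into a prompt string, and there is no inverse operation that parses a prompt back into messages. A minimal sketch using a simplified, hypothetical ChatML-style template (not any particular model's real template):

```python
from jinja2 import Template

# Simplified, hypothetical ChatML-style chat template, for illustration only.
template = Template(
    "{% for m in messages %}"
    "<|im_start|>{{ m['role'] }}\n{{ m['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "<|im_start|>assistant\n"
)

# Rendering messages -> prompt is the only supported direction; nothing maps
# the rendered prompt string back into structured messages.
prompt = template.render(messages=[
    {"role": "user", "content": "Implement quicksort."},
])
print(prompt)
```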
+1 for this - not supporting prefix in […]
The feature already exists in the form of custom GBNF grammars!
Great! It works!

```python
import requests

url = "http://localhost:8080/v1/chat/completions"


def prefix_using_grammar():
    prefix = "```go\nfunc quacksort"
    data = {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
        ],
        # Constrain generation so the output must start with the prefix.
        "grammar": f'root ::= "{prefix}" .*',  # <---------- this line here is new
        "seed": 0,
    }
    with requests.post(url, json=data) as response:
        content = response.json()["choices"][0]["message"]["content"]

    print(content)


if __name__ == "__main__":
    prefix_using_grammar()
```

All that is required is to add the grammar to the data object:

```python
data = {
    ...
    "grammar": f'root ::= "{prefix}" .*',
}
```

For me, this is good enough, but I wonder whether […].

EDIT: I tested this a bit and I think there is an optimization missing: sequences of consecutive tokens which are uniquely determined should be batch-computed. The performance makes me think that they are computed sequentially.
Prerequisites
Feature Description
Mistral's API allows prefixing the assistant's answer with a specified string, as described in its documentation (a sketch of the usage is given below). This makes it so that the next answer by the assistant starts with the given prefix.
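A minimal sketch of what such a prefixed request can look like against Mistral's chat completions endpoint. The endpoint URL, the `prefix` flag on a trailing assistant message, and the model name are assumptions based on Mistral's public documentation, not quoted from it:

```python
import os

import requests

url = "https://api.mistral.ai/v1/chat/completions"  # assumed endpoint
headers = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}

prefix = "```java\nint add(int x, int y){"
data = {
    "model": "mistral-small-latest",  # assumed model name
    "messages": [
        {"role": "user", "content": "Implement an add function."},
        # The trailing assistant message with "prefix": true is continued by
        # the model instead of being treated as a finished turn.
        {"role": "assistant", "content": prefix, "prefix": True},
    ],
}

with requests.post(url, headers=headers, json=data) as response:
    # Depending on the API version, the returned content may or may not
    # repeat the prefix itself.
    print(response.json()["choices"][0]["message"]["content"])
```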
Motivation
The option to prefix the assistant's answer gives a great deal of control over the model's generation while being much simpler to use than the alternatives.
For example, to force the model to answer directly with Java code using a specific function signature, the prefix could be "```java\nint add(int x, int y){". This technique is used to generate code for benchmarks such as HumanEval, to prevent the models from going off the rails.

Possible Implementation
A full usage example could look something like the sketch below. (I used the qwen2.5-coder-7b-instruct-q3_k_m model, started with `llama-server --model qwen2.5-coder-7b-instruct-q3_k_m.gguf --host 127.0.0.1 --port 8080`.)

The expected result can be obtained with the raw completion API, but this is not portable from model to model, since it requires knowledge of the prompt format, is more complicated, and is generally error prone: a single misplaced whitespace or line break can have a significant impact on generation quality.
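A minimal, hypothetical sketch of what the requested usage could look like against llama.cpp's OpenAI-compatible endpoint, assuming a `prefix` flag on a trailing assistant message in the style of Mistral's API; this is what the feature request asks for, not something the server is assumed to support:

```python
import requests

# Hypothetical usage: the "prefix" flag on the trailing assistant message is
# the requested feature, mirroring Mistral's API; it is not an existing
# llama.cpp server parameter.
url = "http://localhost:8080/v1/chat/completions"
prefix = "```go\nfunc quacksort"
data = {
    "messages": [
        {"role": "system", "content": "Only provide code. Do not write explanations."},
        {"role": "user", "content": "Implement quicksort."},
        # The server would continue this message instead of starting a new turn.
        {"role": "assistant", "content": prefix, "prefix": True},
    ],
    "seed": 0,
}

with requests.post(url, json=data) as response:
    print(prefix + response.json()["choices"][0]["message"]["content"])
```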