server: Add "custom" chat template that uses input_prefix and input_suffix #10425

Draft: wants to merge 7 commits into base: master
common/arg.cpp: 4 changes (2 additions & 2 deletions)
@@ -852,15 +852,15 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.input_prefix = value;
params.enable_chat_template = false;
}
-).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_INFILL}));
+).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_INFILL}));
add_opt(common_arg(
{"--in-suffix"}, "STRING",
"string to suffix after user inputs with (default: empty)",
[](common_params & params, const std::string & value) {
params.input_suffix = value;
params.enable_chat_template = false;
}
-).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_INFILL}));
+).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_INFILL}));
add_opt(common_arg(
{"--no-warmup"},
"skip warming up the model with an empty run",
examples/server/server.cpp: 15 changes (10 additions & 5 deletions)
@@ -2829,7 +2829,7 @@ int main(int argc, char ** argv) {
return;
}

-json data = oaicompat_completion_params_parse(ctx_server.model, json::parse(req.body), params.chat_template);
+json data = oaicompat_completion_params_parse(ctx_server.model, json::parse(req.body), params.chat_template, params.input_prefix, params.input_suffix);

std::vector<server_task> tasks = ctx_server.create_tasks_inference(data, SERVER_TASK_INF_TYPE_COMPLETION);
ctx_server.queue_results.add_waiting_tasks(tasks);
@@ -3220,15 +3220,20 @@ int main(int argc, char ** argv) {

// if a custom chat template is not supplied, we will use the one that comes with the model (if any)
if (params.chat_template.empty()) {
-if (!ctx_server.validate_model_chat_template()) {
+if (!params.input_prefix.empty() || !params.input_suffix.empty()) {
Collaborator:

I think we can refactor a bit here:

  • if both chat_template and suffix/prefix are set, throw an error saying "only template or suffix/prefix can be set, but not both"
  • if chat_template is set, use it
  • if suffix/prefix are set, use them and discard chat_template
  • if none is specified, use the model's default template

In any case, it would be nice to output a test formatted chat here (see the sketch below).
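
A rough, hypothetical sketch of that precedence as a self-contained helper (the function name and parameters are invented for illustration; this is not code from the PR):

    #include <stdexcept>
    #include <string>

    // Decide which chat formatting the server should use, following the bullets above.
    static std::string resolve_chat_template(const std::string & chat_template,
                                             const std::string & input_prefix,
                                             const std::string & input_suffix,
                                             bool model_template_supported) {
        const bool has_affix = !input_prefix.empty() || !input_suffix.empty();

        if (!chat_template.empty() && has_affix) {
            // reject the ambiguous combination up front
            throw std::invalid_argument("only a chat template or prefix/suffix can be set, but not both");
        }
        if (!chat_template.empty()) {
            return chat_template;                        // an explicit template wins
        }
        if (has_affix) {
            return "custom";                             // prefix/suffix formatting
        }
        return model_template_supported ? "" : "chatml"; // "" means: use the model's own template
    }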

Contributor (author):

  • if both chat_template and suffix/prefix are set, throw an error saying "only template or suffix/prefix can be set, but not both"

I think that prioritizing the chat template and ignoring prefix/suffix is enough - and we can replicate the same behaviour in the UI by hiding or dimming the prefix/suffix input fields when "custom" is not selected.

Again, I'm not sure about leaving this in a nameless state where only the presence of prefix/suffix indicates that the custom template is used. Clearly stating what it is would be better, I think.

Collaborator:

For the UI, I think we can go with something simple for now: just a checkbox that says "use custom chat template" and, when it is checked, we show the prefix/suffix inputs. No dropdown is needed.

Let's skip the ability to select named templates like "llama3", "mistral", etc. for now, because it is difficult to maintain two lists of templates, one in llama.cpp and another in index.html.

LOG_WRN("%s: Prefix and suffix are used instead of a chat template. This may cause the model to output suboptimal responses\n", __func__);
params.chat_template = "custom";
} else if (!ctx_server.validate_model_chat_template()) {
LOG_WRN("%s: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses\n", __func__);
params.chat_template = "chatml";
}
+} else if (!params.input_prefix.empty() || !params.input_suffix.empty()) {
+LOG_WRN("%s: Prefix and suffix are not used because a chat template is defined.\n", __func__);
+} else {
+// print sample chat example to make it clear which template is used
+LOG_INF("%s: chat template, built_in: %d, chat_example: '%s'\n", __func__, params.chat_template.empty(), common_chat_format_example(ctx_server.model, params.chat_template).c_str());
}

-// print sample chat example to make it clear which template is used
-LOG_INF("%s: chat template, built_in: %d, chat_example: '%s'\n", __func__, params.chat_template.empty(), common_chat_format_example(ctx_server.model, params.chat_template).c_str());

ctx_server.queue_tasks.on_new_task(std::bind(
&server_context::process_single_task, &ctx_server, std::placeholders::_1));

examples/server/utils.hpp: 19 changes (14 additions & 5 deletions)
@@ -300,8 +300,9 @@ static llama_tokens format_infill(
}

// Format given chat. If tmpl is empty, we take the template from model metadata
Collaborator:

It would be nice to leave a comment on how and when prefix/suffix is used, for example:
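
One possible wording (an illustration only, not text from the PR):

    // Format the given chat.
    // When `tmpl` is "custom" (the server sets this when only --in-prefix/--in-suffix are given),
    // each user message is emitted as `prefix + content + suffix` and all other messages are
    // appended verbatim; no chat template is applied.
    // Otherwise `tmpl` is applied, or the template from the model metadata if `tmpl` is empty.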

-inline std::string format_chat(const struct llama_model * model, const std::string & tmpl, const std::vector<json> & messages) {
+inline std::string format_chat(const struct llama_model * model, const std::string & tmpl, const std::string & prefix, const std::string & suffix, const std::vector<json> & messages) {
std::vector<common_chat_msg> chat;
+std::string formatted_chat;

for (size_t i = 0; i < messages.size(); ++i) {
const auto & curr_msg = messages[i];
@@ -325,10 +326,16 @@ inline std::string format_chat(const struct llama_model * model, const std::stri
throw std::runtime_error("Missing 'content' (ref: https://github.com/ggerganov/llama.cpp/issues/8367)");
}

-chat.push_back({role, content});
+if (tmpl == "custom") {
Collaborator:

Maybe we can do this too?

Suggested change:
-if (tmpl == "custom") {
+bool is_custom = !prefix.empty() || !suffix.empty();
+if (is_custom) {

Contributor (author):

I'm not sure about this - using "custom" as a defined template name also helps with selection in the UI. Is there a reason against it that I'm missing?

Collaborator:

The UI logic and the underlying logic should be decoupled. If you have "custom" here, it becomes redundant because you already have bool is_custom = !prefix.empty() || !suffix.empty(), so two variables end up controlling the same state.

+// simple format using prefix and suffix
+if (role == "user") formatted_chat += prefix + content + suffix;
+else formatted_chat += content;
+} else {
+chat.push_back({role, content});
+}
}

-const auto formatted_chat = common_chat_apply_template(model, tmpl, chat, true);
+if (tmpl != "custom") formatted_chat = common_chat_apply_template(model, tmpl, chat, true);
Collaborator:

Then:

Suggested change:
-if (tmpl != "custom") formatted_chat = common_chat_apply_template(model, tmpl, chat, true);
+if (!is_custom) formatted_chat = common_chat_apply_template(model, tmpl, chat, true);
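
Taken together with the earlier suggestion, the loop could end up roughly like this self-contained sketch (json handling and common_chat_apply_template are replaced with stand-ins; the names are illustrative, not code from the PR):

    #include <string>
    #include <utility>
    #include <vector>

    static std::string format_chat_sketch(const std::string & prefix,
                                          const std::string & suffix,
                                          const std::vector<std::pair<std::string, std::string>> & messages) {
        const bool is_custom = !prefix.empty() || !suffix.empty();
        std::string formatted_chat;
        std::vector<std::pair<std::string, std::string>> chat; // stands in for std::vector<common_chat_msg>

        for (const auto & [role, content] : messages) {
            if (is_custom) {
                // wrap user turns with prefix/suffix, pass other roles through verbatim
                formatted_chat += (role == "user") ? prefix + content + suffix : content;
            } else {
                chat.push_back({role, content});
            }
        }
        if (!is_custom) {
            // in the real code: formatted_chat = common_chat_apply_template(model, tmpl, chat, true);
            formatted_chat = "<apply the model/tmpl chat template to `chat` here>";
        }
        return formatted_chat;
    }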

LOG_DBG("formatted_chat: '%s'\n", formatted_chat.c_str());

return formatted_chat;
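
As an illustration of the custom path (the prefix/suffix values are examples, not defaults): with --in-prefix "<|user|>\n" and --in-suffix "<|assistant|>\n", the messages user: "Hi", assistant: "Hello", user: "How are you?" would be formatted as "<|user|>\nHi<|assistant|>\nHello<|user|>\nHow are you?<|assistant|>\n", i.e. only user turns are wrapped and the trailing suffix leaves the prompt ready for the assistant's reply.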
@@ -597,13 +604,15 @@ static bool server_sent_event(httplib::DataSink & sink, const char * event, cons
static json oaicompat_completion_params_parse(
const struct llama_model * model,
const json & body, /* openai api json semantics */
-const std::string & chat_template) {
+const std::string & chat_template,
+const std::string & input_prefix,
+const std::string & input_suffix) {
json llama_params;

llama_params["__oaicompat"] = true;

// Apply chat template to the list of messages
llama_params["prompt"] = format_chat(model, chat_template, body.at("messages"));
llama_params["prompt"] = format_chat(model, chat_template, input_prefix, input_suffix, body.at("messages"));

// Handle "stop" field
if (body.contains("stop") && body.at("stop").is_string()) {