server : refactor multitask handling #9274
Conversation
Btw, one thing that we can try to optimize is to perform the tokenization prior to enqueuing the task, so that it is multi-threaded by the HTTP threads and happens in parallel with the GPU computation. Currently, the tokenization happens inside the main loop, so it is single-threaded, in between llama_decode calls.
Last time I looked into this, it got a bit tricky. But we can try to revisit it.
@ggerganov Thanks for reviewing this PR.
If the tokenization functions are thread-safe, this needs to be clearly documented in the llama.cpp API. The general expectation is that it is safe to use different objects from multiple threads, but only one thread should use the same object.
(Related to the current PR) I've finished some benchmarks confirming that there is no performance loss due to this PR, so I will merge it soon and start working on the next item.
* server : remove multitask from server_task
* refactor completions handler
* fix embeddings
* use res_ok everywhere
* small change for handle_slots_action
* use unordered_set everywhere
* (try) fix test
* no more "mutable" lambda
* Apply suggestions from code review
* use deque

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Motivation
The server example has grown in terms of functionality. I think it would be nice to start looking at how to simplify some parts and make the code more manageable.
The goal here is to better separate the HTTP part from the engine part, so that each can more easily be debugged and, eventually, optimized.
The current architecture consists of 5 main parts:
llama_decode
In this PR
Currently, a multitask is added as a "super-task" linked to other, smaller tasks. However, this introduces some complexity at the task-queue level. I believe that moving it to the HTTP handler will simplify the task queue.
Along the way, I also made some small refactorings to struct server_task and the HTTP handlers, for example adding enum server_task_cmpl_type for the completion task type: normal, embedding, infill.
Future works
I also took some time to dive deeper into the httplib source code to see if there is anything else we can take advantage of. It turns out we may even be able to get rid of the notion of deferred tasks (used when there is no free slot to process an incoming request). I will investigate this further in another issue/PR. In addition, one small issue regarding task cancellation was also identified; however, no fix is planned for now.