
server : refactor multitask handling #9274

Merged
merged 10 commits into ggerganov:master
Sep 2, 2024

Conversation

ngxson
Collaborator

@ngxson ngxson commented Sep 2, 2024

Motivation

The server example has grown considerably in terms of functionality. I think it would be nice to start looking at how to simplify some parts and make the code more manageable.

The goal here is to separate the HTTP part and the engine part more cleanly, so they can easily be debugged and potentially optimized in the future.

The current architecture consists of 5 main parts:

  • HTTP stack / handler
  • Task queue
  • Result queue
  • Task dispatcher: responsible for loading tasks into slots
  • Slot dispatcher: responsible for batching and calling llama_decode
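A minimal sketch of how the task queue above could be wired between the HTTP handlers and the main loop (the types and names here are illustrative, not the actual server.cpp structures):

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Illustrative task type; the real server_task carries prompt data, ids, etc.
struct server_task {
    int id;
    std::function<int(int)> work; // placeholder for the real payload
};

// HTTP handlers post tasks; the dispatcher pops them from the main loop.
struct task_queue {
    std::deque<server_task>  tasks;
    std::mutex               mtx;
    std::condition_variable  cv;

    void post(server_task t) {
        {
            std::lock_guard<std::mutex> lk(mtx);
            tasks.push_back(std::move(t));
        }
        cv.notify_one();
    }

    server_task wait_and_pop() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return !tasks.empty(); });
        server_task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
};
```

The result queue follows the same pattern in the opposite direction: the main loop posts results and the HTTP handler waits for them.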

In this PR

Currently, a multitask is added as a "super-task" linked to other smaller tasks. However, this introduces some complexity at the task queue level. I believe that moving it to the HTTP handler will simplify the task queue.
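As a toy illustration of the idea (handle_multi and the inline "processing" are hypothetical, not the real server code): the HTTP layer fans a multi-prompt request out into N independent tasks and gathers the N results itself, so the task queue never needs a super-task notion.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: split one multi-prompt request into independent
// per-prompt tasks at the HTTP layer and collect the results there.
std::vector<std::string> handle_multi(const std::vector<std::string> & prompts) {
    std::vector<std::string> results;
    results.reserve(prompts.size());
    for (const auto & prompt : prompts) {
        // in the real server each prompt becomes its own server_task posted
        // to the queue; here we just "process" it inline for illustration
        results.push_back("echo: " + prompt);
    }
    return results;
}
```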

Along the way, I also did some small refactoring of struct server_task and the HTTP handlers, for example:

  • a new enum server_task_cmpl_type for the completion task type: normal, embedding, infill
  • reuse the same HTTP handler for both the infill and completion endpoints
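The new enum can be sketched like this (the enumerator names follow the naming pattern described above; cmpl_type_name is a hypothetical helper added for illustration, e.g. for logging):

```cpp
#include <cassert>
#include <cstring>

// Completion task type: normal, embedding, infill.
enum server_task_cmpl_type {
    SERVER_TASK_CMPL_TYPE_NORMAL,
    SERVER_TASK_CMPL_TYPE_EMBEDDING,
    SERVER_TASK_CMPL_TYPE_INFILL,
};

// Hypothetical helper mapping each type to a printable name.
const char * cmpl_type_name(server_task_cmpl_type type) {
    switch (type) {
        case SERVER_TASK_CMPL_TYPE_NORMAL:    return "normal";
        case SERVER_TASK_CMPL_TYPE_EMBEDDING: return "embedding";
        case SERVER_TASK_CMPL_TYPE_INFILL:    return "infill";
    }
    return "unknown";
}
```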

Future work

I also took the time to dive deeper into the httplib source code to see if there is anything else we can take advantage of. It turns out we can maybe even get rid of the notion of deferred tasks (used when there is no free slot to process the incoming request). I will investigate this further in another issue/PR.

In addition, one small issue regarding task cancellation was also noticed. However, no fix is planned for now.


@github-actions github-actions bot added the python python script changes label Sep 2, 2024
@ngxson ngxson removed the python python script changes label Sep 2, 2024
@github-actions github-actions bot added the python python script changes label Sep 2, 2024
Owner

@ggerganov ggerganov left a comment


Btw, one thing that we can try to optimize is to perform the tokenization prior to enqueuing the task, so that it is multi-threaded by the HTTP threads and happens in parallel with the GPU computation. Currently, the tokenization happens inside the main loop, so it is single-threaded, in between llama_decode calls.

Last time I looked into this, it got a bit tricky. But we can try to revisit it.
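A sketch of the proposed optimization, with my_tokenize standing in for the real llama_tokenize call (both names and the per-character "tokenizer" are purely illustrative):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in tokenizer: one "token" per character, for illustration only.
static std::vector<int> my_tokenize(const std::string & text) {
    return std::vector<int>(text.begin(), text.end());
}

// A task that already carries tokens instead of raw text.
struct pretokenized_task {
    std::vector<int> tokens;
};

// Runs on the HTTP worker thread, before the task is enqueued, so the
// tokenization can overlap with llama_decode running in the main loop.
static pretokenized_task make_task(const std::string & prompt) {
    return pretokenized_task{ my_tokenize(prompt) };
}
```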

@ngxson
Collaborator Author

ngxson commented Sep 2, 2024

@ggerganov Thanks for reviewing this PR.

FYI, handle_completions can currently accept either a string or a list of tokens as input, so I think it will be easy to implement tokenization inside the HTTP threads. I'll have a look at this in another PR.
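For illustration, the "string or list of tokens" input could be modeled like this (the real handler works on parsed JSON rather than a std::variant, and the character-level tokenizer is a stand-in):

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Either raw text or an already-tokenized prompt.
using prompt_input = std::variant<std::string, std::vector<int>>;

// Normalize both input forms to a token list.
std::vector<int> to_tokens(const prompt_input & in) {
    if (std::holds_alternative<std::vector<int>>(in)) {
        return std::get<std::vector<int>>(in); // already tokenized
    }
    const std::string & text = std::get<std::string>(in);
    return std::vector<int>(text.begin(), text.end()); // stand-in tokenizer
}
```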

@slaren
Collaborator

slaren commented Sep 2, 2024

Btw, one thing that we can try to optimize is to perform the tokenization prior to enqueuing the task, so that it is multi-threaded by the HTTP threads and happens in parallel with the GPU computation. Currently, the tokenization happens inside the main loop, so it is single-threaded, in between llama_decode calls.

If the tokenization functions are thread-safe, this needs to be clearly documented in the llama.cpp API. The general expectation is that it is thread-safe to use different objects in multiple threads, but only one thread should use the same object.
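Absent a documented thread-safety guarantee, the conservative pattern implied above is to serialize access to the shared object, as in this sketch (hypothetical wrapper, with a character-level stand-in for the real tokenization call):

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// One shared object, many threads: guard every call with a mutex unless the
// API documents stronger thread-safety guarantees.
struct guarded_tokenizer {
    std::mutex mtx;

    std::vector<int> tokenize(const std::string & text) {
        std::lock_guard<std::mutex> lk(mtx); // one caller at a time
        return std::vector<int>(text.begin(), text.end()); // stand-in
    }
};
```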

@ngxson
Collaborator Author

ngxson commented Sep 2, 2024

(Related to the current PR) I've finished some benchmarks confirming that there is no performance loss due to this PR, so I will merge it soon and start working on the next item.

@ngxson ngxson merged commit 6e7d133 into ggerganov:master Sep 2, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* server : remove multitask from server_task

* refactor completions handler

* fix embeddings

* use res_ok everywhere

* small change for handle_slots_action

* use unordered_set everywhere

* (try) fix test

* no more "mutable" lambda

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use deque

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels
examples · python (python script changes) · server
3 participants