server : refactor multitask handling #9274
Conversation
Btw, one thing that we can try to optimize is to perform the tokenization prior to enqueuing the task, so that it is multi-threaded by the HTTP threads and happens in parallel with the GPU computation. Currently, the tokenization happens inside the main loop, so it is single-threaded, in between llama_decode calls.
Last time I looked into this, it got a bit tricky. But we can try to revisit it.
@ggerganov Thanks for reviewing this PR.
If the tokenization functions are thread-safe, this needs to be clearly documented in the llama.cpp API. The general expectation is that it is safe to use different objects from multiple threads, but only one thread should use the same object.
(Related to the current PR) I've finished some benchmarks confirming that there is no performance loss due to this PR, so I will merge it soon and start working on the next item.
* server : remove multitask from server_task
* refactor completions handler
* fix embeddings
* use res_ok everywhere
* small change for handle_slots_action
* use unordered_set everywhere
* (try) fix test
* no more "mutable" lambda
* Apply suggestions from code review
* use deque

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Motivation
The server example has grown in terms of functionality. I think it would be nice to start looking at how to simplify some parts and make the code more manageable.
The goal here is to better separate the HTTP part from the engine part, so that each can more easily be debugged and, eventually, optimized.
The current architecture consists of 5 main parts:
llama_decode
In this PR
Currently, a multitask is added as a "super-task" linked to other, smaller tasks. However, this introduces some complexity at the task-queue level. I believe that moving it to the HTTP handler will simplify the task queue.
Along the way, I also made some small refactorings to struct server_task and the HTTP handlers, for example adding enum server_task_cmpl_type for the completion task type: normal, embedding, infill.
Future works
I also took some time to dive deeper into the httplib source code to see if there is anything else we can take advantage of. It turns out we may even be able to get rid of the notion of deferred tasks (used when there is no free slot to process an incoming request). I will investigate this further in another issue/PR. In addition, one small issue regarding task cancellation was also identified; however, no fix is planned for now.