-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Server: try to refactor server.cpp #5065
Conversation
examples/server/utils.hpp
Outdated
typedef std::function<void(int, int, task_result&)> callback_multitask_t; | ||
callback_multitask_t callback_update_multitask; | ||
// for keeping track of all tasks waiting for the result | ||
std::mutex mutex_task_ids; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mutex is not needed.
In general, server
should never have more than 2 mutexes - one for input and one for output. If we add a 3rd one, then something is not right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, thanks for the notice. I'll see what I can do to remove that mutex_task_ids
examples/server/server.cpp
Outdated
std::condition_variable condition_tasks; | ||
std::mutex mutex_results; | ||
std::condition_variable condition_results; | ||
std::mutex mutex_multitasks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring all synchronization and queues into utils.hpp
is OK, but the result should be that there are no longer any mutexes or locks in server.cpp
Can we avoid this somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, and I suspect that queue_multitasks
may not even need mutex.
The reason why we needed mutex was because queue_multitasks
is modified by both main thread and http thread (by calling llama.next_result
)
But my new implementation of llama.next_result
(moved to llama_server_response_event.recv
) does not touch queue_multitasks
anymore. Instead, the multitask queue is updated via llama_server_response_event.send
, which is always called by main thread.
But ideally, I want multitask & task to be in the same place, i.e. maybe add it to the new struct llama_server_queue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multitasks is moved into llama_server_queue
in d87b48f and 1bd8678
I didn't have many idea to test it though, but seems to work just fine:
How I tested it: Send POST to http://localhost:8080/completion
Body:
{
"prompt": ["hi", "bonjour"],
"n_predict": 10
}
Response (expected 1 response in english and 1 in french):
{
"results": [
{
"content": ", 2018)\n\nSangam (2019",
...
},
{
"content": " le chôme-Ouest, b",
...
}
]
}
Another test using OAI /v1/chat/completions
and stream: true
also works fine.
I didn't have time to test embedding API, but I believe that it should work.
// copy aggregate results of complete multi-tasks to the results queue | ||
std::lock_guard<std::mutex> lock_results(mutex_results); | ||
queue_results.insert(queue_results.end(), agg_results.begin(), agg_results.end()); | ||
result.result_json = json{ { "results", result_jsons } }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: I also correct the response data of /completion
endpoint (root should be an object result: ...
, not an array). I think because most people use the OAI endpoint instead, so no one else has noticed the bug 😄
} | ||
|
||
// Call when the state of one slot is changed | ||
void notify_slot_changed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is optimized version of #5018 . Instead of pushing the deferred tasks back every iteration of main loop (brute force), here I only push them back when a slot change its state.
Early Load TestI have used two different tools with different load approaches strategies, with both I obtained over 40 successful requests over 120 seconds, with about ~0.27 request per second as throughout, using the same body as in ReadMe in all requests. Setup
Next Steps
## Raw Results Server setup:
Using K6:
Using Vegeta:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One general comment about server
is that I always thought we lack extensive logs of what is happening during processing. We should probably add significantly more log messages so that it is easier to debug things and have a better understanding of the current server state. With time when the implementation stabilizes, we can reduce the logs or add verbosity levels
Just something to keep in mind for future PRs - no need to extend this one for now
examples/server/server.cpp
Outdated
const int task_id = llama.request_completion(data, false, false, -1); | ||
llama.queue_results.add_waiting_task_id(task_id); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, do we need to explicitly add waiting task ids like this - can it become automatic within request_completion
or more specifically inside the post()
method?
Main thing is that I'm worried about this pattern:
void add_waiting_task_id(int task_id) {
std::unique_lock<std::mutex> lock(mutex_results);
waiting_task_ids.insert(task_id);
}
This makes the waiting_task_ids
container atomic, but it hides a potential race condition where we haven't added the task id yet, but the receiver is already waiting for it.
Similar thoughts about remove_waiting_task_id
- I think it would be better to absorb it within recv()
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During my testing, I ran into issue where user cancel the task (disconnect the HTTP stream) while the tasks are still being processed. The results are still being sent even when the socket is already closed (because the "cancel" task is not yet to be processed). The main idea of that waiting_task_ids
is to keep track if the HTTP stream is still alive or not.
I did try moving both add_waiting_task_id
and remove_waiting_task_id
into recv()
, but the problem is that in streaming mode, what if the result arrive while the HTTP server is busy sending data and that the next recv()
is not yet called? In that situation, I saw some results to be dropped.
About the race condition that you pointed out, I don't think that will happen. Basically:
- We always call
add_waiting_task_id
beforerecv
. This is always sequential remove_waiting_task_id
only get called right before the HTTP stream is closed, so no result is dropped as long as HTTP stream is alive- My idea is inspired by the idea of socket
accept
, then arecv
loop and finallyclose
Also, it worth to mention that one HTTP request is linked to one or more tasks, but one task is never be shared among HTTP requests. In the worst case, a result could not be picked up by any of the HTTP requests, it would remand in the result queue all the time. Therefore, I think it's still worth somehow keeping track of the state of HTTP connection in order not to accept "abandoned result".
But it's true that I may messed up somewhere so I saw some strange behavior. I think I'll revisit this part in another PR. What do you think about that @ggerganov ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be misreading something, but I think there is definitely a race:
thread 0 -> llama.request_completion()
thread 1 -> llama_server_response.send() <-- does nothing because no waiting tasks
thread 0 -> llama.queue_results.add_waiting_task_id()
thread 0 -> llama.queue_results.recv(task_id); <-- result is missed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah you're right, sorry I missed that point. Thanks for having noticed.
I noticed the same potential issue with split_multiprompt_task
, where I add all subtasks before adding the multitask, will definitely be fix.
Then my idea would be to generate the task ID using queue_tasks.get_next_id()
, then pass this id into request_completion
so that we can know the task_id
before adding it to the queue.
Another idea is to modify to add a param bool start = true
to queue_tasks.post()
. When calling queue_tasks.post(task, false)
, the task not be started right away. Until we call queue_tasks.start_task(task_id_or_multitask_id)
then the task (or tasks) will start. But this is quite a patch & also overkill I think.
What do you think about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved in 8f36df8
The request_completion
now requires passing an ID for the new task. When a new request arrives, things will be done in the order below (high-level view):
llama.queue_tasks.get_new_id()
llama.queue_results.add_waiting_task_id(task_id)
- Add the
task_multi
if needed - Finally, add all the
task_server
@RafaAguilar Nice result, thanks! I wonder if it's better for you to push your code into another PR. My idea is:
|
I think this is a good start
This is probably not really necessary at this point - the code is relatively small |
* server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"
* server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"
Motivation
The current server.cpp code is quite complicated to understand. This is my first trial to refactor it.
I don't even know if I'm heading in the right direction or not. I really need some helps and I'd appreciate all the suggestions and critics.
Here's what I succeeded to do so far:
oai.hpp
utils.hpp
queue_tasks
intollama_server_queue
interface, which will manage its own mutex (the idea it is not to expose any mutex to the outside of that class)queue_results
intollama_server_response_event
. My initial idea was to register one callback per HTTP stream, but I didn't success because theset_chunked_content_provider
function still being done in blocking way.My idea in the long term is:
llama_server_queue.on_all_tasks_finished