
server : simplify state machine for slot #9283

Merged
15 commits merged on Sep 6, 2024

Conversation

@ngxson (Collaborator) commented on Sep 2, 2024

Currently, the state of a slot is controlled by two variables: slot.state (2 possible values) and slot.command (3 possible values). This makes the total number of combined states 2x3 = 6, but some of them are in fact invalid.

This PR aims to simplify the slot state machine by unifying them into 4 states.

The state machine can be represented by the graph below:

```mermaid
graph TD;
    SLOT_STATE_IDLE-- new task -->SLOT_STATE_PROCESSING_PROMPT;
    SLOT_STATE_PROCESSING_PROMPT-- decode prompt -->SLOT_STATE_PROCESSING_PROMPT;
    SLOT_STATE_PROCESSING_PROMPT-- done processing prompt -->SLOT_STATE_DONE_PROMPT;
    SLOT_STATE_DONE_PROMPT-- is embedding -->SLOT_STATE_IDLE;
    SLOT_STATE_DONE_PROMPT-- is next-token prediction -->SLOT_STATE_GENERATING;
    SLOT_STATE_GENERATING-- decode next token -->SLOT_STATE_GENERATING;
    SLOT_STATE_GENERATING-- stop condition -->SLOT_STATE_IDLE;
```
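To make the transitions concrete, below is a minimal, self-contained C++ sketch of the unified state machine. The enum values match the graph, but the struct fields and helper parameters are illustrative assumptions, not the PR's actual server.cpp code.

```cpp
// Sketch only: the four unified slot states and one step of the transition logic.
enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING,
};

struct slot {
    slot_state state        = SLOT_STATE_IDLE;
    bool       is_embedding = false;  // embedding request vs. next-token prediction
};

// Advance one slot by a single step, following the edges of the graph above.
// The boolean inputs stand in for the real conditions computed in update_slots().
void step(slot & s, bool has_new_task, bool prompt_done, bool stop_condition) {
    switch (s.state) {
        case SLOT_STATE_IDLE:
            if (has_new_task)   s.state = SLOT_STATE_PROCESSING_PROMPT;
            break;
        case SLOT_STATE_PROCESSING_PROMPT:  // loops here while the prompt is decoded in chunks
            if (prompt_done)    s.state = SLOT_STATE_DONE_PROMPT;
            break;
        case SLOT_STATE_DONE_PROMPT:        // branch on the type of request
            s.state = s.is_embedding ? SLOT_STATE_IDLE : SLOT_STATE_GENERATING;
            break;
        case SLOT_STATE_GENERATING:         // loops here until a stop condition is hit
            if (stop_condition) s.state = SLOT_STATE_IDLE;
            break;
    }
}
```

With a single enum, every reachable value is a valid state, which removes the invalid slot.state/slot.command combinations by construction.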

@ngxson (Collaborator, Author) commented on Sep 3, 2024

Hmm, this seems to break context self-extend (the passkey test; it's not run on CI, but I ran it locally).

Actually, I got a timeout error locally due to the time it takes to download the model from the internet (we use the Phi-2 model for the passkey test).

@github-actions bot added the python (python script changes) label on Sep 3, 2024
@ngxson (Collaborator, Author) commented on Sep 3, 2024

Benchmark with -np 8 --http-threads 32 -dt 0.05 and VUs = 16

master (with n_decode_total metric added):

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 3100
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 7.77194
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 861.867
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 23.0447

tgi_load_test-1  |      input_tokens...................: 154207  720.035006/s
tgi_load_test-1  |      iteration_duration.............: avg=6.22s    min=1.53s    med=4.19s    max=53.7s    p(90)=7.35s    p(95)=8.03s   
tgi_load_test-1  |      iterations.....................: 500     2.334638/s
tgi_load_test-1  |      new_tokens.....................: 24061   112.347444/s
tgi_load_test-1  |      time_per_token.................: avg=116.1ms  min=33.01ms  med=65.47ms  max=1.96s    p(90)=132.35ms p(95)=170.77ms
tgi_load_test-1  |      tokens.........................: 178268  832.38245/s

PR:

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 3030
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 7.94323
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 1140.16
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 20.4767

tgi_load_test-1  |      input_tokens...................: 154207  719.643964/s
tgi_load_test-1  |      iteration_duration.............: avg=6.23s    min=1.46s    med=4.28s    max=49.19s   p(90)=8.09s    p(95)=10.84s  
tgi_load_test-1  |      iterations.....................: 500     2.33337/s
tgi_load_test-1  |      new_tokens.....................: 24044   112.207095/s
tgi_load_test-1  |      time_per_token.................: avg=120.57ms min=34.58ms  med=66.86ms  max=3.02s    p(90)=159.28ms p(95)=215.28ms
tgi_load_test-1  |      tokens.........................: 178251  831.851058/s
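For readers interpreting the numbers above: n_busy_slots_per_decode is the average number of busy (non-idle) slots per llama_decode() call. A rough sketch of how such a metric could be accumulated is shown below; it is illustrative only, and the struct and method names are hypothetical rather than the server's actual bookkeeping.

```cpp
// Illustrative sketch: accumulate two counters and export their ratio.
#include <cstdint>

struct server_metrics_sketch {
    uint64_t n_decode_total     = 0;  // total llama_decode() calls
    uint64_t n_busy_slots_total = 0;  // sum of busy slots over all decode calls

    // call once per llama_decode(), passing the number of busy (non-idle) slots
    void on_decode(uint32_t n_busy_slots) {
        n_decode_total     += 1;
        n_busy_slots_total += n_busy_slots;
    }

    // the value exported as llamacpp:n_busy_slots_per_decode
    double busy_slots_per_decode() const {
        return n_decode_total == 0 ? 0.0 : double(n_busy_slots_total) / double(n_decode_total);
    }
};
```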

@ngxson marked this pull request as ready for review on September 3, 2024 at 11:15
@ngxson (Collaborator, Author) commented on Sep 3, 2024

@ggerganov FYI, this PR also fixes the point about notify_slot_changed() that I brought up yesterday:

> I think notify_slot_changed() is probably a bottleneck. This function is used to bring deferred tasks back into the main queue. However, because it is called inside update_slots, where the slots are already set, the slot stays idle during the current iteration, while in reality it could take a new task right away. (In other words, we bring back deferred tasks without processing them right away.)

As you can see in the benchmark above, n_busy_slots_per_decode goes up from 7.7 to 7.9 with this PR. The consequence is that prompt processing tok/s increases (as prompts from new requests appear in the batch more often), but the downside is that generation tok/s decreases. I'm testing on 1x A10G; the result may be different on better hardware.
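A rough sketch of the idea behind that fix follows, under the assumption that deferred tasks live in a second deque; the member names mirror the server.cpp style but may differ, and this is not the PR's exact code. When a slot is released, the oldest deferred task is moved straight back into the main queue and the worker is woken up, so it can be processed in the same iteration instead of the next one.

```cpp
// Sketch only: pull a deferred task back into the main queue and notify the worker.
#include <condition_variable>
#include <deque>
#include <mutex>

struct server_task { int id = -1; /* ... */ };

struct server_queue_sketch {
    std::deque<server_task>  queue_tasks;           // main task queue
    std::deque<server_task>  queue_tasks_deferred;  // tasks waiting for a free slot
    std::mutex               mutex_tasks;
    std::condition_variable  condition_tasks;

    // called when a slot is released
    void pop_deferred_task() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (!queue_tasks_deferred.empty()) {
            queue_tasks.push_back(std::move(queue_tasks_deferred.front()));
            queue_tasks_deferred.pop_front();
        }
        // wake the worker so the task is picked up without an extra idle iteration
        condition_tasks.notify_one();
    }
};
```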

@ngxson requested a review from slaren on September 3, 2024 at 11:21
@ggerganov (Owner) commented:

Is it normal for the input_tokens to be different in the 2 tests?

@ngxson (Collaborator, Author) commented on Sep 3, 2024

> Is it normal for the input_tokens to be different in the 2 tests?

No, it should not be. I re-ran the test and updated the results, so input_tokens now matches between the 2 tests.

(I was having some network errors during the initial run, which is why it had fewer tokens.)

@slaren (Collaborator) left a comment


Looks good to me, but I don't really know the server code well enough to review this.

@ggerganov (Owner) commented:

According to 040fdde there was an address sanitizer issue. AFAICT the commit should not make a difference, except for avoiding a copy of the task. Is the address sanitizer problem still present, and how can it be reproduced?

@ngxson (Collaborator, Author) commented on Sep 6, 2024

@ggerganov Thanks for pointing that out. You're right: there was indeed a missing lock in post(std::vector<server_task> & tasks). The function was added during the multitask refactoring. It makes the CI fail randomly, so I'll make a dedicated PR to fix that.

Besides that, I changed queue_tasks.erase(queue_tasks.begin()) to queue_tasks.pop_front(), since the queue is now a std::deque. Not sure if it changes anything, but I re-ran the CI multiple times and they all passed.
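Roughly, the two changes described above look like the sketch below. It is illustrative rather than the exact diff; pop_next() and the member names are stand-ins.

```cpp
// Sketch only: lock the multi-task post() overload and use std::deque::pop_front().
#include <deque>
#include <mutex>
#include <vector>

struct server_task { int id = -1; /* ... */ };

struct server_queue_sketch {
    std::deque<server_task> queue_tasks;
    std::mutex              mutex_tasks;

    // multi-task overload: the lock below is the one that was missing
    void post(std::vector<server_task> & tasks) {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        for (auto & task : tasks) {
            queue_tasks.push_back(std::move(task));
        }
    }

    server_task pop_next() {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        server_task task = std::move(queue_tasks.front());
        queue_tasks.pop_front();  // was: queue_tasks.erase(queue_tasks.begin())
        return task;
    }
};
```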

examples/server/server.cpp: review thread (outdated, resolved)
ngxson and others added 2 commits September 6, 2024 14:06
@ngxson (Collaborator, Author) commented on Sep 6, 2024

To make sure that it's not a fluke, I re-ran the CI 5 times:

[screenshot: CI run results]

So I can assume that this is safe to merge now.

@ngxson merged commit 9b2c24c into ggerganov:master on Sep 6, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* server : simplify state machine for slot

* add SLOT_STATE_DONE_PROMPT

* pop_deferred_task

* add missing notify_one

* fix passkey test

* metrics : add n_busy_slots_per_decode

* fix test step

* add test

* maybe fix AddressSanitizer?

* fix deque ?

* missing lock

* pop_deferred_task: also notify

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Labels: examples, python (python script changes), server

3 participants