
server : simplify state machine for slot #9283

Merged
15 commits merged on Sep 6, 2024

Conversation

@ngxson (Collaborator) commented on Sep 2, 2024

Currently, the state of a slot is controlled by two variables: slot.state (2 possible values) and slot.command (3 possible values). This makes the total number of combined states 2x3 = 6, but some of them are in fact invalid.

This PR aims to simplify the slot state machine by unifying them into 4 states.

The state machine can be represented by the graph below:

```mermaid
graph TD;
    SLOT_STATE_IDLE-- new task -->SLOT_STATE_PROCESSING_PROMPT;
    SLOT_STATE_PROCESSING_PROMPT-- decode prompt -->SLOT_STATE_PROCESSING_PROMPT;
    SLOT_STATE_PROCESSING_PROMPT-- done processing prompt -->SLOT_STATE_DONE_PROMPT;
    SLOT_STATE_DONE_PROMPT-- is embedding -->SLOT_STATE_IDLE;
    SLOT_STATE_DONE_PROMPT-- is next-token prediction -->SLOT_STATE_GENERATING;
    SLOT_STATE_GENERATING-- decode next token -->SLOT_STATE_GENERATING;
    SLOT_STATE_GENERATING-- stop condition -->SLOT_STATE_IDLE;
```
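To make the transitions concrete, below is a minimal, self-contained C++ sketch of the unified state machine. The enum values match the graph, but the struct fields and helper parameters are illustrative assumptions, not the PR's actual server.cpp code.

```cpp
// Sketch only: the four unified slot states and one step of the transition logic.
enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING,
};

struct slot {
    slot_state state        = SLOT_STATE_IDLE;
    bool       is_embedding = false;  // embedding request vs. next-token prediction
};

// Advance one slot by a single step, following the edges of the graph above.
// The boolean inputs stand in for the real conditions computed in update_slots().
void step(slot & s, bool has_new_task, bool prompt_done, bool stop_condition) {
    switch (s.state) {
        case SLOT_STATE_IDLE:
            if (has_new_task)   s.state = SLOT_STATE_PROCESSING_PROMPT;
            break;
        case SLOT_STATE_PROCESSING_PROMPT:  // loops here while the prompt is decoded in chunks
            if (prompt_done)    s.state = SLOT_STATE_DONE_PROMPT;
            break;
        case SLOT_STATE_DONE_PROMPT:        // branch on the type of request
            s.state = s.is_embedding ? SLOT_STATE_IDLE : SLOT_STATE_GENERATING;
            break;
        case SLOT_STATE_GENERATING:         // loops here until a stop condition is hit
            if (stop_condition) s.state = SLOT_STATE_IDLE;
            break;
    }
}
```

With a single enum, every reachable value is a valid state, which removes the invalid slot.state/slot.command combinations by construction.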

@ngxson (Collaborator, Author) commented on Sep 3, 2024

Hmm, this seems to break context self-extend (the passkey test; it's not run on CI, but I ran it locally).

Actually, I got a timeout error locally due to the time it takes to download the model from the internet (we use the Phi-2 model for the passkey test).

@github-actions bot added the python (python script changes) label on Sep 3, 2024
@ngxson (Collaborator, Author) commented on Sep 3, 2024

Benchmark with -np 8 --http-threads 32 -dt 0.05 and VUs = 16

master (with n_decode_total metric added):

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 3100
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 7.77194
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 861.867
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 23.0447

tgi_load_test-1  |      input_tokens...................: 154207  720.035006/s
tgi_load_test-1  |      iteration_duration.............: avg=6.22s    min=1.53s    med=4.19s    max=53.7s    p(90)=7.35s    p(95)=8.03s   
tgi_load_test-1  |      iterations.....................: 500     2.334638/s
tgi_load_test-1  |      new_tokens.....................: 24061   112.347444/s
tgi_load_test-1  |      time_per_token.................: avg=116.1ms  min=33.01ms  med=65.47ms  max=1.96s    p(90)=132.35ms p(95)=170.77ms
tgi_load_test-1  |      tokens.........................: 178268  832.38245/s

PR:

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 3030
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 7.94323
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 1140.16
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 20.4767

tgi_load_test-1  |      input_tokens...................: 154207  719.643964/s
tgi_load_test-1  |      iteration_duration.............: avg=6.23s    min=1.46s    med=4.28s    max=49.19s   p(90)=8.09s    p(95)=10.84s  
tgi_load_test-1  |      iterations.....................: 500     2.33337/s
tgi_load_test-1  |      new_tokens.....................: 24044   112.207095/s
tgi_load_test-1  |      time_per_token.................: avg=120.57ms min=34.58ms  med=66.86ms  max=3.02s    p(90)=159.28ms p(95)=215.28ms
tgi_load_test-1  |      tokens.........................: 178251  831.851058/s
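For readers interpreting the numbers above: n_busy_slots_per_decode is the average number of busy (non-idle) slots per llama_decode() call. A rough sketch of how such a metric could be accumulated is shown below; it is illustrative only, and the struct and method names are hypothetical rather than the server's actual bookkeeping.

```cpp
// Illustrative sketch: accumulate two counters and export their ratio.
#include <cstdint>

struct server_metrics_sketch {
    uint64_t n_decode_total     = 0;  // total llama_decode() calls
    uint64_t n_busy_slots_total = 0;  // sum of busy slots over all decode calls

    // call once per llama_decode(), passing the number of busy (non-idle) slots
    void on_decode(uint32_t n_busy_slots) {
        n_decode_total     += 1;
        n_busy_slots_total += n_busy_slots;
    }

    // the value exported as llamacpp:n_busy_slots_per_decode
    double busy_slots_per_decode() const {
        return n_decode_total == 0 ? 0.0 : double(n_busy_slots_total) / double(n_decode_total);
    }
};
```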

@ngxson marked this pull request as ready for review on September 3, 2024 at 11:15
@ngxson (Collaborator, Author) commented on Sep 3, 2024

@ggerganov FYI, this PR also fixes the point about notify_slot_changed() that I brought up yesterday:

> I think notify_slot_changed() is probably a bottleneck. This function is used to bring deferred tasks back into the main queue. However, because it is called inside update_slots, where the slots are already set, the slot stays idle during the current iteration, while in reality it could take a new task right away. (In other words, we bring back deferred tasks without processing them right away.)

As you can see in the benchmark above, n_busy_slots_per_decode goes up from 7.7 to 7.9 with this PR. The consequence is that prompt processing tok/s increases (as prompts from new requests appear in the batch more often), but the downside is that generation tok/s decreases. I'm testing on 1x A10G; the result may be different on better hardware.
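A rough sketch of the idea behind that fix follows, under the assumption that deferred tasks live in a second deque; the member names mirror the server.cpp style but may differ, and this is not the PR's exact code. When a slot is released, the oldest deferred task is moved straight back into the main queue and the worker is woken up, so it can be processed in the same iteration instead of the next one.

```cpp
// Sketch only: pull a deferred task back into the main queue and notify the worker.
#include <condition_variable>
#include <deque>
#include <mutex>

struct server_task { int id = -1; /* ... */ };

struct server_queue_sketch {
    std::deque<server_task>  queue_tasks;           // main task queue
    std::deque<server_task>  queue_tasks_deferred;  // tasks waiting for a free slot
    std::mutex               mutex_tasks;
    std::condition_variable  condition_tasks;

    // called when a slot is released
    void pop_deferred_task() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (!queue_tasks_deferred.empty()) {
            queue_tasks.push_back(std::move(queue_tasks_deferred.front()));
            queue_tasks_deferred.pop_front();
        }
        // wake the worker so the task is picked up without an extra idle iteration
        condition_tasks.notify_one();
    }
};
```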

@ngxson requested a review from slaren on September 3, 2024 at 11:21
@ggerganov (Owner) commented:

Is it normal for the input_tokens to be different in the 2 tests?

@ngxson (Collaborator, Author) commented on Sep 3, 2024

> Is it normal for the input_tokens to be different in the 2 tests?

No, it should not be. I re-ran the test and updated the results, so input_tokens now matches between the 2 tests.

(I was having some network errors during the initial run, which is why it had fewer tokens.)

@slaren (Collaborator) left a comment


Looks good to me, but I don't really know the server code well enough to review this.

@ggerganov (Owner) commented:

According to 040fdde there was an address sanitizer issue. AFAICT the commit should not make a difference, except for avoiding a copy of the task. Is the address sanitizer problem still present, and how can it be reproduced?

@ngxson (Collaborator, Author) commented on Sep 6, 2024

@ggerganov Thanks for pointing that out. You're right: there was indeed a missing lock in post(std::vector<server_task> & tasks). The function was added during the multitask refactoring. It makes the CI fail randomly, so I'll make a dedicated PR to fix that.

Besides that, I changed queue_tasks.erase(queue_tasks.begin()) to queue_tasks.pop_front(), since the queue is now a std::deque. Not sure if it changes anything, but I re-ran the CI multiple times and they all passed.
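Roughly, the two changes described above look like the sketch below. It is illustrative rather than the exact diff; pop_next() and the member names are stand-ins.

```cpp
// Sketch only: lock the multi-task post() overload and use std::deque::pop_front().
#include <deque>
#include <mutex>
#include <vector>

struct server_task { int id = -1; /* ... */ };

struct server_queue_sketch {
    std::deque<server_task> queue_tasks;
    std::mutex              mutex_tasks;

    // multi-task overload: the lock below is the one that was missing
    void post(std::vector<server_task> & tasks) {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        for (auto & task : tasks) {
            queue_tasks.push_back(std::move(task));
        }
    }

    server_task pop_next() {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        server_task task = std::move(queue_tasks.front());
        queue_tasks.pop_front();  // was: queue_tasks.erase(queue_tasks.begin())
        return task;
    }
};
```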

examples/server/server.cpp: review thread (outdated, resolved)
ngxson and others added 2 commits September 6, 2024 14:06
@ngxson (Collaborator, Author) commented on Sep 6, 2024

To make sure that it's not a fluke, I re-ran the CI 5 times:

[screenshot: CI run results]

So I can assume that this is safe to merge now.

@ngxson merged commit 9b2c24c into ggerganov:master on Sep 6, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* server : simplify state machine for slot

* add SLOT_STATE_DONE_PROMPT

* pop_deferred_task

* add missing notify_one

* fix passkey test

* metrics : add n_busy_slots_per_decode

* fix test step

* add test

* maybe fix AddressSanitizer?

* fix deque ?

* missing lock

* pop_deferred_task: also notify

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Labels: examples, python (python script changes), server

3 participants