[V1] `AsyncLLM` Implementation #9826

robertgshaw2-neuralmagic · 2024-10-30T03:51:20Z

SUMMARY:

AsyncLLM in V1 - better overlapping of GPU and CPU

TODO:

FOLLOW UP PRS:

Benchmarking with CUDAGraphs (todo as follow up given cudagraphs are broken)
Robustness (health checks, make sure abort is working properly everywhere)
More AsyncLLM and LLMEngine tests (abort, stop string, other unit)
Enable multiprocessing for LLM by default (need to figure out a way around fork) - currently, need to set VLLM_ENABLE_V1_MULTIPROCESSING=1

DIAGRAM:

Note: this diagram is a bit dated. There is an EngineCoreClient class that is used by the AsyncLLM to interact with the EngineCore, but the overall architecture is close to what we have.
Note: stop strings are detected in the detokenizer and we send an abort message from output_handler_loop to EngineCore
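The stop-string flow in the second note can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in rather than the PR's code: `ToyDetokenizer`, the two `asyncio.Queue` objects, and the `(request_id, new_text)` message shape are assumptions made only to show the idea that detection happens on the detokenizer side and an abort is then sent back toward the engine core.

```python
import asyncio


class ToyDetokenizer:
    """Hypothetical stand-in: accumulates text per request and checks stop strings."""

    def __init__(self, stop_strings):
        self.stop_strings = stop_strings
        self.text = {}

    def step(self, request_id, new_text):
        """Append new text; return True if a stop string now appears."""
        full = self.text.get(request_id, "") + new_text
        self.text[request_id] = full
        return any(s in full for s in self.stop_strings)


async def output_handler_loop(outputs, aborts, detok):
    """Consume engine outputs; queue an abort when a stop string is seen."""
    while True:
        item = await outputs.get()
        if item is None:  # sentinel: no more outputs
            break
        request_id, new_text = item
        if detok.step(request_id, new_text):
            # Stands in for the abort message sent to the engine core.
            await aborts.put(request_id)


async def main():
    outputs = asyncio.Queue()
    aborts = asyncio.Queue()
    detok = ToyDetokenizer(stop_strings=["</s>"])
    for item in [("req-0", "Hello"), ("req-0", " world</s>"), ("req-1", "Hi"), None]:
        outputs.put_nowait(item)
    await output_handler_loop(outputs, aborts, detok)
    aborted = []
    while not aborts.empty():
        aborted.append(aborts.get_nowait())
    return aborted


aborted = asyncio.run(main())
print(aborted)
```

In this toy run only `req-0` produces text containing the stop string, so it is the only request queued for abort.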

WoosukKwon

Thanks for the great work!

WoosukKwon · 2024-11-11T22:09:11Z

vllm/envs.py

+    "VLLM_ENABLE_V1_MULTIPROCESSING":
+    lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0"))),


QQ: In which case should we turn this on?

VLLM_ENABLE_V1_MULTIPROCESSING=1 enables multiprocessing for EngineCore inside LLM (multiprocessing is always used for AsyncLLM right now). It is faster than the current implementation.

We will want to enable VLLM_ENABLE_V1_MULTIPROCESSING=1, but right now it is a problem for LLM since we cannot spawn without an if __name__ == "__main__" guard. We left solving this issue for follow up work.
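A minimal sketch of the constraint described above, using only the standard library. `engine_core_stub` and `main` are hypothetical names for illustration; the flag parsing mirrors the `envs.py` entry quoted earlier. With the `spawn` start method the child process re-imports the main module, so process creation has to sit behind an `if __name__ == "__main__"` guard.

```python
import multiprocessing as mp
import os


def v1_multiprocessing_enabled(env=os.environ):
    # Same parsing as the envs.py entry quoted above: "0"/"1" -> bool.
    return bool(int(env.get("VLLM_ENABLE_V1_MULTIPROCESSING", "0")))


def engine_core_stub(request):
    # Hypothetical stand-in for work done by a background engine process.
    print(f"pid={os.getpid()} handled {request!r}")


def main():
    if v1_multiprocessing_enabled():
        # "spawn" starts a fresh interpreter that re-imports this module.
        ctx = mp.get_context("spawn")
        proc = ctx.Process(target=engine_core_stub, args=("hello",))
        proc.start()
        proc.join()
    else:
        # Flag unset: run in the current process.
        engine_core_stub("hello")


if __name__ == "__main__":
    # Without this guard, the re-import in a spawned child would reach the
    # process-creation code again and Python raises a RuntimeError.
    main()
```

A user script that calls into a library which spawns internally needs the same guard, which is why enabling this by default for `LLM` is not transparent to existing scripts.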

robertgshaw2-neuralmagic · 2024-11-12T03:12:20Z

@robertgshaw2-neuralmagic @njhill this looks super 🚀🚀🚀

Couple questions for my own understanding:

How should I interpret the name core? There's both a v1/core package as well as v1/engine/core.py

It looks like this is way faster than v0 + multistep decoding, are we planning on ditching multistep in v1 or is that still TBD?

Thanks!

It's a bit of an unfortunate naming conflict. We can consider moving some files from this diff into v1/core.
The goal of V1 is to simplify vLLM and make it fast enough that multistep is not needed, since the multistep code is complex and hard to maintain.

lixiaolx · 2024-11-13T14:30:05Z

@robertgshaw2-neuralmagic I'm glad to see your optimized PR. I found some problems during testing and wanted to ask for advice. I tested llama2-7b on 1 GPU with batch=256 using the V1 engine and compared pr-9289 with your PR. The token gap breaks down as follows:
pr-9289:

this-pr:

I am very happy that the new implementation has removed the token enqueue and dequeue time, but I found that update_schedule and schedule take longer in the new version, so the total gap time has not changed much.
I carefully compared the code and found no big changes in the implementation.
I wonder whether the new multi-threading for encode and decode is what makes them take longer.

robertgshaw2-neuralmagic · 2024-11-13T14:50:48Z

Hey @lixiaolx - thanks for taking a look. I am having a hard time understanding your analysis - could you clarify?

njhill · 2024-11-13T19:43:55Z

Thanks @lixiaolx, nice profiles! What you observe is not unexpected since the scheduling logic currently contends for the GIL with the IPC message serialization/deserialization.

Our intention is to improve this very soon but doing the IPC work in a separate thread is still a big win as a first step since much of that work overlaps with parts of the critical loop that don't contend for the GIL, primarily the forward pass in the GPU.
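The pattern described here can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `pickle` stands in for the IPC serialization, a list stands in for the socket, and `core_loop` is a hypothetical stand-in for the schedule / forward pass / update cycle. The serialization thread still takes the GIL while it runs, so it can slow pure-Python scheduling, but it can make progress whenever the main loop is waiting on work that releases the GIL.

```python
import pickle
import queue
import threading


def ipc_sender(out_q, wire):
    # Background thread: serializes outputs off the main loop.
    while True:
        item = out_q.get()
        if item is None:  # sentinel: shut down
            break
        wire.append(pickle.dumps(item))


def core_loop(num_steps, out_q):
    # Hypothetical stand-in for schedule -> forward pass -> update.
    for step in range(num_steps):
        result = {"step": step, "token_ids": [step, step + 1]}
        out_q.put(result)  # hand off; serialization happens in the other thread
    out_q.put(None)


out_q = queue.Queue()
wire = []  # stands in for a socket
sender = threading.Thread(target=ipc_sender, args=(out_q, wire), daemon=True)
sender.start()
core_loop(3, out_q)
sender.join()

# The receiving side would deserialize in the same order.
decoded = [pickle.loads(blob) for blob in wire]
print(decoded)
```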

Signed-off-by: Nick Hill <nickhill@us.ibm.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

lixiaolx · 2024-11-14T02:34:38Z

> Thanks @lixiaolx, nice profiles! What you observe is not unexpected since the scheduling logic currently contends for the GIL with the IPC message serialization/deserialization.
>
> Our intention is to improve this very soon but doing the IPC work in a separate thread is still a big win as a first step since much of that work overlaps with parts of the critical loop that don't contend for the GIL, primarily the forward pass in the GPU.

Thank you very much for your answer. I tried comparing against this solution. If the GIL problem is solved, the remaining gap time will be 2-3ms according to the above calculation.
Are there any plans for asynchronous scheduling? Compared with sglang's asynchronous scheduling, there is still a gap.
I recently measured that the overall gap of sglang's asynchronous solution under the same conditions is 200-300us. If you have a plan, is there a timeline?

lixiaolx · 2024-11-14T02:39:14Z

> Hey @lixiaolx - thanks for taking a look. I am having a hard time understanding your analysis - could you clarify?

@robertgshaw2-neuralmagic
I compared the previous PR with your current PR using nsys. I added nvtx ranges around the calls made in the main loop, then split out and analyzed the CPU overhead between two consecutive GPU forward passes.

lixiaolx · 2024-11-14T13:48:52Z

@robertgshaw2-neuralmagic @njhill Hello, does this PR support multiple GPUs? When testing llama2-70b on 8 GPUs, the server log got stuck here.

Using nvidia-smi, I found that only GPU 0 was in use, occupying only about 500MB.

njhill · 2024-11-14T13:58:41Z

@lixiaolx the V1 path is still in an alpha state and does not yet support multiple GPUs, but will do soon.

lixiaolx · 2024-11-14T14:03:15Z

> Thanks @lixiaolx, nice profiles! What you observe is not unexpected since the scheduling logic currently contends for the GIL with the IPC message serialization/deserialization.
> Our intention is to improve this very soon but doing the IPC work in a separate thread is still a big win as a first step since much of that work overlaps with parts of the critical loop that don't contend for the GIL, primarily the forward pass in the GPU.
>
> Thank you very much for your answer. I tried comparing against this solution. If the GIL problem is solved, the remaining gap time will be 2-3ms according to the above calculation. Are there any plans for asynchronous scheduling? Compared with sglang's asynchronous scheduling, there is still a gap. I recently measured that the overall gap of sglang's asynchronous solution under the same conditions is 200-300us. If you have a plan, is there a timeline?

@njhill, is there any plan for this asynchronous scheduling?

lixiaolx · 2024-11-14T14:03:29Z

> @lixiaolx the V1 path is still in an alpha state and does not yet support multiple GPUs, but will do soon.

OK, thank you

njhill · 2024-11-14T19:30:20Z

> @njhill ,Is there any arrangement for this asynchronous scheduling?

Not yet. Our plan is to optimize other aspects first, since it will be complex to combine this with certain other optimizations.

B-201 · 2024-11-28T02:47:18Z

> Hey @lixiaolx - thanks for taking a look. I am having a hard time understanding your analysis - could you clarify?
> @robertgshaw2-neuralmagic
> I compared the previous pr with your current pr, and did nsys analysis. I added nvtx to analyze the time overhead where the mainloop function is called, and split and analyzed the CPU overhead between the two forwards before and after the GPU.

Sorry to bother you, but I’d like to ask how you added nvtx to analyze the time overhead of these function calls?
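One common way to do this, not necessarily what was used for the profiles above, is to wrap each call of interest in an NVTX range so that it shows up as a named span on the nsys timeline. The sketch below assumes PyTorch's `torch.cuda.nvtx.range_push` / `range_pop`; the `profiled` helper and the `schedule` / `update_schedule` stubs are hypothetical stand-ins. It falls back to plain wall-clock timing when PyTorch or a CUDA device is not available.

```python
import contextlib
import time

try:
    import torch

    # Only emit NVTX ranges when a CUDA build with a visible GPU is present.
    _nvtx = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _nvtx = None


@contextlib.contextmanager
def profiled(name, timings):
    """Mark a named range for nsys and record wall-clock time as a fallback."""
    if _nvtx is not None:
        _nvtx.range_push(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start
        if _nvtx is not None:
            _nvtx.range_pop()


def schedule():
    # Hypothetical stand-in for the scheduler step.
    return sum(range(1000))


def update_schedule():
    # Hypothetical stand-in for post-forward bookkeeping.
    return sorted(range(1000), reverse=True)


timings = {}
for _ in range(3):
    with profiled("schedule", timings):
        schedule()
    with profiled("update_schedule", timings):
        update_schedule()

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us total")
```

Running such a script under `nsys profile` on a machine with a GPU should then show the named ranges alongside the CUDA kernels, which makes the CPU time between forward passes visible.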

robertgshaw2-neuralmagic added 30 commits October 26, 2024 22:01

prototype

8f8662e

revert spurious 2.5 changes

01c4ca8

stash

1ad8a48

cleanup

f9084f6

add MQLLMEnginev1

72bccd9

work with MQLLMEngine

a6cab52

format

885ed16

cleanup formatting

3ed66cf

revert exmple change

8ae8ce9

update comment

5c72515

formatting

f9b33fa

updated

82539b9

stash

d42a54e

format

3a2d02a

Merge branch 'main' into rs-prototype-2

6028ee1

update

6bd37c1

revert bind/connect

196d822

revert comment

a089cd1

formatting

974aa06

formatting tweaks

fe1e1b4

move detokenizer into engine

9c27fbb

format

95b5af1

stash

3999279

revert bad import

b4dd571

format

f01f992

format

be333fa

add files

aefb498

stash

6d7f473

update

f431f8a

update

be431e4

njhill added 2 commits November 11, 2024 13:28

Merge remote-tracking branch 'refs/remotes/origin/main' into rework-r…

8c47b3c

…s-proto # Conflicts: # vllm/v1/engine/llm_engine.py # vllm/v1/tokenizer/detokenizer.py

Address some minor review comments

7cb08b7

mergify bot removed the needs-rebase label Nov 11, 2024

WoosukKwon approved these changes Nov 11, 2024

robertgshaw2-neuralmagic enabled auto-merge (squash) November 11, 2024 22:19

robertgshaw2-neuralmagic merged commit 6ace6fb into vllm-project:main Nov 11, 2024
72 checks passed

DarkLight1337 mentioned this pull request Nov 13, 2024

[Bug]: The Qwen series models produce garbled output when generating long texts. #9825

Closed

1 task

ywang96 mentioned this pull request Nov 13, 2024

[V1] Add missing tokenizer options for Detokenizer #10288

Merged

lixiaolx mentioned this pull request Nov 25, 2024

[Feature]: Initial Idea and Design for Asynchronous Scheduling #10634

Open

1 task

ywang96 mentioned this pull request Dec 7, 2024

[V1] Initial support of multimodal models for V1 re-arch #10699

Merged

4 tasks

tlrmchlsmth mentioned this pull request Dec 19, 2024

[V1] TP Ray executor #11107

Merged
