[Core] Refactoring disaggregated prefilling/decoding using Mooncake Transfer Engine #10728
Currently, this PR is based on an early version of #8498. We plan to clean up the code and rebase it against the latest version soon. Apologies for triggering the review request prematurely.
The new version of the disaggregated prefill PR, #10502, has just been merged, so feel free to continue development on vLLM's main branch! API-wise, the new PR is pretty similar to the old one, so (hopefully) migrating the implementation is straightforward.
Can you provide a test example for running disaggregated prefill/decoding mode in the MooncakeDistributedPipe scenario?
You can refer to this doc to run a demo based on PR #8498. We are currently rebasing onto the main branch. The rebase is nearly done, but we will run more tests to ensure compatibility.
Thanks a lot
After the rebase, we have moved development to PR #10884.
Closing as superseded by #10884 |
This PR is related to #10727 and is a continuation of PR #8498. It uses Mooncake's Transfer Engine for KVCache transfer instead of NCCL.
Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see the documentation for its design and API list.
Compared with NCCL, Mooncake Transfer Engine has the following features:
Like the current implementation of PR #8498, there are two roles: KV provider (e.g. prefill vLLM instance) and KV consumer (e.g. decode vLLM instance)
- insert: insert a KV cache into a buffer, so that it can be transferred upon request
- drop_select: select a KV cache based on tokens, transfer the selected KV, and drop this KV from the buffer

The two roles run on different machines.
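The insert / drop_select semantics described above can be illustrated with a minimal in-process sketch. Everything here is an assumption made for illustration: `ToyKVLookupBuffer` is a hypothetical class, not the interface from PR #8498 or Mooncake's API, the KV cache is a plain list of floats rather than GPU tensors, matching is by exact token sequence, and no network transfer takes place.

```python
from typing import Dict, List, Optional, Tuple


class ToyKVLookupBuffer:
    """Hypothetical stand-in for the KV buffer; not the vLLM or Mooncake API."""

    def __init__(self) -> None:
        # Maps an exact token sequence to its (toy) KV cache payload.
        self._entries: Dict[Tuple[int, ...], List[float]] = {}

    def insert(self, tokens: List[int], kv: List[float]) -> None:
        # Provider side: park the KV cache until a consumer asks for it.
        self._entries[tuple(tokens)] = kv

    def drop_select(self, tokens: List[int]) -> Optional[List[float]]:
        # Consumer side: look up by tokens, hand the KV over, and remove it
        # from the buffer. Returns None when nothing matches.
        return self._entries.pop(tuple(tokens), None)

    def __len__(self) -> int:
        return len(self._entries)


buffer = ToyKVLookupBuffer()
buffer.insert([1, 2, 3], [0.1, 0.2])      # prefill side stores the KV
first = buffer.drop_select([1, 2, 3])     # decode side fetches it once
second = buffer.drop_select([1, 2, 3])    # a second fetch finds nothing
```

In the real system the provider and consumer are separate processes on separate machines, so the hand-off in `drop_select` would go through a transport layer (Mooncake Transfer Engine in this PR) instead of a local dictionary.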
Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration.md
Benchmark result: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md