[Core] Refactoring disaggregated prefilling/decoding using Mooncake Transfer Engine #10728

alogfans · 2024-11-28T00:29:46Z

This PR is related to #10727, and is a continuation of PR #8498. It uses Mooncake's Transfer Engine for KVCache transfer instead of NCCL.

Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see its documentation for the design and API list.

Compared with NCCL, Mooncake Transfer Engine has the following features:

a unified programming interface for DRAM-to-DRAM transfers (both local and remote), DRAM-to-GPU VRAM transfers (both local and remote), and DRAM-to-remote NVMe device transfers
support for the TCP, RDMA, and NVMe-oF protocols
topology-aware path selection (see our English doc, transfer_engine.md), aggregating bandwidth from multiple NICs

As in the current implementation of PR #8498, there are two roles: the KV provider (e.g. a prefill vLLM instance) and the KV consumer (e.g. a decode vLLM instance).

The provider side implements insert: insert a KV cache into a buffer, so that it can be transferred upon request
The consumer side implements drop_select: select a KV cache based on tokens, transfer the selected KV, and drop it from the buffer
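
The two operations above can be illustrated with a minimal single-process sketch. The class name `ToyKVLookupBuffer`, the exact-match lookup on token IDs, and the `bytes` payload are all assumptions made for illustration. The code in this PR moves tensors between machines through Mooncake Transfer Engine and is not reproduced here.

```python
from typing import List, Optional, Sequence, Tuple


class ToyKVLookupBuffer:
    """Hypothetical in-memory stand-in for the KV buffer (illustration only)."""

    def __init__(self) -> None:
        # Each entry pairs the token IDs of a request with its KV cache payload.
        self._entries: List[Tuple[Tuple[int, ...], bytes]] = []

    def insert(self, tokens: Sequence[int], kv: bytes) -> None:
        """Provider side: stage a KV cache so it can be handed out on request."""
        self._entries.append((tuple(tokens), kv))

    def drop_select(self, tokens: Sequence[int]) -> Optional[bytes]:
        """Consumer side: return the KV cache matching `tokens` and remove it."""
        key = tuple(tokens)
        for i, (stored_tokens, kv) in enumerate(self._entries):
            if stored_tokens == key:
                del self._entries[i]  # "drop": the entry is served only once
                return kv
        return None  # no matching entry in the buffer

    def __len__(self) -> int:
        return len(self._entries)


buffer = ToyKVLookupBuffer()
buffer.insert([1, 2, 3], b"kv-for-1-2-3")
first = buffer.drop_select([1, 2, 3])   # found, and removed from the buffer
second = buffer.drop_select([1, 2, 3])  # already dropped, so nothing is returned
```

In this sketch a selected entry is removed as soon as it is returned, which mirrors the "select, transfer, drop" order described for the consumer side.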

The two roles run on different machines.

Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration.md

Benchmark result: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md


mergify · 2024-11-29T03:28:59Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alogfans.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

stmatengss · 2024-11-29T03:56:18Z

Currently, this PR is based on the early version of #8498. We plan to clean up and rebase the code against the latest version soon. Apologies for triggering the request review prematurely.

mergify · 2024-12-02T01:02:18Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alogfans.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

KuntaiDu · 2024-12-02T01:09:00Z

The new version of the disaggregated prefill PR, #10502, has just been merged, so feel free to continue development on vLLM's main branch! API-wise, the new PR is pretty similar to the old one, so (hopefully) it is straightforward to migrate the implementation.

junna2016 · 2024-12-03T08:00:46Z

Can you provide a test example to run disaggregated prefill/decoding mode with MooncakeDistributedPipe scene?

ShangmingCai · 2024-12-03T08:13:39Z

> Can you provide a test example to run disaggregated prefill/decoding mode with MooncakeDistributedPipe scene?

You can refer to this doc to run a demo based on PR 8498. Currently, we are rebasing from the main branch. It is nearly done, but we will run more tests to ensure its compatibility.

junna2016 · 2024-12-03T08:19:40Z

> > Can you provide a test example to run disaggregated prefill/decoding mode with MooncakeDistributedPipe scene?
>
> You can refer to this doc to run a demo based on PR 8498. Currently, we are rebasing from the main branch. It is nearly done, but we will run more tests to ensure its compatibility.

Thanks a lot

ShangmingCai · 2024-12-04T03:34:53Z

After the rebase, we have moved development to PR #10884.

DarkLight1337 · 2024-12-16T02:27:17Z

Closing as superseded by #10884

KuntaiDu added 30 commits July 18, 2024 20:33

only log when rank%4==0

cc89bfb

bug fix

531bdf3

also only log when rank=4 in custom all reduce

1804656

add debuging statement around broadcast

81c8640

debug init_world_group

5ba142c

put the log inside a text file

cc939cf

init DISAGG first

8ac9266

init DISAGG before global

58849fa

put it behind world_size

08797e2

add more debug information in pynccl

4ff4cd6

typo fix

b09e4e6

more debug

583de97

more debug info

74bcfff

put every output

2175825

remove unnecessary sleep

3e07770

add sucess statement

a22e5cd

add debug statement

2c0c27d

log rank in success message

a783787

sleep based on rank to avoid message overlapping

79f0b06

increase torch debug level

b17f20f

sleep

025f209

set gloo debugging level to trace

32292f1

reduce debugging commands

389fb24

avoid initializing NCCL first

1b38b29

check

bb8c08a

locate the hanging line

25a7cf3

add rank to CPU group

999bd72

narrow case

3428ea6

bug fix: need to align the distributed groups between prefill and decode instances

91e3ed2

add disaggregated prefilling for flashinfer

3dd2275

stmatengss requested review from comaniac, simon-mo, robertgshaw2-neuralmagic, tlrmchlsmth, WoosukKwon, njhill, LiuXiaoxuanPKU, KuntaiDu, DarkLight1337, ywang96 and zhuohan123 as code owners November 29, 2024 03:28

mergify bot added the documentation and frontend labels Nov 29, 2024

mergify bot added the needs-rebase label Nov 29, 2024

Fix format to make ruff happy.

c1477fb

stmatengss force-pushed the mooncake-integration-patch branch from 7132c75 to c1477fb Compare November 29, 2024 03:43

mergify bot removed the needs-rebase label Nov 29, 2024

mergify bot added the needs-rebase label Dec 2, 2024

KuntaiDu mentioned this pull request Dec 2, 2024

[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818

Open

34 tasks

ShangmingCai mentioned this pull request Dec 4, 2024

[Core] Support disaggregated prefill with Mooncake Transfer Engine #10884

Merged

DarkLight1337 closed this Dec 16, 2024
