This document describes the latest integration of the Mooncake Transfer Engine with the vLLM project, based on PR 10502 and PR 10884 (vLLM version: v0.6.4.post1/main). It accelerates KVCache transfer in inter-node disaggregated Prefill/Decode scenarios. We have run experiments to obtain preview benchmark results, and more benchmark results will be released in due time.
Please note that this is still an experimental version and may be modified at any time based on feedback from the vLLM community.
Please install the Mooncake Transfer Engine according to the instructions first.
git clone git@github.com:vllm-project/vllm.git
cd vllm
pip3 uninstall vllm -y
pip3 install -e .
- If the build fails, try upgrading cmake with `pip3 install cmake --upgrade`.
- If you encounter any problems that you cannot solve, please refer to the vLLM official compilation guide.
- Prepare a mooncake.json file for both Prefill and Decode instances.
- You don't need to change the `prefill_url` and `decode_url` of the config file on the decode side. Please use the identical config file on both sides.
{
"prefill_url": "192.168.0.137:13003",
"decode_url": "192.168.0.139:13003",
"metadata_backend": "etcd",
"metadata_server": "192.168.0.139:2379",
"protocol": "rdma",
"device_name": "erdma_0"
}
- "prefill_url": The IP address and port of the Prefill node.
- The port in the URL is used to communicate with etcd server for metadata.
- "decode_url": The IP address and port of the Decode node.
- The port in the URL is used to communicate with etcd server for metadata.
- If you want to run the prefill instance and decode instance on the same node, please set up a different port for the `decode_url`. To avoid port conflicts, ensure that the port number differs by at least 50 from the port number in `prefill_url`. For example, "decode_url": "192.168.0.137:13103". Please note that if you set up the same URL for both instances, we will automatically add 100 to the port of the `decode_url`.
- "metadata_backend": Currently we support "etcd" and "redis" backends. If this parameter is absent, the mooncake transfer engine will use "etcd" automatically.
- "metadata_server": The etcd server of the mooncake transfer engine.
- "protocol": The protocol to be used for data transmission, either "rdma" or "tcp".
- "device_name": The device to be used for data transmission. It is required when "protocol" is set to "rdma". If multiple NIC devices are used, separate them with commas and no spaces, such as "erdma_0,erdma_1".
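The field rules above can be checked before launching either instance. The following is a minimal sketch, not part of Mooncake or vLLM: the function name `check_mooncake_config` and the choice of which keys count as required are assumptions made for illustration, and URLs are assumed to use the `host:port` form shown in the example.

```python
import json

# Assumption: these keys appear in both example configs, so treat them as required.
REQUIRED_KEYS = ("prefill_url", "decode_url", "metadata_server", "protocol")


def check_mooncake_config(cfg):
    """Return a list of problems found in a mooncake.json dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    if problems:
        return problems

    # "metadata_backend" may be absent; the guide says "etcd" is then used.
    backend = cfg.get("metadata_backend", "etcd")
    if backend not in ("etcd", "redis"):
        problems.append(f"unsupported metadata_backend: {backend}")

    protocol = cfg["protocol"]
    if protocol not in ("rdma", "tcp"):
        problems.append(f"unsupported protocol: {protocol}")

    if protocol == "rdma":
        devices = cfg.get("device_name", "")
        if not devices:
            problems.append("device_name is required when protocol is rdma")
        elif " " in devices:
            problems.append("device_name must not contain spaces")

    # Same-node setups: distinct ports should be at least 50 apart.
    # Identical URLs are not flagged, since the guide says 100 is then
    # added to the decode port automatically.
    p_host, p_port = cfg["prefill_url"].rsplit(":", 1)
    d_host, d_port = cfg["decode_url"].rsplit(":", 1)
    gap = abs(int(p_port) - int(d_port))
    if p_host == d_host and 0 < gap < 50:
        problems.append("same-node ports should differ by at least 50")

    return problems


example = json.loads("""
{
  "prefill_url": "192.168.0.137:13003",
  "decode_url": "192.168.0.139:13003",
  "metadata_backend": "etcd",
  "metadata_server": "192.168.0.139:2379",
  "protocol": "rdma",
  "device_name": "erdma_0"
}
""")
print(check_mooncake_config(example))
```

Running such a check on both nodes before startup is one way to catch a stray space in `device_name` or a too-small port gap early. The engine's own validation may differ.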
- Prepare a mooncake.json file for both Prefill and Decode instances. To use TCP instead of RDMA, set "protocol" to "tcp" and leave "device_name" empty.
{
"prefill_url": "192.168.0.137:13003",
"decode_url": "192.168.0.139:13003",
"metadata_backend": "etcd",
"metadata_server": "192.168.0.139:2379",
"protocol": "tcp",
"device_name": ""
}
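The two example files differ only in "protocol" and "device_name", so the TCP variant can be derived from the RDMA one. A small sketch follows. The helper name `to_tcp_config` is hypothetical and only illustrates the difference between the two files.

```python
import json


def to_tcp_config(rdma_cfg):
    """Return a copy of the config switched to TCP with no RDMA device."""
    tcp_cfg = dict(rdma_cfg)          # shallow copy; the input is left untouched
    tcp_cfg["protocol"] = "tcp"
    tcp_cfg["device_name"] = ""       # no NIC device list is given for TCP
    return tcp_cfg


rdma_cfg = {
    "prefill_url": "192.168.0.137:13003",
    "decode_url": "192.168.0.139:13003",
    "metadata_backend": "etcd",
    "metadata_server": "192.168.0.139:2379",
    "protocol": "rdma",
    "device_name": "erdma_0",
}

tcp_cfg = to_tcp_config(rdma_cfg)
print(json.dumps(tcp_cfg, indent=2))
```

The printed JSON can be saved as mooncake.json and used unchanged on both the prefill and decode nodes.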
- Please change the IP addresses and ports in the following guide according to your environment.
# Begin from `root` of your cloned repo!
# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
# You may need to terminate other etcd processes before running the above command
# 2. Run on the prefilling side (producer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'
# 3. Run on the decoding side (consumer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":2e9}'
- `MOONCAKE_CONFIG_PATH` is the path to the mooncake.json configuration file.
- `VLLM_USE_MODELSCOPE` is optional. If you have access to Hugging Face, please remove it.
- The `--model` parameter specifies the model to use.
- The `--port` parameter specifies the port on which the vLLM service listens.
- The `--max-model-len` parameter specifies the maximum context length of the model.
- Option `--tensor_parallel_size` \ `-tp` is now supported, but you need to set `--enforce_eager` to disable CUDA graph. Example: append `-tp 2 --enforce_eager` to the run command.
  - If you want to run the prefill instance and decode instance on the same node, please set different `CUDA_VISIBLE_DEVICES` values. For example, `CUDA_VISIBLE_DEVICES=0,1` for the prefill instance and `CUDA_VISIBLE_DEVICES=2,3` for the decode instance.
- The `--kv-transfer-config` parameter specifies the connector and its config to be used.
  - Please set `kv_connector` to `MooncakeConnector`.
  - `kv_role` is the node's role, either 'kv_producer' or 'kv_consumer'.
  - `kv_rank` is the rank of the instance. Currently, `kv_producer`'s rank is 0 and `kv_consumer`'s rank is 1.
  - `kv_parallel_size` is currently fixed to 2.
  - `kv_buffer_size` is the size of the KVCache lookup buffer. If the average `input_len` of the prompt is large, please increase the buffer size. If OOM still occurs, please decrease the ratio of `--gpu-memory-utilization`.
  - `kv_ip` and `kv_port` specify the IP address and port of the master node for the "PyNcclConnector" distributed setup. They are currently not used by "MooncakeConnector", which uses a config file to set up the distributed connection instead, so you don't need to set them.
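Because the two `--kv-transfer-config` values differ only in role and rank, they can be generated rather than typed by hand. The sketch below assumes nothing beyond the fields listed above. The helper name `kv_transfer_config` is hypothetical, and `shlex.quote` is used only to make the JSON safe to paste into a shell command.

```python
import json
import shlex

# Per the guide: the producer currently has rank 0 and the consumer rank 1.
RANKS = {"kv_producer": 0, "kv_consumer": 1}


def kv_transfer_config(role, buffer_size=2e9):
    """Build the JSON string passed to --kv-transfer-config for one role."""
    if role not in RANKS:
        raise ValueError(f"kv_role must be one of {sorted(RANKS)}")
    config = {
        "kv_connector": "MooncakeConnector",
        "kv_role": role,
        "kv_rank": RANKS[role],
        "kv_parallel_size": 2,          # currently fixed to 2
        "kv_buffer_size": buffer_size,  # increase for long average prompts
    }
    return json.dumps(config, separators=(",", ":"))


for role in ("kv_producer", "kv_consumer"):
    # shlex.quote wraps the JSON in single quotes for use on a command line.
    print("--kv-transfer-config", shlex.quote(kv_transfer_config(role)))
```

Generating both strings from one place keeps the connector name and buffer size identical on the prefill and decode sides.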
# 4. Start the proxy server on one node (Let's take the prefill node as an example)
python3 proxy_server.py
The implementation of proxy_server.py
import os
import sys
import traceback

import aiohttp
from quart import Quart, make_response, request

# Generous timeout (6 hours) so long generations are not cut off.
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

app = Quart(__name__)


async def forward_request(url, data):
    """POST `data` to `url` and yield the response body in 1 KiB chunks."""
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }
        async with session.post(url=url, json=data,
                                headers=headers) as response:
            if response.status == 200:
                async for chunk_bytes in response.content.iter_chunked(1024):
                    yield chunk_bytes


@app.route('/v1/completions', methods=['POST'])
async def handle_request():
    try:
        original_request_data = await request.get_json()

        prefill_request = original_request_data.copy()
        # Set max_tokens = 1 so the prefill instance only does prefill.
        prefill_request['max_tokens'] = 1

        # Wait for prefill to finish; its output is discarded.
        async for _ in forward_request('http://localhost:8100/v1/completions',
                                       prefill_request):
            continue

        # Stream the decode result back to the client.
        # Be sure to change the IP address for your machine.
        generator = forward_request('http://192.168.0.139:8200/v1/completions',
                                    original_request_data)
        response = await make_response(generator)
        response.timeout = None

        return response

    except Exception as e:
        print("Error occurred in disagg prefill proxy server")
        print(e)
        print("".join(traceback.format_exception(*sys.exc_info())))
        # Return an explicit error so the route never returns None.
        return "Error occurred in disagg prefill proxy server", 500


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000)
Be sure to change the IP address in the code.
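The core of the proxy is a two-step flow: send a copy of the request with `max_tokens` set to 1 to the prefill instance, then send the untouched request to the decode instance. The sketch below isolates that step and shows one way to avoid editing the IP address in the code. The environment variable names `PREFILL_URL` and `DECODE_URL` and the helper `make_prefill_request` are assumptions for illustration and are not read by the original script.

```python
import os

# Hypothetical environment variables; the defaults mirror the hard-coded URLs.
PREFILL_URL = os.environ.get(
    "PREFILL_URL", "http://localhost:8100/v1/completions")
DECODE_URL = os.environ.get(
    "DECODE_URL", "http://192.168.0.139:8200/v1/completions")


def make_prefill_request(original):
    """Return a copy of the request that makes the server do prefill only."""
    prefill = original.copy()     # shallow copy; the original is not modified
    prefill["max_tokens"] = 1     # one token is enough to trigger the prefill
    return prefill


request_body = {
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "San Francisco is a",
    "max_tokens": 1000,
}
prefill_body = make_prefill_request(request_body)
print(prefill_body["max_tokens"], request_body["max_tokens"])  # prints: 1 1000
```

Copying the request matters here: the decode instance must still receive the caller's original `max_tokens`.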
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 1000
}'
- If you are not testing on the proxy server, please change `localhost` to the IP address of the proxy server.
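The same test request can be sent from Python using only the standard library. This is a sketch that assumes the proxy listens on port 8000 as configured above and returns the decode instance's JSON body unchanged for a non-streaming request. The function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(proxy_host="localhost", proxy_port=8000,
                             prompt="San Francisco is a", max_tokens=1000):
    """Build (but do not send) a POST request for the proxy's completions route."""
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"http://{proxy_host}:{proxy_port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req, timeout=600):
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the proxy and both vLLM instances to be running):
#   result = send(build_completion_request(proxy_host="192.168.0.137"))
```

When calling from another machine, pass the proxy server's IP address as `proxy_host` instead of `localhost`.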