
prefill is OK, but decode is stuck #8

Closed
artetaout opened this issue Nov 29, 2024 · 18 comments

Comments

@artetaout

The prefill instance has already sent the KV cache:

[screenshot: prefill log]

but the decoder is stuck in drop_select:

[screenshot: decode log]

Did I miss something?

@alogfans
Collaborator

You can try to check the connectivity between the two nodes (the config files differ between TCP and RDMA mode). The "KV send DONE" message only implies that the KVCache entry has been submitted, not that it has been delivered to the remote side.
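For a quick reachability check, a plain TCP probe of the endpoints listed in mooncake.json can rule out basic network problems. This is only a generic sketch (not part of Mooncake), and a port only reports as reachable while the corresponding process is actually listening:

    import json
    import socket

    # Try a plain TCP connect to each endpoint listed in mooncake.json.
    with open("mooncake.json") as f:
        conf = json.load(f)

    for key in ("metadata_server", "prefill_url", "decode_url"):
        host, port = conf[key].rsplit(":", 1)
        try:
            with socket.create_connection((host, int(port)), timeout=3):
                print(f"{key} {conf[key]}: reachable")
        except OSError as exc:
            print(f"{key} {conf[key]}: not reachable ({exc})")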

@artetaout
Author

Thanks for the reply, but I run on a single node. Does this mean that the prefiller works well with etcd, while something is wrong between the decoder and etcd? How can I check that?

Could it be that etcd itself is not running well? I started etcd with the command from the docs and tested it successfully with the etcd client API.
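One simple way to confirm that etcd is reachable from the machine the vLLM instances run on is to hit its HTTP /version endpoint on the client port (a minimal sketch, assuming the client URL from the docs, http://localhost:2379):

    import urllib.request

    # etcd answers plain HTTP on its client URL; a successful response here means
    # the client port is up and reachable from this machine.
    with urllib.request.urlopen("http://localhost:2379/version", timeout=3) as resp:
        print(resp.read().decode())  # e.g. {"etcdserver":"3.5.0","etcdcluster":"3.5.0"}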

@alogfans
Collaborator

alogfans commented Dec 2, 2024

The problem may be caused by an incorrect configuration (e.g., the mooncake.json file or the environment variables). You can recheck the following (a small validation sketch follows below):

  • In the mooncake.json file, prefill_url and decode_url should match the environment variables VLLM_HOST_IP="192.168.0.137" VLLM_PORT="51000" on both the prefill and decode side. Use different ports if you run both on one machine.
  • protocol should be set to tcp.
  • The mooncake.json file should usually be identical on both the prefill and decode side (there is no need to swap the prefill_url and decode_url fields).

If the above steps do not solve your problem, you can try to run our Transfer Engine Bench with --protocol=tcp.
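As a convenience, the checks above can be made mechanical with a small script run on both sides before starting vLLM. This is only a sketch under the assumptions used in this thread (MOONCAKE_CONFIG_PATH, VLLM_HOST_IP); adjust it to your setup:

    import json
    import os

    # Load the same config file that vLLM will read.
    with open(os.environ.get("MOONCAKE_CONFIG_PATH", "mooncake.json")) as f:
        conf = json.load(f)

    prefill_host, prefill_port = conf["prefill_url"].rsplit(":", 1)
    decode_host, decode_port = conf["decode_url"].rsplit(":", 1)

    # Protocol sanity check (in this thread, tcp uses an empty device_name, rdma sets one).
    assert conf["protocol"] in ("tcp", "rdma"), conf["protocol"]
    if conf["protocol"] == "rdma":
        assert conf.get("device_name"), "RDMA mode requires a device_name"

    # On a single machine the two sides must not share the same port.
    if prefill_host == decode_host:
        assert prefill_port != decode_port, "prefill_url and decode_url ports must differ"

    # VLLM_HOST_IP should refer to one of the configured hosts.
    host_ip = os.environ.get("VLLM_HOST_IP")
    if host_ip:
        assert host_ip in (prefill_host, decode_host), f"VLLM_HOST_IP={host_ip} not in mooncake.json"

    print("mooncake.json looks consistent")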

@artetaout
Author

I run on one single node; can you help me correct it? The IP and port settings must be wrong somewhere. Here are the commands I used:

  • the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
  • the mooncake.json
{
  "prefill_url": "192.168.0.2:13002",
  "decode_url": "192.168.0.2:14002",
  "metadata_server": "192.168.0.2:2379",
  "protocol": "tcp",
  "device_name": ""
}
  • the prefill command
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0 HF_ENDPOINT=https://hf-mirror.com VLLM_HOST_IP="192.168.0.2" VLLM_PORT="51000" MASTER_ADDR="192.168.0.2" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.95
  • the decode command
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=1 HF_ENDPOINT=https://hf-mirror.com VLLM_HOST_IP="192.168.0.2" VLLM_PORT="51000" MASTER_ADDR="192.168.0.2" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.95

@artetaout
Author

And the Transfer Engine Bench works well:

transfer_engine_bench --mode=initiator --metadata_server=192.168.0.2:2379 --local_server_name=192.168.0.2:12346 --segment_id=192.168.0.2:12345 --protocol=tcp
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1202 06:36:55.500823  3108 transfer_engine_bench.cpp:182] Worker 3 stopped!
I1202 06:36:55.500823  3106 transfer_engine_bench.cpp:182] Worker 1 stopped!
I1202 06:36:55.503688  3107 transfer_engine_bench.cpp:182] Worker 2 stopped!
I1202 06:36:55.503718  3105 transfer_engine_bench.cpp:182] Worker 0 stopped!
I1202 06:36:55.503857  3100 transfer_engine_bench.cpp:293] Test completed: duration 10.00, batch count 5700, throughput 0.30

@pansicheng

pansicheng commented Dec 2, 2024

I encountered a similar issue and found that this code (https://github.com/kvcache-ai/vllm/blob/9c319eee04652df9be39377378fb569a6762935e/vllm/distributed/kv_transfer/kv_pipe/mooncake_distributed_pipe.py#L86) was causing the prefill and decode sender/receiver sockets to fail to connect on a single node. I modified it as follows:


    def _setup_sockets(self, rank_in_group: int, host: str, port: str) -> None:
        """Set up ZeroMQ sockets for sending and receiving data."""
        if rank_in_group == 0:
            # Rank 0: data and ack sockets live on port+1 .. port+4.
            self.sender_socket.bind(f"tcp://*:{int(port) + 1}")
            self.receiver_socket.connect(f"tcp://{host}:{int(port) + 2}")
            self.sender_ack.connect(f"tcp://{host}:{int(port) + 3}")
            self.receiver_ack.bind(f"tcp://*:{int(port) + 4}")
        else:
            # Rank 1: mirror the same four sockets on port-4 .. port-1, so the two
            # sides meet on identical ports when prefill_port = decode_port + 5.
            self.receiver_socket.connect(f"tcp://{host}:{int(port) - 4}")
            self.sender_socket.bind(f"tcp://*:{int(port) - 3}")
            self.receiver_ack.bind(f"tcp://*:{int(port) - 2}")
            self.sender_ack.connect(f"tcp://{host}:{int(port) - 1}")

Additionally, I configured the mooncake.json as follows (prefill_port = decode_port + 5):

{
    "metadata_server": "127.0.0.1:2379",
    "prefill_url": "127.0.0.1:31287",
    "decode_url": "127.0.0.1:31282",
    "protocol": "rdma",
    "device_name": "erdma_1"
}

The above changes resolved the issue.
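As a side note, the prefill_port = decode_port + 5 convention above is pure port arithmetic: the branch that offsets its base port by +1..+4 and the branch that offsets by -4..-1 land on the same four TCP ports exactly when the two base ports differ by 5 (assuming the +1..+4 branch ends up using the decode port as its base and the -4..-1 branch the prefill port, which is the pairing that makes this patch work). A quick check with the ports from the config above:

    # The +1..+4 branch (base 31282) and the -4..-1 branch (base 31287) must
    # resolve to the same four ports for every bind() to have a matching connect().
    decode_port = 31282
    prefill_port = 31287  # decode_port + 5

    plus_side = [decode_port + i for i in (1, 2, 3, 4)]
    minus_side = [prefill_port + i for i in (-4, -3, -2, -1)]
    assert plus_side == minus_side == [31283, 31284, 31285, 31286]
    print(plus_side)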

@ShangmingCai
Collaborator

> [quoted: @pansicheng's _setup_sockets modification and mooncake.json (prefill_port = decode_port + 5) from the comment above]

Thanks for digging into this. This code was originally written to run the inter-node disaggregated prefill demo by default, so we did not consider the port-occupation problem of running multiple instances on the same node; this will be solved in the future. FYI, if you want to run a disaggregated prefill demo on the same node, you can try vllm-project/vllm#10502, which has already been merged into the main branch of vLLM.

@liweiqing1997

> [quoted: @pansicheng's workaround above]
>
> [quoted: @ShangmingCai's reply above recommending vllm-project/vllm#10502 for single-node disaggregated prefill]

Hello, could you help me check whether this configuration is correct? When I run the following configuration on a single machine, the decode process gets blocked at the "Initializing an LLM engine" step.

{
  "prefill_url": "127.0.0.1:31287",
  "decode_url": "127.0.0.1:31282",
  "metadata_server": "127.0.0.19:2379",
  "protocol": "tcp",
  "device_name": ""
}

CUDA_VISIBLE_DEVICES=4 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22301" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8200 --max-model-len 32000 --gpu-memory-utilization 0.95 > log/decode.log 2>&1 &

CUDA_VISIBLE_DEVICES=3 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22300" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8100 --max-model-len 16000 --gpu-memory-utilization 0.95 > log/producer.log 2>&1 &

Could you please tell me whether the VLLM_PORT settings for prefill and decode on a single machine need to be different?

@artetaout
Author

artetaout commented Dec 3, 2024

> [quoted: the exchange above between @pansicheng, @ShangmingCai, and @liweiqing1997]

The VLLM_PORT and VLLM_HOST_IP need to be the same on both sides, because they are used to initialize the process group; otherwise you will get stuck. And please add more debug logging (as in the commands I listed earlier).
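The reason mismatched values hang both sides is the usual distributed rendezvous behaviour: each process waits at the group-initialization address for its peer, so if the addresses differ, neither ever arrives. A rough illustration with torch.distributed (not the exact vLLM code; the address below is just the VLLM_HOST_IP:VLLM_PORT pair from the commands above):

    import torch.distributed as dist

    # Producer and consumer must join the group at the same rendezvous address;
    # only the rank differs between the two processes.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:22300",  # same VLLM_HOST_IP:VLLM_PORT on both sides
        world_size=2,
        rank=0,  # 0 on one side, 1 on the other
    )
    dist.barrier()  # blocks until both ranks have joined
    dist.destroy_process_group()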

@liweiqing1997

liweiqing1997 commented Dec 4, 2024

> [quoted: the exchange above between @pansicheng, @ShangmingCai, @liweiqing1997, and @artetaout]

Thank you, the service can be started now, but I encountered another issue: after serving N requests (input_len=512, output_len=128, qps=0.5), a bug occurred on the decode node; the prefill node is OK.


> 
> INFO 12-04 06:12:25 logger.py:37] Received request cmpl-e1ba5c1d38a24d19bda01cdc83ae7b4f-0: prompt: '骉恝珪珛珹琊玼珖�𪟝珽珦珫珒𬍤珢珕珝𫭼埗垾垺埆垿埌埇莰茝𬜯鄀莶莝䓖莙栻桠�𬂩桄梠栴梴栒酎酏𫠆砵砠砫砬硁恧翃郪�𨐈辀辁�𬌗剕赀哢晅晊唝哳哱冔晔晐畖蚄蚆�𫑡帱崁峿𪨶崄帨崀赆𬬸钷𬬻𬬹𬬿𬭁眚甡笫倻倴脩倮倕倞�𫢸倓倧衃虒舭舯舥瓞鬯鸰脎朓胲虓鱽狴峱狻眢𫗧勍痄疰痃竘羖羓桊敉烠烔烶烻𬊈涍浡浭浬涄涢涐浰浟浛浼浲涘悈悃悢𬒈宧窅窊窎扅扆袪袗袯祧隺堲疍�𨺙陴烝砮㛚哿翀翂剟𬳿𫄨绤骍𬘫�䂮琎珸珵琄琈琀珺掭堎堐埼掎埫堌晢�𫮃掞埪壸㙍聍菝萚菥莿䓫勚䓬萆菂菍菼萣䓨菉䓛梼梽桲梾桯梣梌桹敔厣硔鿎硙硚硊硍勔䴕龁逴唪啫翈�㫰晙畤𬱖趼跂蛃蚲𬟽蚺啴䎃崧崟崞崒崌崡铏𫓯𫟹铕𫟼铖铘铚铞铥铴牻牿稆笱笯偰偡鸺偭偲偁�㿠鄅偓徛衒舳舲鸼悆鄃瓻�䝙脶脞脟䏲鱾猇猊猄觖�𠅤庱庼庳痓䴔竫堃阌羝羕焆烺焌淏𬇹淟淜淴淯湴涴𬍡�㥄惛惔悰惙寁逭𬤇𫍯袼裈祲𬤊𫍲谞艴弸弶𬯎隃婞娵婼媖婳婍婌婫婤婘婠𬘬𬘭𬴂𫘦绹𫟅𬘯骕𫘧絜珷琲琡琟琔琭堾堼揕㙘堧喆堨塅堠絷�𪣻�𡎚葜惎萳葙靬葴蒇蒈鄚蒉蓇萩葰葎鄑蒎葖蒄萹棤棽棫椓椑�𬃊鹀椆棓棬棪椀楗�𬷕甦酦觌奡皕硪欹詟𫐐辌棐龂�𬹼黹牚睎晫晪晱��𧿹蛑畯斝喤崶嵁�𫶇崾嵅崿嵚翙𫖮圌圐赑赒鿏铹𬭊铽𨱇𫓶锊锍锎𬭎锓犇颋稌筀筘筜筥筅傃傉翛傒傕舾畬𫖯脿腘�䐃腙腒𬱟鲃猰�𫛭猯�㺄馉凓鄗', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=['<endendend>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [122560, 122561, 122562, 122563, 122564, 122565, 122566, 122567, 5691, 122569, 122570, 122571, 122572, 122573, 122574, 122575, 122576, 122577, 122578, 122579, 122580, 122581, 122582, 122583, 122584, 122585, 122586, 122587, 122588, 122589, 122590, 122591, 122592, 122593, 122594, 122595, 5691, 122597, 122598, 122599, 122600, 122601, 122602, 122603, 122604, 122605, 122606, 122607, 122608, 122609, 122610, 122611, 122612, 122613, 5691, 122615, 122616, 122617, 5691, 122619, 122620, 122621, 122622, 122623, 122624, 122625, 122626, 122627, 122628, 122629, 122630, 122631, 122632, 122633, 5691, 122635, 122636, 122637, 122638, 122639, 122640, 122641, 122642, 122643, 122644, 122645, 122646, 122647, 122648, 122649, 122650, 122651, 122652, 122653, 122654, 122655, 122656, 122657, 122658, 5691, 122660, 122661, 122662, 122663, 122664, 122665, 122666, 122667, 122668, 122669, 122670, 122671, 122672, 122673, 122674, 122675, 122676, 122677, 122678, 122679, 122680, 122681, 122682, 122683, 122684, 122685, 122686, 122687, 122688, 122689, 122690, 122691, 122692, 122693, 122694, 122695, 122696, 122697, 122698, 122699, 122700, 122701, 122702, 122703, 122704, 122705, 122706, 122707, 122708, 122709, 122710, 122711, 122712, 122713, 122714, 122715, 122716, 122717, 122718, 122719, 122720, 122721, 122722, 122723, 122724, 5691, 122726, 122727, 122728, 122729, 122730, 122731, 122732, 122733, 122734, 122735, 122736, 122737, 122738, 122739, 5691, 122741, 122742, 122743, 122744, 122745, 122746, 122747, 122748, 122749, 122750, 122751, 122752, 122753, 122754, 122755, 122756, 5691, 122758, 122759, 122760, 122761, 122762, 122763, 122764, 122765, 122766, 122767, 122768, 122769, 122770, 122771, 122772, 122773, 122774, 122775, 122776, 122777, 122778, 122779, 122780, 122781, 122782, 122783, 122784, 122785, 122786, 122787, 122788, 122789, 122790, 122791, 122792, 122793, 122794, 122795, 122796, 122797, 122798, 122799, 122800, 122801, 5691, 122803, 122804, 122805, 122806, 122807, 122808, 122809, 122810, 122811, 122812, 122813, 122814, 122815, 122816, 122817, 122818, 122819, 122820, 122821, 122822, 122823, 122824, 122825, 122826, 122827, 122828, 122829, 122830, 122831, 122832, 122833, 122834, 122835, 122836, 122837, 122838, 122839, 122840, 122841, 122842, 5691, 122844, 122845, 122846, 122847, 122848, 122849, 122850, 122851, 122852, 122853, 122854, 5691, 122856, 
122857, 122858, 122859, 122860, 122861, 122862, 122863, 122864, 122865, 5691, 122867, 122868, 122869, 122870, 122871, 122872, 122873, 122874, 122875, 122876, 122877, 122878, 122879, 122880, 122881, 122882, 122883, 122884, 122885, 122886, 122887, 122888, 122889, 5691, 122891, 122892, 122893, 122894, 122895, 122896, 122897, 122898, 122899, 122900, 122901, 122902, 122903, 122904, 122905, 122906, 122907, 122908, 122909, 122910, 122911, 122912, 122913, 122914, 122915, 122916, 122917, 122918, 122919, 122920, 122921, 122922, 122923, 122924, 122925, 122926, 122927, 122928, 122929, 122930, 122931, 122932, 122933, 122934, 122935, 122936, 122937, 122938, 122939, 122940, 122941, 122942, 122943, 122944, 122945, 122946, 122947, 5691, 122949, 5691, 122951, 122952, 122953, 122954, 122955, 122956, 122957, 122958, 122959, 122960, 122961, 122962, 122963, 122964, 122965, 122966, 122967, 122968, 122969, 122970, 122971, 122972, 122973, 122974, 122975, 5691, 122977, 122978, 122979, 122980, 122981, 122982, 122983, 122984, 5691, 122986, 122987, 122988, 122989, 122990, 122991, 122992, 122993, 122994, 122995, 122996, 122997, 122998, 5691, 123000, 123001, 123002, 123003, 123004, 123005, 123006, 9973, 123009, 123010, 123011, 123012, 123013, 123014, 123015, 5691, 123017, 123018, 123019, 123020, 123021, 123022, 123023, 123024, 123025, 123026, 123027, 123028, 123029, 123030, 123031, 123032, 123033, 123034, 123035, 123036, 123037, 123038, 123039, 123040, 123041, 123042, 123043, 123044, 123045, 123046, 123047, 123048, 123049, 123050, 123051, 123052, 123053, 123054, 123055, 123056, 5691, 123058, 123059, 123060, 123061, 123062, 123063, 5691, 123065, 123066, 5691, 123068, 123069, 123070, 123071], lora_request: None, prompt_adapter_request: None.
> INFO:     127.0.0.1:48480 - "POST /v1/completions HTTP/1.1" 200 OK
> INFO 12-04 06:12:27 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241204-061227.pkl...
> WARNING 12-04 06:12:27 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
> ERROR 12-04 06:12:27 engine.py:157] UnpicklingError("Error in model execution: invalid load key, 'H'.")
> ERROR 12-04 06:12:27 engine.py:157] Traceback (most recent call last):
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
> ERROR 12-04 06:12:27 engine.py:157]     return func(*args, **kwargs)
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/worker/model_runner.py", line 1656, in execute_model
> ERROR 12-04 06:12:27 engine.py:157]     get_disagg_group().recv_kv_caches_and_hidden_states(
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/distributed/kv_transfer/vllm_adapter.py", line 297, in recv_kv_caches_and_hidden_states
> ERROR 12-04 06:12:27 engine.py:157]     ret = self.recv_buffer.drop_select(
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 186, in drop_select
> ERROR 12-04 06:12:27 engine.py:157]     value = self.data_pipe.recv_tensor()
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/distributed/kv_transfer/kv_pipe/mooncake_distributed_pipe.py", line 241, in recv_tensor
> ERROR 12-04 06:12:27 engine.py:157]     tensor = self.transport_thread.submit(self._recv_impl).result()
> ERROR 12-04 06:12:27 engine.py:157]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
> ERROR 12-04 06:12:27 engine.py:157]     return self.__get_result()
> ERROR 12-04 06:12:27 engine.py:157]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
> ERROR 12-04 06:12:27 engine.py:157]     raise self._exception
> ERROR 12-04 06:12:27 engine.py:157]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
> ERROR 12-04 06:12:27 engine.py:157]     result = self.fn(*self.args, **self.kwargs)
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/distributed/kv_transfer/kv_pipe/mooncake_distributed_pipe.py", line 227, in _recv_impl
> ERROR 12-04 06:12:27 engine.py:157]     return pickle.loads(data)
> ERROR 12-04 06:12:27 engine.py:157] _pickle.UnpicklingError: invalid load key, 'H'.
> ERROR 12-04 06:12:27 engine.py:157] 
> ERROR 12-04 06:12:27 engine.py:157] The above exception was the direct cause of the following exception:
> ERROR 12-04 06:12:27 engine.py:157] 
> ERROR 12-04 06:12:27 engine.py:157] Traceback (most recent call last):
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/engine/multiprocessing/engine.py", line 155, in start
> ERROR 12-04 06:12:27 engine.py:157]     self.run_engine_loop()
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
> ERROR 12-04 06:12:27 engine.py:157]     request_outputs = self.engine_step()
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
> ERROR 12-04 06:12:27 engine.py:157]     raise e
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
> ERROR 12-04 06:12:27 engine.py:157]     return self.engine.step()
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/engine/llm_engine.py", line 1387, in step
> ERROR 12-04 06:12:27 engine.py:157]     outputs = self.model_executor.execute_model(
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/executor/gpu_executor.py", line 136, in execute_model
> ERROR 12-04 06:12:27 engine.py:157]     output = self.driver_worker.execute_model(execute_model_req)
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/worker/worker_base.py", line 327, in execute_model
> ERROR 12-04 06:12:27 engine.py:157]     output = self.model_runner.execute_model(
> ERROR 12-04 06:12:27 engine.py:157]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
> ERROR 12-04 06:12:27 engine.py:157]     return func(*args, **kwargs)
> ERROR 12-04 06:12:27 engine.py:157]   File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/mooncake/mooncake-integration/vllm/vllm/worker/model_runner_base.py", line 146, in _wrapper
> ERROR 12-04 06:12:27 engine.py:157]     raise type(err)(f"Error in model execution: "
> ERROR 12-04 06:12:27 engine.py:157] _pickle.UnpicklingError: Error in model execution: invalid load key, 'H'.
> ERROR:    Exception in ASGI application

My startup parameters are set as follows. My code comes from this branch: https://github.com/kvcache-ai/vllm/tree/mooncake-integration. It runs on two NVIDIA A100 GPUs within a single machine.


> 1. Run the etcd server
> nohup etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379  > log/etcd.log 2>&1 &
>
> 2. Run on the prefilling side (producer role)
> CUDA_VISIBLE_DEVICES=5 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22300" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8010 --max-model-len 16000 --gpu-memory-utilization 0.7 > log/producer.log 2>&1 &
>
> 3. Run on the decoding side (consumer role)
> CUDA_VISIBLE_DEVICES=6 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22300" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer VLLM_USE_MODELSCOPE=True  nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8020 --max-model-len 16000 --gpu-memory-utilization 0.7 > log/decode.log 2>&1 &
>
> 4. Run the proxy server
> nohup python3 proxy_server.py > log/proxy_server_port_9032.log 2>&1 &
>
> mooncake.json is:

{
"prefill_url": "127.0.0.1:31287",
"decode_url": "127.0.0.1:31282",
"metadata_server": "127.0.0.19:2379",
"protocol": "tcp",
"device_name": ""
}

@ShangmingCai
Collaborator

> [quoted: @liweiqing1997's startup commands and mooncake.json from the comment above]

Hello, thank you for trying this PR. We have released a nightly version based on the main branch of vLLM, which also addresses the port conflicts and supports TP. You can give it a try to see whether this problem still occurs.

Also, I have noticed that you use a strange prompt (the one quoted above, which contains � replacement characters). There may be a byte-encoding problem in your proxy server, which makes pickle fail.
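For reference, "invalid load key, 'H'" is what pickle raises when the first byte of the buffer it receives is an ASCII 'H' rather than a pickle opcode, for example if plain text such as an HTTP response ended up on the data socket instead of a pickled tensor. A minimal illustration (not vLLM code, just the standard library):

    import pickle

    # Reproduces the error class from the decode-side traceback above: the first
    # byte 'H' (0x48) is not a valid pickle opcode, so loads() rejects the buffer.
    try:
        pickle.loads(b"HTTP/1.1 400 Bad Request")
    except pickle.UnpicklingError as exc:
        print(exc)  # invalid load key, 'H'.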

@liweiqing1997

> [quoted: @liweiqing1997's setup above and @ShangmingCai's reply about the nightly version and the garbled prompt]

Thank you for your reply; you are right.

@artetaout
Author

> [quoted: @liweiqing1997's setup above and @ShangmingCai's reply about the nightly version and the garbled prompt]

Hi, do you have a plan to support XpYd based on the current version? As far as we can see, the vLLM PR's roadmap includes it.

@ShangmingCai
Collaborator

> Hi, do you have a plan to support XpYd based on the current version? As far as we can see, the vLLM PR's roadmap includes it.

Yes, we are working on XpYd and the scheduler in an internal version. We have also found some other teams working on this. I think the vLLM community will welcome various XpYd implementations, which will help the community find the most efficient and practical way to support this.

@artetaout
Author

> Hi, do you have a plan to support XpYd based on the current version? As far as we can see, the vLLM PR's roadmap includes it.
>
> Yes, we are working on XpYd and the scheduler in an internal version. We have also found some other teams working on this. I think the vLLM community will welcome various XpYd implementations, which will help the community find the most efficient and practical way to support this.

So, when will you release XpYd?

@artetaout
Author

artetaout commented Dec 5, 2024

> Hi, do you have a plan to support XpYd based on the current version? As far as we can see, the vLLM PR's roadmap includes it.
>
> Yes, we are working on XpYd and the scheduler in an internal version. We have also found some other teams working on this. I think the vLLM community will welcome various XpYd implementations, which will help the community find the most efficient and practical way to support this.

So, when will you release the XpYd version?

@ShangmingCai
Collaborator

> Hi, do you have a plan to support XpYd based on the current version? As far as we can see, the vLLM PR's roadmap includes it.
>
> Yes, we are working on XpYd and the scheduler in an internal version. We have also found some other teams working on this. I think the vLLM community will welcome various XpYd implementations, which will help the community find the most efficient and practical way to support this.
>
> So, when will you release the XpYd version?

This feature is still under development and testing, and there is no clear release date yet. If you are interested, you can follow the progress of this project and the vLLM community.

@ShangmingCai
Collaborator

Since v0.2 has been released, which addresses the port conflict issues, I think we can close this issue for now.

Feel free to reopen it or raise a new issue if needed.
