[distributed] add function to create ipc buffers directly #10064
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
The test failure is because we don't have A100 GPUs in our CI queue.
@youkaichao I can confirm this works on my machine.
world_size = dist.get_world_size(group=group)
rank = dist.get_rank(group=group)
handles = [None] * world_size
dist.all_gather_object(handles, handle, group=group)
Why do we need to use broadcast with `device=cpu` in `_gather_ipc_meta`, but not here?
When you use this function, the `group` argument should be the `cpu_group` passed to the custom allreduce object.
See vllm/vllm/distributed/parallel_state.py, lines 231 to 236 at 4089985:
if use_custom_allreduce and self.world_size > 1:
    # Initialize a custom fast all-reduce implementation.
    self.ca_comm = CustomAllreduce(
        group=self.cpu_group,
        device=self.device,
    )
Why is `all_gather` fine here but not in `_gather_ipc_meta`?
Oh, that is because we ran into some issues with `all_gather` on tensors directly. Here we are using `all_gather_object`, so it should be fine. See pytorch/pytorch#126032 for the PyTorch issue.
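For reference, a minimal standalone sketch of the pattern discussed above (assumed setup, not vLLM's actual wiring): gathering a picklable per-rank payload, such as raw IPC handle bytes, over a CPU (gloo) group with `all_gather_object`, which sidesteps the tensor `all_gather` issue linked above.

```python
# Minimal sketch, assuming a gloo (CPU) process group initialized via env://
# (e.g. launched with torchrun). The payload is a placeholder for IPC handle bytes.
import torch.distributed as dist

dist.init_process_group(backend="gloo")
world_size = dist.get_world_size()
rank = dist.get_rank()

handle = f"handle-bytes-from-rank-{rank}".encode()  # placeholder payload
handles = [None] * world_size
# all_gather_object pickles arbitrary Python objects, so no pre-allocated,
# equally sized GPU tensors are needed (unlike tensor all_gather).
dist.all_gather_object(handles, handle)
assert len(handles) == world_size
```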
I see. Someone refactored this to return a Tensor:
https://github.com/vllm-project/vllm/pull/5047/files#diff-44d9d733ee604800cbce9858a9201db1044aeff2c940fa4a0521d0c9b6541b3eL137
A better approach would be to return a string, if the torch bindings don't support int8 directly.
Yeah, a string should be fine. #5047 aims to get rid of pybind11 so that we can release Python-version-agnostic wheels.
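A hedged sketch of what the string route could look like (hypothetical helpers, not the binding added in this PR): hex-encode the 64-byte CUDA IPC handle so it can cross an interface that only accepts strings, then decode it back to raw bytes on the receiving side.

```python
# Hypothetical helpers for passing a cudaIpcMemHandle_t's raw bytes as a string.
def handle_to_str(handle: bytes) -> str:
    return handle.hex()

def str_to_handle(s: str) -> bytes:
    return bytes.fromhex(s)

# Round trip: 64 is sizeof(cudaIpcMemHandle_t) in the CUDA runtime API.
raw = bytes(64)
assert str_to_handle(handle_to_str(raw)) == raw
```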
…lm-project#10064)" This reverts commit 4be3a45.
…ct#10064) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Loc Huynh <jc1da.3011@gmail.com>
…ct#10064) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>
…ct#10064) Signed-off-by: youkaichao <youkaichao@gmail.com>
…ct#10064) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
…ct#10064) Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
…ct#10064) Signed-off-by: youkaichao <youkaichao@gmail.com>
PyTorch's IPC handle format can change, and using PyTorch for CUDA IPC means we are exposed to those changes. See #9815 for an example.
cc @hanzhi713
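To illustrate what "creating IPC buffers directly" means without going through PyTorch's allocator, here is a hedged sketch (assuming ctypes access to the CUDA runtime; not this PR's actual implementation): allocate device memory with `cudaMalloc` and export its IPC handle with `cudaIpcGetMemHandle`, yielding 64 raw bytes that do not depend on PyTorch's handle format and can be exchanged with `all_gather_object`.

```python
import ctypes

# Assumes the CUDA runtime library is on the loader path.
cudart = ctypes.CDLL("libcudart.so")

CUDA_IPC_HANDLE_SIZE = 64  # sizeof(cudaIpcMemHandle_t)

class CudaIpcMemHandle(ctypes.Structure):
    _fields_ = [("reserved", ctypes.c_byte * CUDA_IPC_HANDLE_SIZE)]

def create_ipc_buffer(size_in_bytes: int):
    """Allocate device memory and return (device pointer, raw IPC handle bytes)."""
    ptr = ctypes.c_void_p()
    err = cudart.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(size_in_bytes))
    assert err == 0, f"cudaMalloc failed: {err}"
    handle = CudaIpcMemHandle()
    err = cudart.cudaIpcGetMemHandle(ctypes.byref(handle), ptr)
    assert err == 0, f"cudaIpcGetMemHandle failed: {err}"
    # The 64 handle bytes are picklable and independent of PyTorch's IPC format.
    return ptr, bytes(handle)
```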