
[Ray backend] Better error when pg topology is bad. #7584

Merged

Conversation

rkooo567
Collaborator

@rkooo567 rkooo567 commented Aug 16, 2024

Closes #6956
Closes #2406

  • If the resources of the driver node are not reserved by the placement group, it raises a RuntimeError.
  • If the node has fewer GPUs reserved than the TP size, it raises an error (maybe it could be a warning log as well; to be discussed). A rough sketch of both checks is shown below.

Also added some tests
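
For reference, a rough sketch of the two checks described above, assuming a precomputed map of GPUs reserved per node (all names here are illustrative, not the actual vLLM helpers):

# Illustrative sketch only; verify_pg_topology and gpus_reserved_per_node are
# hypothetical names, not the actual vLLM code.
def verify_pg_topology(gpus_reserved_per_node: dict, current_node_id: str,
                       tensor_parallel_size: int) -> None:
    # 1. The placement group must reserve resources on the driver node.
    if current_node_id not in gpus_reserved_per_node:
        raise RuntimeError(
            "The placement group does not reserve any resources on the "
            "driver node; vLLM must be able to place a worker locally.")
    # 2. If the node has fewer GPUs reserved than the TP size, raise
    #    (or possibly just warn): TP would otherwise have to span nodes.
    if gpus_reserved_per_node[current_node_id] < tensor_parallel_size:
        raise RuntimeError(
            f"Only {gpus_reserved_per_node[current_node_id]} GPU(s) are "
            f"reserved on the driver node, but tensor_parallel_size="
            f"{tensor_parallel_size}.")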

PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and might not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@rkooo567 rkooo567 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 16, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

Member

@youkaichao youkaichao left a comment

can this guarantee the driver worker node is included in the placement group?

@rkooo567
Collaborator Author

rkooo567 commented Aug 16, 2024

ah, I misunderstood the requirement.

I think there's a way to get it working. Are we aligned on these semantics?

  • if all TP workers cannot be placed on the same node, error out -> just read your msg; we will just warn instead
  • if the current node is not included in the pg, error out (and we internally prioritize scheduling on the current node).

@youkaichao
Member

  1. if no pg found, create a pg and make sure the current node is in the pg
  2. if we already have a pg, but current node is not included in the pg, error out
  3. if tp crosses nodes, warn

@rkooo567
Collaborator Author

gotcha. ETA eod today

@rkooo567
Collaborator Author

rkooo567 commented Aug 17, 2024

Addressed comments:

  • The first bundle always requires the node IP resource, which means the first bundle always has to be created on the current node (a minimal sketch follows this list).
    • Unfortunately, there's no API to use node_id for this in Ray right now... we are planning to add it, but it will be after Oct.
  • Better error message while waiting to create a placement group.
  • logger.warning if TP workers are not on the same node.
  • The validation logic checks that at least one bundle in the pg is created on the current node.
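
A minimal sketch of the first point, assuming a homogeneous one-GPU-per-bundle setup (world_size, ray.init(), and the PACK strategy are illustrative choices here, not necessarily what vLLM uses):

import ray
from ray.util.placement_group import placement_group

ray.init()
world_size = 2  # illustrative bundle count
current_ip = ray.util.get_node_ip_address()
placement_group_specs = [{"GPU": 1} for _ in range(world_size)]
# Every Ray node exposes a custom resource named "node:<ip>"; requesting a
# tiny amount of it in the first bundle pins that bundle to the current node.
placement_group_specs[0][f"node:{current_ip}"] = 0.001
pg = placement_group(placement_group_specs, strategy="PACK")
ray.get(pg.ready())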


from vllm.utils import cuda_device_count_stateless

TARGET_TEST_SUITE = os.environ.get("TARGET_TEST_SUITE", "L4")
Member

please clean up the tests and remove unnecessary code. this test does not need TARGET_TEST_SUITE I think.

f"available in a node {current_node_id=} {current_ip=}.")
# This way, at least one bundle is required to be created on the current
# node.
placement_group_specs[0][f"node:{current_ip}"] = 0.001
Member

is this the key to make sure current node is included in the placement group?

Collaborator Author

Yes, that's right. I couldn't find another clean way to support this, unfortunately...

@rkooo567
Collaborator Author

will fix tests and address comments by this sun


# Simulate 2 node clusters, 1 GPU each.
cluster = Cluster()
head_node = cluster.add_node(num_cpus=8, num_gpus=1, resources={"head": 1})
Member

why num_cpus=8 here? is 8 some magic number?

We don't have any guarantee on the number of GPUs available for this 2-GPU test.

Collaborator Author

Ray doesn't require real hardware CPUs when specifying resources like this; 8 is just an arbitrary number. (I think technically num_cpus=1 should also work.)

@pytest.mark.parametrize("model, distributed_executor_backend", [
    ("facebook/opt-125m", "ray"),
])
def test_multi_node_bad_topology(
Member

@youkaichao youkaichao Aug 19, 2024

I think you should test it with 2 nodes test (where ray cluster is already created), and try to launch the test with 1 gpu from both the head node and the worker node, and make sure the vllm instance is scheduled with the same node as the process who launched it.

Collaborator Author

Hmm, yeah, I think we need real 2 nodes if we want to use node IP resources. Let me fix it.

@youkaichao youkaichao mentioned this pull request Aug 20, 2024
Comment on lines 55 to 61
@ray.remote(num_gpus=1)
class Actor:
pass

# The actors are created on the head node.
actors = [Actor.remote() for _ in range(2)] # type: ignore
ray.get([a.__ray_ready__.remote() for a in actors])
Member

is it guaranteed that these actors are allocated in the head node?

Collaborator Author

Yes. It is a bit of an implementation detail, but actors are prioritized to be scheduled on the node that creates them.

Good point, though. Let me just use https://docs.ray.io/en/master/ray-core/api/doc/ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy.html#ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy so that we don't need to rely on implementation details.
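
Something along these lines, as a sketch (assumes Ray is already initialized by the test cluster and that the driver runs on the head node):

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote(num_gpus=1)
class Actor:
    pass

# Pin the actors to the node the driver runs on instead of relying on Ray's
# default locality preference.
head_node_id = ray.get_runtime_context().get_node_id()
actors = [
    Actor.options(scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=head_node_id, soft=False)).remote()
    for _ in range(2)
]
ray.get([a.__ray_ready__.remote() for a in actors])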

@youkaichao
Member

Looks like there is a misunderstanding.

What I want to test is basically:

        worker_wrapper_kwargs = self._get_worker_wrapper_args()
        for bundle_id, bundle in enumerate(placement_group.bundle_specs):
            if not bundle.get("GPU", 0):
                continue
            scheduling_strategy = PlacementGroupSchedulingStrategy(
                placement_group=placement_group,
                placement_group_capture_child_tasks=True,
                placement_group_bundle_index=bundle_id,
            )

            worker = ray.remote(
                num_cpus=0,
                num_gpus=num_gpus,
                scheduling_strategy=scheduling_strategy,
                **ray_remote_kwargs,
            )(RayWorkerWrapper).remote(**worker_wrapper_kwargs)

When we create actors (which need 2 GPUs) within the placement group (which has 4 GPUs across 2 nodes), we can make sure the actors are scheduled on the head node.

@rkooo567
Collaborator Author

Discussed offline. We want to make sure all GPUs on the current node are used before other nodes are used. Will modify code and tests accordingly. Sorry for the delay.

@youkaichao youkaichao removed the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 22, 2024
@youkaichao youkaichao merged commit c01a6cb into vllm-project:main Aug 23, 2024
22 of 29 checks passed
omrishiv pushed a commit to omrishiv/vllm that referenced this pull request Aug 26, 2024
@darthhexx
Contributor

These changes have unfortunately broken Ray 2.9 compatibility, because ray._private.state.available_resources_per_node is named _available_resources_per_node in 2.9.3 and only named available_resources_per_node from 2.10 onwards.

Does this mean 2.9 is no longer supported, or should we conditionally import the correct function instead?

@rkooo567
Collaborator Author

Hmm, I guess this has never been a stable API, and that's probably the issue.

I think we can conditionally import this. @darthhexx would you like to tackle this real quick?
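
For reference, the conditional import could look roughly like this (assuming only the attribute name changed between Ray 2.9.x and 2.10):

try:
    # Ray >= 2.10
    from ray._private.state import available_resources_per_node
except ImportError:
    # Ray 2.9.x exposes the same helper under a leading-underscore name.
    from ray._private.state import (
        _available_resources_per_node as available_resources_per_node)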

@darthhexx
Contributor

Yip, on it.

@darthhexx
Contributor

darthhexx commented Sep 24, 2024

Prepped a PR to conditionally import, but as mentioned in the PR description we also need the fastapi bump in #8212 to be pinned to match Ray 2.9.x to keep Ray 2.9 compatibility.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: youkaichao <youkaichao@126.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024