
[Ray backend] Better error when pg topology is bad. #7584

Merged

Conversation

rkooo567
Collaborator

@rkooo567 rkooo567 commented Aug 16, 2024

Closes #6956
Closes #2406

  • If the resources of the driver node are not reserved by the placement group, it raises a RuntimeError.
  • If the node has fewer GPUs reserved than the TP size, it raises an error (maybe it could be a warning log as well; to be discussed). A rough sketch of both checks is shown below.

Also added some tests
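
For reference, a rough sketch of the two checks described above, assuming a precomputed map of GPUs reserved per node (all names here are illustrative, not the actual vLLM helpers):

# Illustrative sketch only; verify_pg_topology and gpus_reserved_per_node are
# hypothetical names, not the actual vLLM code.
def verify_pg_topology(gpus_reserved_per_node: dict, current_node_id: str,
                       tensor_parallel_size: int) -> None:
    # 1. The placement group must reserve resources on the driver node.
    if current_node_id not in gpus_reserved_per_node:
        raise RuntimeError(
            "The placement group does not reserve any resources on the "
            "driver node; vLLM must be able to place a worker locally.")
    # 2. If the node has fewer GPUs reserved than the TP size, raise
    #    (or possibly just warn): TP would otherwise have to span nodes.
    if gpus_reserved_per_node[current_node_id] < tensor_parallel_size:
        raise RuntimeError(
            f"Only {gpus_reserved_per_node[current_node_id]} GPU(s) are "
            f"reserved on the driver node, but tensor_parallel_size="
            f"{tensor_parallel_size}.")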

PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and might not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@rkooo567 rkooo567 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 16, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

Member

@youkaichao youkaichao left a comment

can this guarantee the driver worker node is included in the placement group?

@rkooo567
Collaborator Author

rkooo567 commented Aug 16, 2024

ah, I misunderstood the requirement.

I think there's a way to get it working. Are we aligned on these semantics?

  • if all TP workers cannot be placed on the same node, error out -> just read your msg; we will just warn instead
  • if the current node is not included in the pg, error out (and we internally prioritize scheduling on the current node).

@youkaichao
Member

  1. if no pg found, create a pg and make sure the current node is in the pg
  2. if we already have a pg, but current node is not included in the pg, error out
  3. if tp crosses nodes, warn

@rkooo567
Collaborator Author

gotcha. ETA eod today

@rkooo567
Collaborator Author

rkooo567 commented Aug 17, 2024

Addressed comments:

  • The first bundle always requires the node IP resource, which means the first bundle always has to be created on the current node (a minimal sketch follows this list).
    • Unfortunately, there's no API to use node_id for this in Ray right now... we are planning to add it, but it will be after Oct.
  • Better error message while waiting to create a placement group.
  • logger.warning if TP workers are not on the same node.
  • The validation logic checks that at least one bundle in the pg is created on the current node.
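
A minimal sketch of the first point, assuming a homogeneous one-GPU-per-bundle setup (world_size, ray.init(), and the PACK strategy are illustrative choices here, not necessarily what vLLM uses):

import ray
from ray.util.placement_group import placement_group

ray.init()
world_size = 2  # illustrative bundle count
current_ip = ray.util.get_node_ip_address()
placement_group_specs = [{"GPU": 1} for _ in range(world_size)]
# Every Ray node exposes a custom resource named "node:<ip>"; requesting a
# tiny amount of it in the first bundle pins that bundle to the current node.
placement_group_specs[0][f"node:{current_ip}"] = 0.001
pg = placement_group(placement_group_specs, strategy="PACK")
ray.get(pg.ready())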


from vllm.utils import cuda_device_count_stateless

TARGET_TEST_SUITE = os.environ.get("TARGET_TEST_SUITE", "L4")
Member

please clean up the tests and remove unnecessary code. this test does not need TARGET_TEST_SUITE I think.

f"available in a node {current_node_id=} {current_ip=}.")
# This way, at least one bundle is required to be created on the current
# node.
placement_group_specs[0][f"node:{current_ip}"] = 0.001
Member

is this the key to make sure current node is included in the placement group?

Collaborator Author

Yes, that's right. I couldn't find another clean way to support this, unfortunately...

@rkooo567
Collaborator Author

will fix tests and address comments by this sun


# Simulate 2 node clusters, 1 GPU each.
cluster = Cluster()
head_node = cluster.add_node(num_cpus=8, num_gpus=1, resources={"head": 1})
Member

why num_cpus=8 here? is 8 some magic number?

We don't have any guarantee on the number of GPUs available for this 2-GPU test.

Collaborator Author

Ray doesn't require real hardware CPUs when specifying resources like this; 8 is just an arbitrary number. (I think technically num_cpus=1 should also work.)

@pytest.mark.parametrize("model, distributed_executor_backend", [
    ("facebook/opt-125m", "ray"),
])
def test_multi_node_bad_topology(
Member

@youkaichao youkaichao Aug 19, 2024

I think you should test it with 2 nodes test (where ray cluster is already created), and try to launch the test with 1 gpu from both the head node and the worker node, and make sure the vllm instance is scheduled with the same node as the process who launched it.

Collaborator Author

Hmm, yeah, I think we need real 2 nodes if we want to use node IP resources. Let me fix it.

@youkaichao youkaichao mentioned this pull request Aug 20, 2024
Comment on lines 55 to 61
@ray.remote(num_gpus=1)
class Actor:
pass

# The actors are created on the head node.
actors = [Actor.remote() for _ in range(2)] # type: ignore
ray.get([a.__ray_ready__.remote() for a in actors])
Member

is it guaranteed that these actors are allocated in the head node?

Collaborator Author

Yes. It is a bit of an implementation detail, but actors are prioritized to be scheduled on the node that creates them.

Good point, though. Let me just use https://docs.ray.io/en/master/ray-core/api/doc/ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy.html#ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy so that we don't need to rely on implementation details.
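
Something along these lines, as a sketch (assumes Ray is already initialized by the test cluster and that the driver runs on the head node):

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote(num_gpus=1)
class Actor:
    pass

# Pin the actors to the node the driver runs on instead of relying on Ray's
# default locality preference.
head_node_id = ray.get_runtime_context().get_node_id()
actors = [
    Actor.options(scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=head_node_id, soft=False)).remote()
    for _ in range(2)
]
ray.get([a.__ray_ready__.remote() for a in actors])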

@youkaichao
Member

Looks like there is a misunderstanding.

What I want to test is basically:

        worker_wrapper_kwargs = self._get_worker_wrapper_args()
        for bundle_id, bundle in enumerate(placement_group.bundle_specs):
            if not bundle.get("GPU", 0):
                continue
            scheduling_strategy = PlacementGroupSchedulingStrategy(
                placement_group=placement_group,
                placement_group_capture_child_tasks=True,
                placement_group_bundle_index=bundle_id,
            )

            worker = ray.remote(
                num_cpus=0,
                num_gpus=num_gpus,
                scheduling_strategy=scheduling_strategy,
                **ray_remote_kwargs,
            )(RayWorkerWrapper).remote(**worker_wrapper_kwargs)

When we create actors (which need 2 GPUs) within the placement group (which has 4 GPUs across 2 nodes), we can make sure the actors are scheduled on the head node.

@rkooo567
Collaborator Author

Discussed offline. We want to make sure all GPUs on the current node are used before other nodes are used. Will modify code and tests accordingly. Sorry for the delay.

@youkaichao youkaichao removed the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 22, 2024
@youkaichao youkaichao merged commit c01a6cb into vllm-project:main Aug 23, 2024
22 of 29 checks passed
omrishiv pushed a commit to omrishiv/vllm that referenced this pull request Aug 26, 2024
@darthhexx
Contributor

These changes have unfortunately broken Ray 2.9 compatibility, because ray._private.state.available_resources_per_node is named _available_resources_per_node in 2.9.3 and only named available_resources_per_node from 2.10 onwards.

Does this mean 2.9 is no longer supported, or should we conditionally import the correct function instead?

@rkooo567
Collaborator Author

Hmm, I guess this has never been a stable API, and that's probably the issue.

I think we can conditionally import this. @darthhexx would you like to tackle this real quick?
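
For reference, the conditional import could look roughly like this (assuming only the attribute name changed between Ray 2.9.x and 2.10):

try:
    # Ray >= 2.10
    from ray._private.state import available_resources_per_node
except ImportError:
    # Ray 2.9.x exposes the same helper under a leading-underscore name.
    from ray._private.state import (
        _available_resources_per_node as available_resources_per_node)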

@darthhexx
Contributor

Yip, on it.

@darthhexx
Contributor

darthhexx commented Sep 24, 2024

Prepped a PR to conditionally import, but as mentioned in the PR description we also need the fastapi bump in #8212 to be pinned to match Ray 2.9.x to keep Ray 2.9 compatibility.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: youkaichao <youkaichao@126.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024