[train] add placement group support #20091
Changes from 3 commits
Commits: 5a4dbef, 7118815, f15ec69, 79478c3, d6bb724
@@ -49,3 +49,12 @@
 # Backend.share_cuda_visible_devices. 1 for True, 0 for False.
 ENABLE_SHARE_CUDA_VISIBLE_DEVICES_ENV =\
     "TRAIN_ENABLE_SHARE_CUDA_VISIBLE_DEVICES"
+
+# Integer value which indicates the number of seconds to wait when creating
+# the worker placement group before timing out.
+TRAIN_PLACEMENT_GROUP_TIMEOUT_S_ENV = "TRAIN_PLACEMENT_GROUP_TIMEOUT_S"
+
+# Integer value which if set will change the placement group strategy from
+# PACK to SPREAD. 1 for True, 0 for False.
+TRAIN_ENABLE_WORKER_SPREAD_ENV =\
+    "TRAIN_ENABLE_WORKER_SPREAD"

Review thread on this hunk:

Comment: Looks like I have to set this on all the machines (including HEAD and non-HEAD). Why is that? Isn't the scheduling part handled by the head node (and partially the client side)? @matthewdeng

Reply: Hey @HuangLED, what version of Ray are you using? Also, are you using Ray Client? This should be addressed in master (via this commit) so that the environment variable should only need to be set on the driver, but even without this change the environment variable should only be needed on the host where the …

Reply: We are using v1.9.0, I think. Yes, I am using client mode. E.g. two machines M1 and M2, with M1 being the head, then connecting from a third machine using ray.init("ray://M1:port"). Solely setting the env on M1 does not work.

Reply: Can you verify that the …
@@ -1,11 +1,12 @@
-import logging
 import socket
 from dataclasses import dataclass
+import logging
-from typing import Callable, List, TypeVar, Optional, Dict, Type, Tuple
+from typing import Callable, List, TypeVar, Optional, Dict, Type, Tuple, Union

 import ray
 from ray.actor import ActorHandle
 from ray.types import ObjectRef
+from ray.util.placement_group import PlacementGroup

 T = TypeVar("T")
@@ -105,6 +106,9 @@ class WorkerGroup:
             remote actors.
         remote_cls_args, remote_cls_kwargs: If ``remote_cls`` is provided,
             these args will be used for the worker initialization.
+        placement_group (PlacementGroup|str): The placement group that workers
+            should be created in. Defaults to "default" which will inherit the
+            parent placement group (if child tasks should be captured).

     Example:
@@ -125,7 +129,8 @@ def __init__(
             additional_resources_per_worker: Optional[Dict[str, float]] = None,
             actor_cls: Type = None,
             actor_cls_args: Optional[Tuple] = None,
-            actor_cls_kwargs: Optional[Dict] = None):
+            actor_cls_kwargs: Optional[Dict] = None,
+            placement_group: Union[PlacementGroup, str] = "default"):

         if num_workers <= 0:
             raise ValueError("The provided `num_workers` must be greater "
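For illustration, here is a sketch of constructing a WorkerGroup against an explicitly created placement group instead of relying on the "default" inheritance behavior. The import path for WorkerGroup and the bundle shape are assumptions made for the sketch, not something stated in this diff.

```python
import ray
from ray.util.placement_group import placement_group

# Assumed import path, based on the module being edited in this PR.
from ray.train.worker_group import WorkerGroup

ray.init()

# Reserve one 1-CPU bundle per worker; PACK co-locates the bundles when possible.
pg = placement_group([{"CPU": 1}] * 4, strategy="PACK")
ray.get(pg.ready())  # block until all bundles are reserved

# Workers are then scheduled inside the reserved bundles.
wg = WorkerGroup(num_workers=4, placement_group=pg)
```

Passing the string "default" instead keeps the documented behavior of inheriting the caller's placement group when child-task capture is enabled.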
@@ -152,6 +157,8 @@ def __init__(
         self._actor_cls_args = actor_cls_args or []
         self._actor_cls_kwargs = actor_cls_kwargs or {}

+        self._placement_group = placement_group
+
         # TODO(matt): Validate resources. Fast-fail if it is impossible to
         # handle the request, rather than hang indefinitely.
         self._remote_cls = ray.remote(
@@ -279,6 +286,9 @@ def execute_single(self, worker_index: int, func: Callable[..., T], *args,
     def remove_workers(self, worker_indexes: List[int]):
         """Removes the workers with the specified indexes.

+        The removed workers will go out of scope and their actor processes
+        will be terminated.
+
         Args:
             worker_indexes (List[int]): The indexes of the workers to remove.
         """

Review thread on remove_workers:

Comment: Now with placement groups the semantics here get a little tricky, since we can't add workers without first removing them if placement groups are enabled. Do you think we should do either of the following? …

Reply: Great point, my initial thoughts are: … I do actually think #18524 is the preferred solution for Ray Train and placement groups are just a workaround to achieve this functionality, so it doesn't seem right to me to build …

Reply: I added a comment to …
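To make the constraint raised in this thread concrete, here is a small, self-contained sketch using plain Ray actors (not Ray Train internals), following the same .options(placement_group=...) pattern this PR uses: once every bundle in a placement group is occupied, an additional actor targeted at the group simply stays pending until capacity is freed.

```python
import ray
from ray.util.placement_group import placement_group


@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "ok"


ray.init(num_cpus=2)

# Two 1-CPU bundles; two workers occupy them completely.
pg = placement_group([{"CPU": 1}] * 2)
ray.get(pg.ready())
workers = [Worker.options(placement_group=pg).remote() for _ in range(2)]
print(ray.get([w.ping.remote() for w in workers]))  # ['ok', 'ok']

# A third worker targeted at the same group cannot be placed until one of the
# existing workers is removed, so it stays pending (awaiting its method calls
# would hang here).
extra = Worker.options(placement_group=pg).remote()
```

That is the behavior behind the comment above: with a fixed set of reserved bundles, workers can be replaced but not grown beyond the original reservation.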
@@ -297,8 +307,9 @@ def add_workers(self, num_workers: int):
         new_actors = []
         new_actor_metadata = []
         for _ in range(num_workers):
-            actor = self._remote_cls.remote(*self._actor_cls_args,
-                                            **self._actor_cls_kwargs)
+            actor = self._remote_cls.options(
+                placement_group=self._placement_group).remote(
+                    *self._actor_cls_args, **self._actor_cls_kwargs)
             new_actors.append(actor)
             new_actor_metadata.append(
                 actor._BaseWorkerMixin__execute.remote(construct_metadata))
Review thread on add_workers:

Comment: Man, I wish there was a better way to handle this. Do we have a shared doc with Tune on asks for placement groups?

Reply: Created #20196 to track this!
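The add_workers change schedules each new actor via .options(placement_group=...), but the creation of the placement group itself is not shown in these hunks. As a hedged sketch only, here is how the two environment variables from the constants file could plausibly drive that step; the real Ray Train logic may differ, and the fallback timeout below is a placeholder rather than the actual default.

```python
import os

import ray
from ray.exceptions import GetTimeoutError
from ray.util.placement_group import placement_group

ray.init()

# Assumed, simplified reading of the env vars introduced earlier in this PR.
timeout_s = int(os.environ.get("TRAIN_PLACEMENT_GROUP_TIMEOUT_S", "100"))  # placeholder default
strategy = "SPREAD" if os.environ.get("TRAIN_ENABLE_WORKER_SPREAD", "0") == "1" else "PACK"

# One bundle per worker, shaped by the per-worker resources.
pg = placement_group([{"CPU": 1}] * 4, strategy=strategy)
try:
    # Fail fast instead of hanging forever if the cluster cannot satisfy the request.
    ray.get(pg.ready(), timeout=timeout_s)
except GetTimeoutError:
    raise TimeoutError(
        f"Placement group was not ready within {timeout_s} seconds; "
        "the cluster may not have enough resources.")
```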