-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[train] add placement group support #20091
Conversation
self._placement_group. | ||
""" | ||
current_placement_group = get_current_placement_group() | ||
should_capture_child_tasks_in_placement_group = \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Man I wish there was a better way to handle this. Do we have a shared doc with Tune on asks for placement group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #20196 to track this!
@@ -279,6 +286,9 @@ def execute_single(self, worker_index: int, func: Callable[..., T], *args, | |||
def remove_workers(self, worker_indexes: List[int]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now with placement groups the semantics here get a little tricky since we can’t add workers without first removing them if placement groups is enabled.
Do you think we should do either of the following?
- Make remove_workers and add_workers private, and only expose a single replace_workers API. This will ensure that remove and add are called atomically (except for the initial creation) and your placement group will always be “full”.
- Keep track of the free bundles in the placement group and if add_workers is called without free resources in the placement group, then raise an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point, my initial thoughts are:
- This is reasonable for the current system, but I don't think it should be a requirement of the
WorkerGroup
/ActorGroup
. With the presented API we don't really make any assumptions on the size of the placement group. - This could get tricky keeping track of what a "free" bundle is (if worker:bundle is 1:1 this is easier but not sure if this makes sense). Would prefer to leave this up to placement groups to be responsible for this and raise the error when this happens ([placement groups/autoscaler] unfulfillable requests should raise an error #18018).
I do actually think #18524 is the preferred solution for Ray Train and placement groups are just a workaround to achieve this functionality, so it doesn't seem right to me to build WorkerGroup
around placement group semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment to add_workers
to warn any users about this.
- Even prior to this change, the same undefined behavior would be present if
add_workers
is called within a parent placement group or if the cluster is full. - The only usecase in
TensorflowBackend
actually has some custom logic in betweenremove_workers
andadd_workers
.
|
||
# Integer value which if set will change the placement group strategy from | ||
# PACK to SPREAD. 1 for True, 0 for False. | ||
TRAIN_ENABLE_WORKER_SPREAD_ENV =\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I have to set this on all the machines (including HEAD and non-HEAD). Why is that? Isn't the scheduling part handled by head node (and partially client side)? @matthewdeng
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @HuangLED, what version of Ray are you using? Also are you using Ray Client?
This should be addressed in master (via this commit) so that the environment variable should only need to be set on the driver, but even without this change the environment variable should only be needed on the host where the BackendExecutor
is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using v1.9.0 I think.
Yes, I am using client mode. e.g. Two machines M1 and M2, while M1 being the head. Then connecting from a third machine using ray.init("ray://M1:port").
Solely setting the env on M1 does not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you verify that the BackendExecutor
process is running on M1?
Why are these changes needed?
Placement groups are used as an implementation detail for scheduling Ray Train workers.
Scheduling behavior is as follows:
Trainer
is already within a placement group andshould_capture_child_tasks_in_placement_group
is True, then the workers will inherit this placement group. This is currently used for theTune
integration whereTune
will manage the placement groups.PACK
strategy. This is primarily done with the intention to improve GPU training by reducing network communication overhead.SPREAD
strategy by setting theTRAIN_ENABLE_WORKER_SPREAD
environment variable to1
. Use-cases that may require this strategy include CPU-only training and elastic training. In the future this can be promoted up to theTrainer
API when the need becomes more established.Related issue number
Closes #19610
Checks
scripts/format.sh
to lint the changes in this PR.