[core] Fix issues with worker churn in WorkerPool #36766
Merged
stephanie-wang merged 11 commits into ray-project:master from stephanie-wang:fix-worker-pool on Jun 27, 2023
Conversation
ericl approved these changes on Jun 23, 2023
LGTM. It's probably worth re-running the nightly tests prior to merging.
/// The soft limit of the number of workers to keep around.
/// We apply this limit to the idle workers instead of total workers,
/// because the total number of workers used depends on the
/// application. -1 means using the available number of CPUs.
RAY_CONFIG(int64_t, num_workers_soft_limit, -1)
Suggested change:
- RAY_CONFIG(int64_t, num_workers_soft_limit, -1)
+ RAY_CONFIG(int64_t, num_idle_workers_soft_limit, -1)
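To illustrate the semantics spelled out in the config comment above, here is a minimal sketch of how a soft-limit value of -1 could fall back to the node's CPU count. This is an illustration under assumptions only; ResolveIdleWorkerSoftLimit is a hypothetical helper and not Ray's actual code.

// Hypothetical helper, not part of Ray: resolves the configured soft limit,
// where -1 means "use the number of CPUs available on this node".
#include <cstdint>
#include <thread>

int64_t ResolveIdleWorkerSoftLimit(int64_t configured_limit) {
  if (configured_limit >= 0) {
    return configured_limit;  // Explicit user override from the config.
  }
  // Fall back to the node's CPU count when the limit is left at -1.
  return static_cast<int64_t>(std::thread::hardware_concurrency());
}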
ericl added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Jun 23, 2023
stephanie-wang force-pushed the fix-worker-pool branch from 59bd2f4 to 19f98d3 on June 26, 2023 16:09
Running nightly tests here.
arvind-chandra pushed a commit to lmco/ray that referenced this pull request on Aug 31, 2023
Labels
@author-action-required
Ray 2.6
Why are these changes needed?
#36669 lists some issues found with workers not being reused. This turns out to have two root causes:
1. We use the total number of CPUs as a soft limit for the total number of worker processes allowed. This leads to worker churn when some processes are actively using resources but less than 1 CPU (e.g., actors), while other tasks are trying to use the remaining cores. We always end up over the soft limit, so we have to keep killing and restarting workers.
2. (Less serious) Workers are usually slow on their first task. Workers are selected for tasks in LIFO order, and currently a worker that has just been started is inserted last, which leads to a slowdown on the next task. This isn't serious on its own, but it becomes a performance issue when there is worker churn and we end up killing a warmed-up idle worker instead of the cold one.

This PR makes some changes to improve the idle worker killing:
- Use the available CPUs as a limit for the number of idle workers allowed, instead of total CPUs as a limit for total workers allowed. When no CPUs are being used and/or all tasks use exactly 1 CPU, the new policy is equivalent to the old one.
- The num_workers_soft_limit config override option is now used as a soft limit for idle workers instead of total workers.
- Workers that were just started are now inserted at the beginning of the idle worker queue so that they are prioritized to be killed over warmed-up workers. They are kept alive for at least the idle timeout, as usual.

Related issue number
Closes #36669.
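Below is a minimal sketch of the idle-worker culling policy described above, using hypothetical types and names (IdleWorkerQueue, WorkersToKill); it is an illustration under assumptions, not the actual WorkerPool implementation. Newly started (cold) workers are placed at the front of the idle queue so they are the first candidates for killing, and workers are only killed while the idle count exceeds the available-CPU soft limit and only after they have been idle for the timeout.

// Hypothetical sketch of the policy described in the PR, not Ray's WorkerPool code.
#include <chrono>
#include <cstdint>
#include <deque>
#include <vector>

using Clock = std::chrono::steady_clock;

struct IdleWorker {
  int worker_id;                 // Hypothetical identifier.
  Clock::time_point idle_since;  // When the worker last became idle.
};

struct IdleWorkerQueue {
  std::deque<IdleWorker> workers;

  // Newly started workers go to the FRONT so they are culled before
  // warmed-up workers, which sit toward the back of the queue.
  void AddNewlyStartedWorker(int worker_id) {
    workers.push_front({worker_id, Clock::now()});
  }

  // Workers that just finished a task are warm; keep them toward the back.
  void AddWarmWorker(int worker_id) {
    workers.push_back({worker_id, Clock::now()});
  }

  // Kill idle workers only while their count exceeds the soft limit
  // (the number of *available* CPUs, not total CPUs), and only once they
  // have been idle for at least `idle_timeout`.
  std::vector<int> WorkersToKill(int64_t soft_limit,
                                 std::chrono::milliseconds idle_timeout) {
    std::vector<int> to_kill;
    const auto now = Clock::now();
    while (!workers.empty() &&
           static_cast<int64_t>(workers.size()) > soft_limit) {
      const IdleWorker &candidate = workers.front();
      if (now - candidate.idle_since < idle_timeout) {
        break;  // Every worker is kept alive for at least the idle timeout.
      }
      to_kill.push_back(candidate.worker_id);
      workers.pop_front();
    }
    return to_kill;
  }
};

When no CPUs are in use and every task uses exactly 1 CPU, counting idle workers against available CPUs gives the same result as the old total-workers-versus-total-CPUs check, which is why the two policies coincide in that case.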
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.