
[core] Fix issues with worker churn in WorkerPool #36766

Merged: 11 commits into ray-project:master from fix-worker-pool on Jun 27, 2023

Conversation

@stephanie-wang (Contributor) commented on Jun 23, 2023

Why are these changes needed?

#36669 lists some issues found with workers not being reused. This turns out to have two root causes:

  • We use the total number of CPUs as a soft limit for the total number of worker processes allowed. This leads to worker churn when some processes are actively using resources but less than 1 CPU (e.g., actors), while other tasks are trying to use the remaining cores. We will always end up over the soft limit, so we have to keep killing and restarting workers. For example, on an 8-CPU node, two actors each holding 0.5 CPU leave 7 CPUs free; the two actor workers plus seven 1-CPU task workers make nine worker processes, one over the soft limit of eight.
  • (less serious) Workers are usually slow on their first task. Workers are selected for tasks in LIFO order and currently a worker that has just been started up is inserted last, which leads to slowdown on the next task. This isn't serious on its own, but becomes a performance issue when there is worker churn, and we end up killing a warmed up idle worker instead of the cold one.

This PR makes some changes to improve the idle worker killing:

  • Use the available CPUs as a limit for the number of idle workers allowed, instead of total CPUs as a limit for total workers allowed. When no CPUs are being used and/or all tasks use exactly 1 CPU, the new policy is equivalent to the old one (a rough sketch of the new policy follows this list).
  • The num_workers_soft_limit config override option is now used as a soft limit for idle workers instead of total workers.
  • Workers that were just started are now inserted at the beginning of the idle worker queue so that they are prioritized to be killed over warmed up workers. They will be kept alive for at least the idle timeout as usual.
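
The following is a minimal Python sketch of the intended policy, not the actual C++ WorkerPool implementation; the class and method names (`IdleWorkerPolicy`, `on_worker_started`, `workers_to_kill`, etc.) are illustrative only:

```python
from collections import deque


class IdleWorkerPolicy:
    """Illustrative model of the idle-worker capping described above.

    A sketch under assumptions: workers are opaque objects and the caller
    supplies CPU availability; the real logic lives in the C++ WorkerPool.
    """

    def __init__(self, num_workers_soft_limit=-1):
        # -1 mirrors the config default: fall back to available CPUs.
        self.num_workers_soft_limit = num_workers_soft_limit
        # Front = first to kill, back = first to reuse (LIFO reuse).
        self.idle_workers = deque()

    def on_worker_started(self, worker):
        # A freshly started (cold) worker goes to the front: it is
        # deprioritized for reuse and prioritized for killing.
        self.idle_workers.appendleft(worker)

    def on_worker_idle(self, worker):
        # A warmed-up worker that just finished a task goes to the back,
        # so LIFO selection reuses it first and kills it last.
        self.idle_workers.append(worker)

    def idle_soft_limit(self, available_cpus):
        # New policy: cap *idle* workers by *available* CPUs (or by the
        # explicit override), instead of capping total workers by total CPUs.
        if self.num_workers_soft_limit >= 0:
            return self.num_workers_soft_limit
        return available_cpus

    def workers_to_kill(self, available_cpus, idle_timeout_expired):
        # Kill only the excess over the limit, front (coldest) first,
        # and only workers whose idle timeout has already expired.
        excess = len(self.idle_workers) - self.idle_soft_limit(available_cpus)
        victims = []
        for worker in self.idle_workers:
            if excess <= 0:
                break
            if idle_timeout_expired(worker):
                victims.append(worker)
                excess -= 1
        return victims
```

When nothing is running, `available_cpus` equals the total CPUs, so the idle cap matches the old total-worker cap; when fractional-CPU processes hold some worker slots, the cap shrinks on the idle side instead of repeatedly pushing the busy-plus-idle total over the limit.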

Related issue number

Closes #36669.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@ericl (Contributor) left a comment:

LGTM. It's probably worth re-running the nightly tests prior to merging.

python/ray/tests/test_worker_capping.py (outdated review thread, resolved)
/// The soft limit of the number of workers to keep around.
/// We apply this limit to the idle workers instead of total workers,
/// because the total number of workers used depends on the
/// application. -1 means using the available number of CPUs.
RAY_CONFIG(int64_t, num_workers_soft_limit, -1)
Suggested change
RAY_CONFIG(int64_t, num_workers_soft_limit, -1)
RAY_CONFIG(int64_t, num_idle_workers_soft_limit, -1)
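
For illustration, a hedged sketch of how this override could be set from Python when starting Ray; `_system_config` is an internal override knob commonly used in Ray's own tests, the value 4 is arbitrary, and the key name follows the current definition above (it would change if the suggested rename lands):

```python
import ray

# Sketch only: _system_config is an internal/unstable override and must be
# passed when the head node starts; the value here is illustrative.
ray.init(
    _system_config={
        # After this PR: soft limit on the number of *idle* workers kept around.
        # The default of -1 means "use the number of available CPUs".
        "num_workers_soft_limit": 4,
    }
)
```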

@ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Jun 23, 2023
@stephanie-wang (Contributor, Author) commented:
Running nightly tests here.

@stephanie-wang merged commit 24657be into ray-project:master on Jun 27, 2023
@stephanie-wang deleted the fix-worker-pool branch on June 27, 2023 16:35
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Labels: @author-action-required (The PR author is responsible for the next step. Remove tag to send back to the reviewer.), Ray 2.6
Successfully merging this pull request may close these issues:

  • [core] Ray is not reusing existing workers (#36669)

5 participants