
[core] Fix issues with worker churn in WorkerPool #36766

Merged: 11 commits into ray-project:master from fix-worker-pool on Jun 27, 2023

Conversation

@stephanie-wang (Contributor) commented on Jun 23, 2023

Why are these changes needed?

#36669 lists some issues found with workers not being reused. This turns out to have two root causes:

  • We use the total number of CPUs as a soft limit for the total number of worker processes allowed. This leads to worker churn when some processes are actively using resources but less than 1 CPU (e.g., actors), while other tasks are trying to use the remaining cores. We will always end up over the soft limit, so we have to keep killing and restarting workers. For example, on an 8-CPU node, two actors each holding 0.5 CPU leave 7 CPUs free; the two actor workers plus seven 1-CPU task workers make nine worker processes, one over the soft limit of eight.
  • (less serious) Workers are usually slow on their first task. Workers are selected for tasks in LIFO order and currently a worker that has just been started up is inserted last, which leads to slowdown on the next task. This isn't serious on its own, but becomes a performance issue when there is worker churn, and we end up killing a warmed up idle worker instead of the cold one.

This PR makes some changes to improve the idle worker killing:

  • Use the available CPUs as a limit for the number of idle workers allowed, instead of total CPUs as a limit for total workers allowed. When no CPUs are being used and/or all tasks use exactly 1 CPU, the new policy is equivalent to the old one (a rough sketch of the new policy follows this list).
  • The num_workers_soft_limit config override option is now used as a soft limit for idle workers instead of total workers.
  • Workers that were just started are now inserted at the beginning of the idle worker queue so that they are prioritized to be killed over warmed up workers. They will be kept alive for at least the idle timeout as usual.
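
The following is a minimal Python sketch of the intended policy, not the actual C++ WorkerPool implementation; the class and method names (`IdleWorkerPolicy`, `on_worker_started`, `workers_to_kill`, etc.) are illustrative only:

```python
from collections import deque


class IdleWorkerPolicy:
    """Illustrative model of the idle-worker capping described above.

    A sketch under assumptions: workers are opaque objects and the caller
    supplies CPU availability; the real logic lives in the C++ WorkerPool.
    """

    def __init__(self, num_workers_soft_limit=-1):
        # -1 mirrors the config default: fall back to available CPUs.
        self.num_workers_soft_limit = num_workers_soft_limit
        # Front = first to kill, back = first to reuse (LIFO reuse).
        self.idle_workers = deque()

    def on_worker_started(self, worker):
        # A freshly started (cold) worker goes to the front: it is
        # deprioritized for reuse and prioritized for killing.
        self.idle_workers.appendleft(worker)

    def on_worker_idle(self, worker):
        # A warmed-up worker that just finished a task goes to the back,
        # so LIFO selection reuses it first and kills it last.
        self.idle_workers.append(worker)

    def idle_soft_limit(self, available_cpus):
        # New policy: cap *idle* workers by *available* CPUs (or by the
        # explicit override), instead of capping total workers by total CPUs.
        if self.num_workers_soft_limit >= 0:
            return self.num_workers_soft_limit
        return available_cpus

    def workers_to_kill(self, available_cpus, idle_timeout_expired):
        # Kill only the excess over the limit, front (coldest) first,
        # and only workers whose idle timeout has already expired.
        excess = len(self.idle_workers) - self.idle_soft_limit(available_cpus)
        victims = []
        for worker in self.idle_workers:
            if excess <= 0:
                break
            if idle_timeout_expired(worker):
                victims.append(worker)
                excess -= 1
        return victims
```

When nothing is running, `available_cpus` equals the total CPUs, so the idle cap matches the old total-worker cap; when fractional-CPU processes hold some worker slots, the cap shrinks on the idle side instead of repeatedly pushing the busy-plus-idle total over the limit.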

Related issue number

Closes #36669.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@ericl (Contributor) left a comment:

LGTM. It's probably worth re-running the nightly tests prior to merging.

python/ray/tests/test_worker_capping.py (outdated review thread, resolved)
/// The soft limit of the number of workers to keep around.
/// We apply this limit to the idle workers instead of total workers,
/// because the total number of workers used depends on the
/// application. -1 means using the available number of CPUs.
RAY_CONFIG(int64_t, num_workers_soft_limit, -1)
Suggested change
RAY_CONFIG(int64_t, num_workers_soft_limit, -1)
RAY_CONFIG(int64_t, num_idle_workers_soft_limit, -1)
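
For illustration, a hedged sketch of how this override could be set from Python when starting Ray; `_system_config` is an internal override knob commonly used in Ray's own tests, the value 4 is arbitrary, and the key name follows the current definition above (it would change if the suggested rename lands):

```python
import ray

# Sketch only: _system_config is an internal/unstable override and must be
# passed when the head node starts; the value here is illustrative.
ray.init(
    _system_config={
        # After this PR: soft limit on the number of *idle* workers kept around.
        # The default of -1 means "use the number of available CPUs".
        "num_workers_soft_limit": 4,
    }
)
```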

@ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Jun 23, 2023
@stephanie-wang (Contributor, Author) commented:
Running nightly tests here.

@stephanie-wang merged commit 24657be into ray-project:master on Jun 27, 2023
@stephanie-wang deleted the fix-worker-pool branch on June 27, 2023 16:35
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Labels: @author-action-required (The PR author is responsible for the next step. Remove tag to send back to the reviewer.), Ray 2.6
Successfully merging this pull request may close these issues:

  • [core] Ray is not reusing existing workers (#36669)

5 participants