
[core] fifo worker killing policy #33430

Merged: 15 commits merged into ray-project:master on Mar 19, 2023
Conversation

clarng (Contributor) commented Mar 18, 2023

Why are these changes needed?

For long-lived, memory-leaking actors, it is more desirable to kill the oldest task, which is the one leaking the most. This avoids the situation where we constantly kill and restart actors, which can have side effects such as generating a lot of log files or increasing memory consumption in the GCS / dashboard.
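
For reference, a minimal sketch of how this policy is enabled, mirroring the _system_config keys used in the test later in this thread; the threshold value here is illustrative, not a recommendation:

    import ray

    # Opt in to the retriable FIFO worker killing policy.
    # "worker_killing_policy" and "memory_usage_threshold" match the keys used in
    # the test in this PR; the 0.9 threshold is only an example value.
    ray.init(
        _system_config={
            "worker_killing_policy": "retriable_fifo",
            "memory_usage_threshold": 0.9,
        }
    )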

Related issue number

https://github.com/anyscale/product/issues/18727
https://github.com/anyscale/product/issues/18728

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

clarng and others added 13 commits February 9, 2023 04:58
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
clarng requested a review from a team March 18, 2023 06:02
clarng marked this pull request as ready for review March 18, 2023 06:07
int left_retriable =
    left->GetAssignedTask().GetTaskSpecification().IsRetriable() ? 0 : 1;
int right_retriable =
    right->GetAssignedTask().GetTaskSpecification().IsRetriable() ? 0 : 1;
if (left_retriable == right_retriable) {
  // FIFO: the worker with the earliest assigned task sorts first and is killed first.
  return left->GetAssignedTaskTime() < right->GetAssignedTaskTime();
}
Contributor
Is this the only change compared to LIFO?

clarng (Contributor, Author)
Yes. There is room to refactor; we need to figure out the right long-term solution and productize that code.
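
For illustration only (this is not the actual C++ code path), a small Python sketch of the ordering difference between the two policies, assuming each candidate worker exposes a retriable flag and an assigned-task time:

    # Hypothetical worker records; field names are for illustration only.
    def fifo_victim_order(workers):
        # Retriable workers are preferred victims; among them, the earliest-assigned
        # task is killed first (the behavior added in this PR).
        return sorted(workers, key=lambda w: (not w.retriable, w.assigned_task_time))

    def lifo_victim_order(workers):
        # Same retriable preference, but the most recently assigned task is killed first.
        return sorted(workers, key=lambda w: (not w.retriable, -w.assigned_task_time))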

clarng added 2 commits March 18, 2023 20:40
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
scv119 merged commit 872896f into ray-project:master Mar 19, 2023
cadedaniel (Member)
This is failing on master, even though it passes in this PR. I'm not sure why...

cadedaniel (Member)
cc @jjyao, what do you think about this? Chen and Clarence are out today.

=================================== FAILURES ===================================
_________________ test_one_actor_max_fifo_kill_previous_actor __________________
shutdown_only = None
    @pytest.mark.skipif(
        sys.platform != "linux" and sys.platform != "linux2",
        reason="memory monitor only on linux currently",
    )
    def test_one_actor_max_fifo_kill_previous_actor(shutdown_only):
        with ray.init(
            _system_config={
                "worker_killing_policy": "retriable_fifo",
                "memory_usage_threshold": 0.4,
            },
        ):
            bytes_to_alloc = get_additional_bytes_to_reach_memory_usage_pct(0.3)
    
            first_actor = Leaker.options(name="first_actor").remote()
            ray.get(first_actor.allocate.remote(bytes_to_alloc))
    
            actors = ray.util.list_named_actors()
            assert len(actors) == 1
            assert "first_actor" in actors
    
            second_actor = Leaker.options(name="second_actor").remote()
            ray.get(second_actor.allocate.remote(bytes_to_alloc))
    
            actors = ray.util.list_named_actors()
>           assert len(actors) == 1
E           assert 2 == 1
E             +2
E             -1
python/ray/tests/test_memory_pressure.py:540: AssertionError

clarng (Contributor, Author) commented Mar 20, 2023 via email

cadedaniel (Member)

gunna revert now

cadedaniel added a commit that referenced this pull request Mar 20, 2023
rkooo567 pushed a commit that referenced this pull request Mar 20, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
clarng mentioned this pull request Mar 21, 2023
rkooo567 pushed a commit that referenced this pull request Mar 21, 2023
This fixes the test failure introduced in #33430 by adding a sleep to give the memory monitor time to kick in, and by increasing the memory limit since the node may already be using a lot of memory in the first place.
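
A rough sketch of what that adjustment could look like against the failing test above; the sleep duration and threshold bump are illustrative, not the exact values from the follow-up PR:

    import time

    # After the second actor allocates past the threshold, give the memory monitor
    # time to act before asserting that only one actor is left. (The follow-up also
    # raises memory_usage_threshold so baseline node usage does not trip it early.)
    ray.get(second_actor.allocate.remote(bytes_to_alloc))
    time.sleep(5)  # illustrative wait; let the monitor kill the first (oldest) actor

    actors = ray.util.list_named_actors()
    assert len(actors) == 1
    assert "second_actor" in actors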
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…roject#33480)

This reverts commit 872896f.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
clarng added a commit to clarng/ray that referenced this pull request Mar 23, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
clarng pushed a commit to clarng/ray that referenced this pull request Mar 23, 2023
chaowanggg pushed a commit to chaowanggg/ray-dev that referenced this pull request Apr 4, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: chaowang <chaowang@anyscale.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: elliottower <elliot@elliottower.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…roject#33480)

This reverts commit 872896f.

Signed-off-by: elliottower <elliot@elliottower.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…roject#33480)

This reverts commit 872896f.

Signed-off-by: Jack He <jackhe2345@gmail.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023

Signed-off-by: Jack He <jackhe2345@gmail.com>