
[core] fifo worker killing policy #33430

Merged: 15 commits merged into ray-project:master on Mar 19, 2023
Conversation

clarng (Contributor) commented Mar 18, 2023

Why are these changes needed?

For long-lived, memory-leaking actors, it is more desirable to kill the oldest task, which is the one leaking the most. This avoids the situation where we constantly kill and restart actors, which can have side effects such as generating a lot of log files or increasing memory consumption in the GCS / dashboard.
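
For reference, a minimal sketch of how this policy is enabled, mirroring the _system_config keys used in the test later in this thread; the threshold value here is illustrative, not a recommendation:

    import ray

    # Opt in to the retriable FIFO worker killing policy.
    # "worker_killing_policy" and "memory_usage_threshold" match the keys used in
    # the test in this PR; the 0.9 threshold is only an example value.
    ray.init(
        _system_config={
            "worker_killing_policy": "retriable_fifo",
            "memory_usage_threshold": 0.9,
        }
    )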

Related issue number

https://github.com/anyscale/product/issues/18727
https://github.com/anyscale/product/issues/18728

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

clarng and others added 13 commits February 9, 2023 04:58
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
clarng requested a review from a team March 18, 2023 06:02
clarng marked this pull request as ready for review March 18, 2023 06:07
int left_retriable =
    left->GetAssignedTask().GetTaskSpecification().IsRetriable() ? 0 : 1;
int right_retriable =
    right->GetAssignedTask().GetTaskSpecification().IsRetriable() ? 0 : 1;
if (left_retriable == right_retriable) {
  // FIFO: the worker with the earliest assigned task sorts first and is killed first.
  return left->GetAssignedTaskTime() < right->GetAssignedTaskTime();
}
Contributor
Is this the only change compared to LIFO?

clarng (Contributor, Author)
Yes. There is room to refactor; we need to figure out the right long-term solution and productize that code.
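
For illustration only (this is not the actual C++ code path), a small Python sketch of the ordering difference between the two policies, assuming each candidate worker exposes a retriable flag and an assigned-task time:

    # Hypothetical worker records; field names are for illustration only.
    def fifo_victim_order(workers):
        # Retriable workers are preferred victims; among them, the earliest-assigned
        # task is killed first (the behavior added in this PR).
        return sorted(workers, key=lambda w: (not w.retriable, w.assigned_task_time))

    def lifo_victim_order(workers):
        # Same retriable preference, but the most recently assigned task is killed first.
        return sorted(workers, key=lambda w: (not w.retriable, -w.assigned_task_time))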

clarng added 2 commits March 18, 2023 20:40
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
scv119 merged commit 872896f into ray-project:master Mar 19, 2023
cadedaniel (Member)
This is failing on master, even though it passes in this PR. I'm not sure why...

cadedaniel (Member)
cc @jjyao, what do you think about this? Chen and Clarence are out today.

=================================== FAILURES ===================================
_________________ test_one_actor_max_fifo_kill_previous_actor __________________
shutdown_only = None
    @pytest.mark.skipif(
        sys.platform != "linux" and sys.platform != "linux2",
        reason="memory monitor only on linux currently",
    )
    def test_one_actor_max_fifo_kill_previous_actor(shutdown_only):
        with ray.init(
            _system_config={
                "worker_killing_policy": "retriable_fifo",
                "memory_usage_threshold": 0.4,
            },
        ):
            bytes_to_alloc = get_additional_bytes_to_reach_memory_usage_pct(0.3)
    
            first_actor = Leaker.options(name="first_actor").remote()
            ray.get(first_actor.allocate.remote(bytes_to_alloc))
    
            actors = ray.util.list_named_actors()
            assert len(actors) == 1
            assert "first_actor" in actors
    
            second_actor = Leaker.options(name="second_actor").remote()
            ray.get(second_actor.allocate.remote(bytes_to_alloc))
    
            actors = ray.util.list_named_actors()
>           assert len(actors) == 1
E           assert 2 == 1
E             +2
E             -1
python/ray/tests/test_memory_pressure.py:540: AssertionError

clarng (Contributor, Author) commented Mar 20, 2023 via email

cadedaniel (Member)

gunna revert now

cadedaniel added a commit that referenced this pull request Mar 20, 2023
rkooo567 pushed a commit that referenced this pull request Mar 20, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
clarng mentioned this pull request Mar 21, 2023
rkooo567 pushed a commit that referenced this pull request Mar 21, 2023
This fixes the test failure introduced in #33430 by adding a sleep to give the memory monitor time to kick in, and by increasing the memory limit since the node may already be using a lot of memory in the first place.
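
A rough sketch of what that adjustment could look like against the failing test above; the sleep duration and threshold bump are illustrative, not the exact values from the follow-up PR:

    import time

    # After the second actor allocates past the threshold, give the memory monitor
    # time to act before asserting that only one actor is left. (The follow-up also
    # raises memory_usage_threshold so baseline node usage does not trip it early.)
    ray.get(second_actor.allocate.remote(bytes_to_alloc))
    time.sleep(5)  # illustrative wait; let the monitor kill the first (oldest) actor

    actors = ray.util.list_named_actors()
    assert len(actors) == 1
    assert "second_actor" in actors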
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…roject#33480)

This reverts commit 872896f.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
clarng added a commit to clarng/ray that referenced this pull request Mar 23, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
clarng pushed a commit to clarng/ray that referenced this pull request Mar 23, 2023
chaowanggg pushed a commit to chaowanggg/ray-dev that referenced this pull request Apr 4, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: chaowang <chaowang@anyscale.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: elliottower <elliot@elliottower.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…roject#33480)

This reverts commit 872896f.

Signed-off-by: elliottower <elliot@elliottower.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…roject#33480)

This reverts commit 872896f.

Signed-off-by: Jack He <jackhe2345@gmail.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023

Signed-off-by: Jack He <jackhe2345@gmail.com>