[core] fifo worker killing policy #33430
Conversation
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
int right_retriable =
    right->GetAssignedTask().GetTaskSpecification().IsRetriable() ? 0 : 1;
if (left_retriable == right_retriable) {
  return left->GetAssignedTaskTime() < right->GetAssignedTaskTime();
is this the only change compared to LIFO?
yes, there is room to refactor; we need to figure out the right long-term solution and productize that code
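For readers comparing the two policies, here is a minimal Python sketch of the ordering the comparator above implements. The real policy lives in C++ in the raylet; WorkerInfo and the key functions below are hypothetical stand-ins for illustration only. Retriable workers are preferred as victims, and ties are broken by the oldest assigned-task time under FIFO versus the newest under LIFO, which is the single comparison the review comment asks about.

# Hypothetical Python sketch of the victim ordering; not Ray's actual code.
from dataclasses import dataclass

@dataclass
class WorkerInfo:
    is_retriable: bool
    assigned_task_time: float  # timestamp when the task was assigned

def retriable_fifo_key(w: WorkerInfo):
    # Retriable workers (0) are preferred victims over non-retriable (1);
    # ties are broken by the OLDEST assigned task (FIFO).
    return (0 if w.is_retriable else 1, w.assigned_task_time)

def retriable_lifo_key(w: WorkerInfo):
    # Same retriability grouping; ties broken by the NEWEST task (LIFO).
    return (0 if w.is_retriable else 1, -w.assigned_task_time)

workers = [
    WorkerInfo(is_retriable=True, assigned_task_time=100.0),
    WorkerInfo(is_retriable=True, assigned_task_time=200.0),
    WorkerInfo(is_retriable=False, assigned_task_time=50.0),
]
victim = min(workers, key=retriable_fifo_key)
print(victim)  # the retriable worker with the oldest task (t=100.0)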
This is failing on master, even though it passes in this PR. I'm not sure why...
cc @jjyao , what do you think about this? Chen and Clarence are out today.

=================================== FAILURES ===================================
_________________ test_one_actor_max_fifo_kill_previous_actor __________________
shutdown_only = None
@pytest.mark.skipif(
sys.platform != "linux" and sys.platform != "linux2",
reason="memory monitor only on linux currently",
)
def test_one_actor_max_fifo_kill_previous_actor(shutdown_only):
with ray.init(
_system_config={
"worker_killing_policy": "retriable_fifo",
"memory_usage_threshold": 0.4,
},
):
bytes_to_alloc = get_additional_bytes_to_reach_memory_usage_pct(0.3)
first_actor = Leaker.options(name="first_actor").remote()
ray.get(first_actor.allocate.remote(bytes_to_alloc))
actors = ray.util.list_named_actors()
assert len(actors) == 1
assert "first_actor" in actors
second_actor = Leaker.options(name="second_actor").remote()
ray.get(second_actor.allocate.remote(bytes_to_alloc))
actors = ray.util.list_named_actors()
> assert len(actors) == 1
E assert 2 == 1
E +2
E -1
python/ray/tests/test_memory_pressure.py:540: AssertionError
Will take a look
gonna revert now
This reverts commit 872896f.
This fixes the test failure introduced in #33430: it adds a sleep to give the memory monitor time to kick in, and increases the memory limit since the node may already be using a lot of memory in the first place.
Why are these changes needed?
For long-lived, memory-leaking actors, it is more desirable to kill the oldest task that is leaking the most. This avoids the situation where we constantly kill actors, which may lead to side effects such as generating a lot of log files or triggering increased memory consumption in the GCS / dashboard.
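As a concrete illustration, the policy can be enabled through _system_config, the same way the test in this PR does. A minimal sketch follows; note that _system_config is an internal knob and the threshold value here is illustrative rather than a recommendation.

import ray

ray.init(
    _system_config={
        # Prefer killing retriable workers; among those, oldest task first.
        "worker_killing_policy": "retriable_fifo",
        # Fraction of node memory at which the memory monitor intervenes.
        "memory_usage_threshold": 0.9,
    },
)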
Related issue number
https://github.com/anyscale/product/issues/18727
https://github.com/anyscale/product/issues/18728
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.