[Ray Core] The node storing the actor will be killed unexpectedly when autoscaler is turned on #46172

yx367563 · 2024-06-21T01:33:11Z

What happened + What you expected to happen

I create an actor to synchronise some data between workers. If the data in the actor is not updated for a period of time, the autoscaler may decide that the node hosting the actor is idle and delete it. This doesn't meet my expectations, because I still hold a handle to the actor.
Expected behavior: if there is still an actor alive on a node, it should not be deleted.

Versions / Dependencies

Ray 2.23.0

Reproduction script

import time
import ray

@ray.remote
class Cache:
    def __init__(self):
        self.cache = {}

    def put(self, x, y):
        self.cache[x] = y

    def get(self, x):
        return self.cache.get(x)

@ray.remote
def ray_func(global_cache):
    test_count = ray.get(global_cache.get.remote("test_count"))
    test_count += 1
    ray.get(global_cache.put.remote("test_count", test_count))
    return

def test_autoscaler_actor():
    ray.init()

    global_cache = Cache.remote()
    ray.get(global_cache.put.remote("test_count", 0))  # Initialise the counter stored in the actor

    future = ray_func.remote(global_cache) # First allocated to Node1
    ray.get(future)
    time.sleep(120)

    test_count = ray.get(global_cache.get.remote("test_count"))
    print("test_count = {:d}".format(test_count))

if __name__ == '__main__':
    test_autoscaler_actor()

The above Python script reliably reproduces the bug: after time.sleep(120), it fails with "The actor is dead because its node has died." The configured idleTimeoutSeconds is 60.

Issue Severity

High: It blocks me from completing my task.


jjyao · 2024-07-01T21:24:58Z

This is because, by default, an actor doesn't use any resources, so the node is considered idle. To work around it, do:

@ray.remote(num_cpus=1)
class Cache:
    ...  # same body as in the reproduction script

Then the node won't be marked as idle, since 1 CPU resource is in use.
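For reference, the workaround applied to the Cache class from the reproduction script might look like the sketch below. The helper name create_cache_actor is hypothetical and not part of Ray; the sketch only assumes that ray.remote accepts num_cpus as in the snippet above. Ray is imported lazily so the plain class can still be used without a cluster.

```python
class Cache:
    """Plain key-value store; wrapped as a Ray actor by the helper below."""

    def __init__(self):
        self.cache = {}

    def put(self, x, y):
        self.cache[x] = y

    def get(self, x):
        return self.cache.get(x)


def create_cache_actor(num_cpus=1):
    """Hypothetical helper: start Cache as an actor that reserves `num_cpus` CPUs."""
    import ray  # imported lazily so the plain class works without Ray installed

    # ray.remote(num_cpus=...) returns a decorator; applying it to the class
    # gives an actor class whose instances hold that CPU while they are alive.
    actor_cls = ray.remote(num_cpus=num_cpus)(Cache)
    return actor_cls.remote()
```

In the reproduction script, global_cache = Cache.remote() would become global_cache = create_cache_actor() after ray.init(). The trade-off is that the reserved CPU is no longer available to other tasks or actors for as long as the actor lives.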

yx367563 · 2024-07-02T01:06:17Z

@jjyao Thank you for your answer! I think this solves my problem very well.

yx367563 added the bug and triage labels Jun 21, 2024

anyscalesam added the core label Jun 21, 2024

jjyao added the P1, core-autoscaler, and P2 labels and removed the triage and P1 labels Jul 1, 2024

jjyao added the P3 label and removed the P2 label Oct 30, 2024

kevin85421 mentioned this issue Dec 11, 2024

[Umbrella] Autoscaler improvements ray-project/kuberay#2600

Open

28 tasks
