[Ray Core] The node hosting an actor is killed unexpectedly when the autoscaler is turned on #46172

Open
Tracked by #2600
yx367563 opened this issue Jun 21, 2024 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
P3: Issue moderate in impact or severity

Comments

@yx367563

What happened + What you expected to happen

The scenario: an actor is created to synchronise some data between workers. If the data in the actor is not updated for a period of time, the node on which the actor resides may be judged idle and deleted by the autoscaler. This is not what I expect, because I still hold the actor's handle.
Expected behavior: if an actor is still alive on a node, the node should not be deleted.

Versions / Dependencies

Ray 2.23.0

Reproduction script

import time
import ray

@ray.remote
class Cache:
    def __init__(self):
        self.cache = {}

    def put(self, x, y):
        self.cache[x] = y

    def get(self, x):
        return self.cache.get(x)

@ray.remote
def ray_func(global_cache):
    test_count = ray.get(global_cache.get.remote("test_count"))
    test_count += 1
    ray.get(global_cache.put.remote("test_count", test_count))
    return

def test_autoscaler_actor():
    ray.init()

    global_cache = Cache.remote()
    ray.get(global_cache.put.remote("test_count", 0))  # Initialise the counter stored in the actor

    future = ray_func.remote(global_cache) # First allocated to Node1
    ray.get(future)
    time.sleep(120)

    test_count = ray.get(global_cache.get.remote("test_count"))
    print("test_count = {:d}".format(test_count))

if __name__ == '__main__':
    test_autoscaler_actor()

The above Python script reproduces the bug reliably: after time.sleep(120), the final ray.get call fails with "The actor is dead because its node has died." The configured idleTimeoutSeconds is 60.

Issue Severity

High: It blocks me from completing my task.

@yx367563 yx367563 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 21, 2024
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label Jun 21, 2024
@jjyao
Collaborator

jjyao commented Jul 1, 2024

This is because, by default, an actor doesn't use any resources, so the node is considered idle. To work around it, do:

@ray.remote(num_cpus=1)
class Cache:
    ...  # body unchanged from the reproduction script

Then the node won't be marked as idle, since one CPU resource is in use.
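
The reasoning above can be sketched with a toy model. It assumes the autoscaler treats a node as idle only while none of its resources are in use, and removes it once that idle period reaches the timeout. ToyNode and survives_sleep are hypothetical names made up for illustration, not Ray APIs, and the real idle check is likely more involved. The 60 and 120 second values come from the report.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical stand-in for a worker node; not a Ray class."""
    used_cpus: float = 0.0
    idle_seconds: float = 0.0

    def place_actor(self, num_cpus: float) -> None:
        # The actor holds these CPUs for its whole lifetime.
        self.used_cpus += num_cpus

    def tick(self, seconds: float) -> None:
        # Assumption: the idle clock only advances while no resources are in use.
        if self.used_cpus == 0:
            self.idle_seconds += seconds
        else:
            self.idle_seconds = 0.0

    def removable(self, idle_timeout_s: float) -> bool:
        return self.idle_seconds >= idle_timeout_s


def survives_sleep(actor_cpus: float,
                   idle_timeout_s: float = 60,
                   sleep_s: float = 120) -> bool:
    """Return True if the toy node is still kept after sleeping sleep_s seconds."""
    node = ToyNode()
    node.place_actor(actor_cpus)
    node.tick(sleep_s)
    return not node.removable(idle_timeout_s)


print(survives_sleep(actor_cpus=0))  # actor holding no CPU -> False
print(survives_sleep(actor_cpus=1))  # actor declared with num_cpus=1 -> True
```

In this model the actor that holds no CPU leaves the node looking idle for the full 120 seconds, which exceeds the 60 second timeout, while the actor that holds one CPU keeps the idle clock at zero.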

@jjyao jjyao added P1 Issue that should be fixed within a few weeks core-autoscaler autoscaler related issues P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) P1 Issue that should be fixed within a few weeks labels Jul 1, 2024
@yx367563
Author

yx367563 commented Jul 2, 2024

@jjyao Thank you for your answer! I think this solves my problem very well.

@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024