Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] On GCS restart, destroy not forget the unused workers. Fixing PG leaks. #45854

Merged
merged 7 commits into from
Jun 13, 2024

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented Jun 10, 2024

On GCS restart, it prunes any worker it did not bookkeep, via the RPC ReleaseUnusedWorkers. Raylet receives it and removes any workers not in a worker_ids_in_use list. However, it's only forgetting the worker by removing it from the leased_workers_ not killing the worker, but the worker can be still in running and never gets killed, hence leaked.

The evidence leading to this PR is, in a Ray cluster I observed GCS failed to kill a placement group: it sends "pls remove all workers in this PG", and raylet did not kill any workers, and then it finds the PG is still in used and replied "failed, pls retry" for many minutes, and finally GCS gave up and the PG is removed from the GCS, and the worker leaked. I found the PG holding worker is NOT in leased_workers_, but it IS known by worker_pool, since it shows up in worker pool's state dump.

The reason behind the above case is, when GCS is restarting, it wrongly thinks the worker is no longer needed (due to another bug (#45791 actors whose parent is detached actor is considered dead but should be alive), and orders Raylet to release it. Raylet, as fixed in this PR, simply forgets the worker but does not kill it.

To make things more clear, here is a timeline:

  1. Creates a detached actor Parent, which creates a PG pg and a normal actor Child in pg
  2. Job for Parent dies. Everyone should live on, since everyone's lifetime = detached Parent's
  3. GCS restarts
  4. BUG1 (Fixed by [core] On GCS restart, load actors whose job is dead but root detached actor is alive. #45791): GCS thinks Child should be die, and,
    • did not load it (it's forgotten by GCS). This makes the child is never killed later
      • should be: load it, can kill it later
    • sends ReleaseUnusedWorkers to Raylet with this worker as "not used"
      • should be: mark this as "used" so it's not killed by Raylet
  5. BUG2 (this bug): Raylet should kill the worker since it's "not used" by GCS. However it simply forgets it.
    • this makes the worker leak on Raylet side: later when it needs to be killed (by PG removal), it can not.
    • should be: kill it (DestroyWorker) on the RPC, so we could have exposed the first bug earlier.
  6. Because of BUG1, when the detached actor Parent is dead, nobody is killing Child actor. (forgot by GCS)
  7. Because of BUG2, when the PG is removed since Parent is dead, the PG removal failed with a worker leak (forgot by raylet)

Note since FIX1 (#45791) is merged, Raylet should no longer kill the detached actor on RPC because it's considered used. This FIX2 is just for the sake of completeness, and for future leaks.

In this PR:

  • On ReleaseUnusedWorkers, instead of forgetting the workers we call DestroyWorker to kill it
  • In state dump, print the actors' worker ID for debugging ease.
  • Removed useless ReleaseWorkers method
  • Comments
  • Renamed RPC to ReleaseUnusedActorWorkers to make it dead clear.

Fixes #45598.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang requested a review from a team as a code owner June 11, 2024 18:49
@rynewang rynewang force-pushed the destroy-unused-workers branch 2 times, most recently from 9585d20 to 547658e Compare June 11, 2024 21:42
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Lets add some tests

src/ray/raylet/node_manager.cc Outdated Show resolved Hide resolved
src/ray/raylet/node_manager.h Outdated Show resolved Hide resolved
src/ray/raylet/node_manager.h Outdated Show resolved Hide resolved
rynewang and others added 4 commits June 12, 2024 10:38
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jun 13, 2024
@jjyao jjyao merged commit 49e452e into ray-project:master Jun 13, 2024
7 checks passed
@rynewang rynewang deleted the destroy-unused-workers branch June 13, 2024 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Removed Placement group did not kill actors, if its ancestors are from detached actors.
2 participants