
[Core][Compiled Graph] Fix shutdown error #48280

Merged · 8 commits · Oct 27, 2024

Conversation

rkooo567
Contributor

@rkooo567 rkooo567 commented Oct 26, 2024

Why are these changes needed?

Fixes the remaining shutdown issues. It also fixes some failures in the nightly that we accidentally merged.

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 added the go add ONLY when ready to merge, run all tests label Oct 26, 2024
self.wait_teardown(kill_actors)
return
# self.wait_teardown(kill_actors)
while True:
rkooo567 (Contributor, Author):
This was the root cause of the segfault.

I think a dangling monitor thread was calling this API and waiting on ray.get, which triggered a segfault when the driver shut down.

Member:
calling this API and waiting for ray.get, which is triggering segfault when the driver is shutdown

  1. Why would this function be called after the driver process has shut down?
  2. This function still calls ray.get and self.wait_teardown() below. Why does the issue disappear?

rkooo567 (Contributor, Author):

  1. It is not called "after" shutdown. When shutdown happens, two threads (main and monitor) enter wait_teardown, which calls ray.get. Once the main thread's call finishes, shutdown completes and the next test starts. If the monitor thread's call has not finished by then, that dangling thread is still waiting on ray.get, which crashes because Ray has shut down and the core worker no longer exists.
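The race described above can be sketched in plain Python. This is an illustrative toy, not Ray's actual implementation: `CompiledDAGLike`, `wait_teardown`, and `blocking_calls` are made-up names, and the `blocking_calls` counter stands in for the blocking `ray.get()` calls in the real teardown path.

```python
import threading

class CompiledDAGLike:
    """Toy model: two threads (main and a monitor) can both reach
    teardown during shutdown. An _is_teardown guard lets only the first
    caller block in wait_teardown; later callers return immediately
    instead of waiting on a ray.get() against a dead core worker."""

    def __init__(self):
        self._is_teardown = False
        self._lock = threading.Lock()
        self.blocking_calls = 0

    def wait_teardown(self):
        # Stand-in for the blocking ray.get() calls in the real teardown.
        self.blocking_calls += 1

    def teardown(self):
        with self._lock:
            if self._is_teardown:
                return  # idempotent: later callers do nothing
            self._is_teardown = True
        self.wait_teardown()

dag = CompiledDAGLike()
monitor = threading.Thread(target=dag.teardown)
monitor.start()
monitor.join()
dag.teardown()  # main thread: no-op, does not block a second time
print(dag.blocking_calls)  # 1
```

With the guard, whichever thread arrives first performs the blocking wait; the other returns without touching ray.get at all.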

Member:
Thank you for the explanation!

@@ -2174,14 +2186,16 @@ def teardown(self, kill_actors: bool = False):
"""Teardown and cancel all actor tasks for this DAG. After this
function returns, the actors should be available to execute new tasks
or compile a new DAG."""
if self._is_teardown:
rkooo567 (Contributor, Author):
This makes the API idempotent.

r"Outputs cannot be transferred via NCCL because the driver cannot "
"participate in the NCCL group"
),
match=(r"Driver cannot participate in the NCCL group\."),
rkooo567 (Contributor, Author):
This fixes an accidental merge into master.
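For context on the corrected `match=` string: `pytest.raises(..., match=pattern)` applies `re.search` to the exception message, so the pattern is a regular expression and the trailing period needs escaping. A minimal check of that semantics, using illustrative copies of the two messages:

```python
import re

# pytest.raises(..., match=pattern) applies re.search to the exception
# message, so the pattern is a regex, not a plain substring.
new_msg = "Driver cannot participate in the NCCL group."
old_msg = ("Outputs cannot be transferred via NCCL because the driver "
           "cannot participate in the NCCL group")

pattern = r"Driver cannot participate in the NCCL group\."
print(bool(re.search(pattern, new_msg)))  # True
print(bool(re.search(pattern, old_msg)))  # False
```

The old pattern no longer matched the message raised on master, which is why the test needed updating after the accidental merge.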

@@ -714,7 +700,7 @@ def test_torch_tensor_nccl_nested_dynamic(ray_start_regular):

for i in range(3):
dtype = torch.float16
args = [{j: (j, (10 * j,), dtype)} for j in range(1, i + 1)]
args = {j: (j, (10 * j,), dtype) for j in range(1, i + 1)}
rkooo567 (Contributor, Author):
Another fix for an accidental merge into master.
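The diff above swaps a list comprehension for a dict comprehension: the accidental version built a list of single-key dicts, while the test expects one dict keyed by `j`. A minimal reproduction, with a string standing in for `torch.float16` so the example has no torch dependency:

```python
# Stand-in for torch.float16; the shapes mirror the test's args.
dtype = "float16"
i = 3

# Accidental version: a list of single-key dicts.
wrong = [{j: (j, (10 * j,), dtype)} for j in range(1, i + 1)]
# Fixed version: one dict keyed by j.
right = {j: (j, (10 * j,), dtype) for j in range(1, i + 1)}

print(type(wrong).__name__)  # list
print(type(right).__name__)  # dict
print(right[2])              # (2, (20,), 'float16')
```

Any code that indexes `args[j]` works with the dict but raises on the list (integer indices past the list length) or silently returns the wrong element.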

@@ -84,6 +84,10 @@ def __reduce__(self):
raise ValueError("CompiledDAGRef cannot be pickled.")

def __del__(self):
# If the dag is already teardown, it should do nothing.
Member:
Suggested change
# If the dag is already teardown, it should do nothing.
# If the DAG has already been torn down, it should do nothing.

@@ -351,6 +274,73 @@ def test_torch_tensor_custom_comm(ray_start_regular):

from cupy.cuda import nccl

class TestNcclGroup(GPUCommunicator):
Member:
Why are there two TestNcclGroup classes in this file?

rkooo567 (Contributor, Author):
This is the fix @ruisearch42 made before; I am just reverting it. (In a local test, the test suite cannot find this folder without it, for some reason.)


@kevin85421 kevin85421 self-assigned this Oct 26, 2024
@rkooo567 rkooo567 enabled auto-merge (squash) October 27, 2024 16:42
@rkooo567 rkooo567 merged commit 0b1d0d8 into ray-project:master Oct 27, 2024
6 checks passed
edoakes pushed a commit to edoakes/ray that referenced this pull request Oct 30, 2024
Fixes remaining shutdown issues. It also fixes some failures in the nightly we accidentally merged
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
Fixes remaining shutdown issues. It also fixes some failures in the nightly we accidentally merged
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
Fixes remaining shutdown issues. It also fixes some failures in the nightly we accidentally merged

Signed-off-by: JP-sDEV <jon.pablo80@gmail.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
Fixes remaining shutdown issues. It also fixes some failures in the nightly we accidentally merged

Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Labels
go add ONLY when ready to merge, run all tests
4 participants