[Core][Compiled Graph] Fix shutdown error #48280
Conversation
self.wait_teardown(kill_actors)
return
# self.wait_teardown(kill_actors)
while True:
This was the root cause of the segfault.
I think there was a dangling monitor thread that was calling this API and waiting on ray.get, which triggered a segfault when the driver was shut down.
> calling this API and waiting for ray.get, which is triggering segfault when the driver is shutdown

- Why will this function be called after the driver process is shut down?
- This function still calls ray.get and self.wait_teardown() below. Why does the issue disappear?
- It is not called "after" shutdown. When shutdown happens, two threads (main and monitor) enter this wait_teardown, which calls ray.get. If the main thread's call finishes, the shutdown completes and the next test starts. But if the monitor thread's call has not finished before the main thread's, the dangling thread is still waiting on ray.get, which crashes because Ray has shut down and the core worker no longer exists.
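The race described above can be sketched with plain threads. This is an illustrative model only: a `threading.Event` stands in for driver shutdown, and the polling loop mirrors the `while True:` structure the monitor thread switches to instead of blocking on ray.get:

```python
import threading
import time

shutdown_event = threading.Event()  # stand-in for driver shutdown
observed = []

def monitor():
    # Instead of blocking indefinitely on a remote call (the old
    # wait_teardown -> ray.get path), poll in a loop so the thread
    # can notice that the driver has shut down and exit cleanly.
    while True:
        if shutdown_event.is_set():
            observed.append("shutdown seen")
            return
        time.sleep(0.01)

t = threading.Thread(target=monitor, daemon=True)
t.start()
shutdown_event.set()  # simulate driver shutdown
t.join(timeout=2)
assert observed == ["shutdown seen"]
```

A polling loop lets the dangling thread exit on its own rather than crashing inside a blocked ray.get after the core worker is gone.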
Thank you for the explanation!
@@ -2174,14 +2186,16 @@ def teardown(self, kill_actors: bool = False):
"""Teardown and cancel all actor tasks for this DAG. After this
function returns, the actors should be available to execute new tasks
or compile a new DAG."""
if self._is_teardown:
This makes the API idempotent.
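A minimal sketch of the idempotency guard (the class name is hypothetical and the real teardown also cancels actor tasks; only the `_is_teardown` early-return pattern is taken from the diff):

```python
import threading

class DAGSketch:
    """Illustrative stand-in for the compiled DAG, not Ray's real class."""

    def __init__(self):
        self._is_teardown = False
        self._lock = threading.Lock()
        self.teardown_count = 0

    def teardown(self, kill_actors: bool = False):
        with self._lock:
            if self._is_teardown:
                # A second caller (e.g. the monitor thread) returns
                # immediately instead of tearing down again.
                return
            self._is_teardown = True
        self.teardown_count += 1  # real code would cancel actor tasks here

dag = DAGSketch()
dag.teardown()
dag.teardown()  # idempotent: the second call is a no-op
assert dag.teardown_count == 1
```

With the guard, it no longer matters whether the main thread or the monitor thread reaches teardown first.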
r"Outputs cannot be transferred via NCCL because the driver cannot "
"participate in the NCCL group"
),
match=(r"Driver cannot participate in the NCCL group\."),
This fixes an accidental merge into master.
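For reference, pytest applies `match=` with `re.search` against the string of the exception, so the trailing `\.` escapes a literal period. A stdlib-only check of the new pattern (the message string here is assumed to mirror the error the test expects):

```python
import re

pattern = r"Driver cannot participate in the NCCL group\."
message = "Driver cannot participate in the NCCL group."  # assumed error text

# The pattern matches the expected message...
assert re.search(pattern, message) is not None
# ...and the escaped dot matches only a literal period, not any character:
assert re.search(pattern, message[:-1] + "!") is None
```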
@@ -714,7 +700,7 @@ def test_torch_tensor_nccl_nested_dynamic(ray_start_regular):

for i in range(3):
dtype = torch.float16
args = [{j: (j, (10 * j,), dtype)} for j in range(1, i + 1)]
args = {j: (j, (10 * j,), dtype) for j in range(1, i + 1)}
Another fix for an accidental merge into master.
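The two comprehensions produce different shapes: the accidental version builds a list of single-entry dicts, while the intended version builds one dict keyed by j. A minimal illustration, with a string standing in for torch.float16 to avoid the torch dependency:

```python
i = 2
dtype = "float16"  # stand-in for torch.float16

# Accidental version: a list of single-entry dicts.
as_list = [{j: (j, (10 * j,), dtype)} for j in range(1, i + 1)]
# Intended version: one dict mapping j to its spec.
as_dict = {j: (j, (10 * j,), dtype) for j in range(1, i + 1)}

assert as_list == [{1: (1, (10,), dtype)}, {2: (2, (20,), dtype)}]
assert as_dict == {1: (1, (10,), dtype), 2: (2, (20,), dtype)}
```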
@@ -84,6 +84,10 @@ def __reduce__(self):
raise ValueError("CompiledDAGRef cannot be pickled.")

def __del__(self):
# If the dag is already teardown, it should do nothing.
Suggested change:
- # If the dag is already teardown, it should do nothing.
+ # If the DAG has already been torn down, it should do nothing.
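A sketch of why the new __del__ is safe once teardown is idempotent. Both classes below are hypothetical stand-ins that only mirror the shape of CompiledDAGRef and the compiled DAG:

```python
class FakeCompiledDAG:
    """Stand-in for the compiled DAG with an idempotent teardown."""

    def __init__(self):
        self._is_teardown = False
        self.teardown_count = 0

    def teardown(self):
        if self._is_teardown:
            return  # already torn down: do nothing
        self._is_teardown = True
        self.teardown_count += 1

class FakeCompiledDAGRef:
    """Stand-in for CompiledDAGRef whose destructor triggers teardown."""

    def __init__(self, dag):
        self._dag = dag

    def __del__(self):
        # Safe even if teardown already ran, because teardown is a no-op
        # the second time around.
        self._dag.teardown()

dag = FakeCompiledDAG()
ref = FakeCompiledDAGRef(dag)
del ref         # destructor runs immediately in CPython, tearing down the DAG
dag.teardown()  # an explicit call afterwards is a no-op
assert dag.teardown_count == 1
```

Without the idempotency guard, a ref collected after an explicit teardown would trigger a second, crashing teardown from the destructor.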
@@ -351,6 +274,73 @@ def test_torch_tensor_custom_comm(ray_start_regular):

from cupy.cuda import nccl

class TestNcclGroup(GPUCommunicator):
Why are there two TestNcclGroup classes in this file?
This is the fix @ruisearch42 made before; I am just reverting it back. (In a local test, without this, the test suite cannot find this folder for some reason.)
Why are these changes needed?
Fixes remaining shutdown issues. It also fixes some nightly-test failures that we accidentally merged.
Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.