[Core] Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() #47861
… sync_reactors_.end() Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
LG. one minor question
@@ -124,6 +125,77 @@ def check_task_pending(n=0):
wait_for_condition(lambda: check_task_pending(2))

head2 = gen_head_node(
This test seems very useful. I wonder if we should extend it to more e2e tests in some stress tests?
Yea, I'll use the same pattern for other tests. I'm also exploring chaos test tools for randomly injecting network errors.
// Disconnect existing connection if there is any.
// This can happen when there is transient network error
// and the client reconnects.
syncer_.Disconnect(reactor->GetRemoteNodeID());
QQ: why don't we call this in clean_cb?
syncer_.Disconnect triggers the call of cleanup_cb.
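The point made in this thread, that disconnecting already runs the cleanup callback so no separate call is needed, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ray's implementation: `ToyServerReactor`, `CallbackSyncer`, `IsConnected`, `CleanedUp`, and the callback signature are hypothetical names, and the callback is modeled as firing immediately, whereas a real gRPC reactor would typically run it when the stream finishes.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a server-side reactor. It only knows its peer's
// node ID and a cleanup callback supplied by the syncer.
struct ToyServerReactor {
  std::string remote_node_id;
  std::function<void(const std::string &)> cleanup_cb;  // assumed signature

  // Assumption: closing the stream is modeled as an immediate callback call.
  void Disconnect() {
    if (cleanup_cb) cleanup_cb(remote_node_id);
  }
};

class CallbackSyncer {
 public:
  void Connect(const std::string &node_id) {
    auto reactor = std::make_shared<ToyServerReactor>();
    reactor->remote_node_id = node_id;
    // The cleanup callback is the only place that erases the map entry.
    reactor->cleanup_cb = [this](const std::string &id) {
      sync_reactors_.erase(id);
      cleaned_up_.push_back(id);
    };
    sync_reactors_.emplace(node_id, std::move(reactor));
  }

  void Disconnect(const std::string &node_id) {
    auto it = sync_reactors_.find(node_id);
    if (it == sync_reactors_.end()) return;  // nothing registered for this node
    // Copy the shared_ptr so the reactor outlives the erase done by its callback.
    auto reactor = it->second;
    reactor->Disconnect();
  }

  bool IsConnected(const std::string &node_id) const {
    return sync_reactors_.count(node_id) > 0;
  }

  const std::vector<std::string> &CleanedUp() const { return cleaned_up_; }

 private:
  std::unordered_map<std::string, std::shared_ptr<ToyServerReactor>> sync_reactors_;
  std::vector<std::string> cleaned_up_;  // records which nodes were cleaned up
};
```

In this model, calling `Disconnect("worker-1")` after `Connect("worker-1")` removes the entry through the callback, and a second `Disconnect` for the same node is a no-op.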
…D()) == sync_reactors_.end() (ray-project#47861) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Why are these changes needed?
The sequence of events that triggers this check failure is:
1. RayClientBidiReactor::OnDone is called with a non-ok gRPC status.
2. cleanup_cb_ of RayClientBidiReactor is called, which reconnects to the head node.
3. The reconnection causes RaySyncerService::StartSync to be called.
4. RaySyncer::Connect is called with a new RayServerBidiReactor.
5. Because the RayServerBidiReactor from the worker node's previous connection is still registered, the check fails.
This PR fixes the issue by disconnecting the old RayServerBidiReactor before adding a new one.
Related issue number
Closes #45639
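The failure sequence and the fix described under "Why are these changes needed?" can be reproduced with a small self-contained model. This is a sketch under stated assumptions rather than the actual Ray code: `ToyReactor`, `ToySyncer`, `StartSyncUnfixed`, `StartSyncFixed`, and `GenerationOf` are hypothetical names, and the check failure is modeled as a thrown `std::logic_error` instead of a process abort.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a server-side reactor.
struct ToyReactor {
  std::string remote_node_id;
  int generation;  // distinguishes the old connection from the new one
};

class ToySyncer {
 public:
  // Models the check from the PR title: a node may have at most one reactor.
  void Connect(std::shared_ptr<ToyReactor> reactor) {
    const std::string node_id = reactor->remote_node_id;
    if (sync_reactors_.find(node_id) != sync_reactors_.end()) {
      throw std::logic_error("check failed: reactor already registered");
    }
    sync_reactors_.emplace(node_id, std::move(reactor));
  }

  // Removes the reactor for a node; does nothing if none is registered.
  void Disconnect(const std::string &node_id) { sync_reactors_.erase(node_id); }

  int GenerationOf(const std::string &node_id) const {
    return sync_reactors_.at(node_id)->generation;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<ToyReactor>> sync_reactors_;
};

// Handler before the fix: registers the new reactor directly, so a reconnect
// from a node whose old reactor is still registered trips the check.
void StartSyncUnfixed(ToySyncer &syncer, std::shared_ptr<ToyReactor> reactor) {
  syncer.Connect(std::move(reactor));
}

// Handler after the fix: drops any stale reactor for the node first.
void StartSyncFixed(ToySyncer &syncer, std::shared_ptr<ToyReactor> reactor) {
  syncer.Disconnect(reactor->remote_node_id);
  syncer.Connect(std::move(reactor));
}
```

Calling `StartSyncUnfixed` twice for the same node ID throws on the second call, while `StartSyncFixed` replaces the stale entry, so the registered reactor is always the one from the latest connection.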
Checks
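- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.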