fix: close socket to prevent onCompletion call after the journal stre… #4270
@@ -81,6 +81,10 @@ class OutgoingMigration::SliceSlotMigration : private ProtocolClient {
  }

  void Cancel() {
    // Close socket for clean disconnect.
    if (Sock()->IsOpen()) {
      std::ignore = Sock()->Shutdown(SHUT_RDWR);
Don't we close the socket when we cancel below? Also, you can use the CI to reproduce instead of merging and waiting.
Also, the stack trace is:
30001➜ @ 0xc2cb6e 32 absl::lts_20240116::InlinedVector<>::empty()
30001➜ @ 0xc2c47c 32 dfly::PendingBuf::Empty()::{lambda()#1}::operator()<>()
30001➜ @ 0xc2c4ae 48 __gnu_cxx::__ops::_Iter_negate<>::operator()<>()
30001➜ @ 0xc2ba19 112 std::__find_if<>()
30001➜ @ 0xc2ab38 128 std::__find_if_not<>()
30001➜ @ 0xc29a72 128 std::find_if_not<>()
30001➜ @ 0xc28e96 160 std::all_of<>()
30001➜ @ 0xc28848 112 dfly::PendingBuf::Empty()
Yes, this stack trace shows that the object was removed.
JournalStreamer is removed before the socket.
We call WaitForInflightToComplete to wait for all the callbacks to return from JournalStreamer::Cancel().
If OnCompletion is being called, it means WaitForInflightToComplete was not called, or something else is wrong.
WaitForInflightToComplete isn't called if the context was canceled.
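The failure mode discussed above can be illustrated with a small, single-threaded sketch. It is not Dragonfly's code: `FakeSocket`, `FakeStreamer`, and `SafeToDestroy` are hypothetical names invented for illustration. The assumption is that a completion callback captures the streamer's `this` pointer, so the streamer may only be destroyed once no callback is pending, and that shutting the socket down fails all pending writes immediately instead of leaving them to fire later.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a socket that completes async writes later.
class FakeSocket {
 public:
  using Callback = std::function<void(bool ok)>;

  void AsyncWrite(Callback cb) { pending_.push_back(std::move(cb)); }

  // Assumption: shutting down fails every pending write right away,
  // so no callback can fire after this call returns.
  void Shutdown() {
    open_ = false;
    Drain(false);
  }

  // Simulates the network finishing all outstanding writes successfully.
  void CompleteAll() { Drain(true); }

  bool IsOpen() const { return open_; }
  std::size_t Pending() const { return pending_.size(); }

 private:
  void Drain(bool ok) {
    std::vector<Callback> cbs;
    cbs.swap(pending_);  // detach first so callbacks see an empty queue
    for (auto& cb : cbs) cb(ok);
  }

  bool open_ = true;
  std::vector<Callback> pending_;
};

// Hypothetical streamer whose completion callback captures `this`.
class FakeStreamer {
 public:
  explicit FakeStreamer(FakeSocket* sock) : sock_(sock) {}

  void Write() {
    ++in_flight_;
    // If the streamer were destroyed while this callback is still
    // pending, running it later would touch freed memory.
    sock_->AsyncWrite([this](bool) {
      assert(in_flight_ > 0);
      --in_flight_;
    });
  }

  std::size_t InFlight() const { return in_flight_; }

 private:
  FakeSocket* sock_;
  std::size_t in_flight_ = 0;
};

// The streamer can be destroyed only when no callback still refers to it.
bool SafeToDestroy(const FakeStreamer& s) { return s.InFlight() == 0; }
```

In this model, either waiting until the writes complete (`CompleteAll`) or shutting the socket down (`Shutdown`) brings `InFlight()` back to zero; skipping both leaves a callback that outlives the streamer.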
9591100
to
9d90e5e
…amer is destroyed
9d90e5e
to
d8138bb
@@ -81,7 +81,8 @@ class OutgoingMigration::SliceSlotMigration : private ProtocolClient {
  }

  void Cancel() {
    cntx_.Cancel();
    // Close socket for clean disconnect.
    CloseSocket();
I think streamer_.Cancel(); should work as is.
I found that we do the same for the replica. Also, streamer_.Cancel(); doesn't work without closing the socket, because we need to cancel the OnCompletion callback.
Any idea why we don't call WaitForInflightToComplete inside JournalStreamer::Cancel if cntx is cancelled?
void JournalStreamer::Cancel() {
  VLOG(1) << "JournalStreamer::Cancel";
  waker_.notifyAll();
  journal_->UnregisterOnChange(journal_cb_id_);
  if (!cntx_->IsCancelled()) {
    WaitForInflightToComplete();
  }
}
Because there are several different situations:
- we hit an error and cancel because of it
- we cancel for our own reasons
- everything works fine
- the context was canceled by an error in another fiber
These scenarios behave differently, and WaitForInflightToComplete can wait up to 2 minutes (the connection timeout) without any result. So if we cancel the connection for our own reasons, there is no point in waiting that long.
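The trade-off described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `DrainAction` and `ChooseDrainAction` are hypothetical names that do not exist in the codebase, and the rule encoded here (wait for in-flight writes only when the context was not canceled, otherwise close the socket so pending callbacks are dropped quickly) is a reading of this thread, not a description of the final implementation.

```cpp
// Hypothetical actions a Cancel() path could take to make sure no
// completion callback outlives the streamer.
enum class DrainAction {
  kWaitForInflight,  // normal shutdown: let pending writes finish
  kCloseSocket,      // canceled: fail pending writes instead of waiting
  kNothing,          // canceled and the socket is already closed
};

// Assumption: waiting is only worthwhile when the context was not
// canceled, because a canceled connection may otherwise block until
// the connection timeout expires.
DrainAction ChooseDrainAction(bool context_cancelled, bool socket_open) {
  if (!context_cancelled) {
    return DrainAction::kWaitForInflight;
  }
  return socket_open ? DrainAction::kCloseSocket : DrainAction::kNothing;
}
```

The point of the sketch is that every branch leaves no pending callback behind: the uncanceled path drains by waiting, and the canceled path drains by closing the socket.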
From my point of view, we should refactor JournalStreamer, but such a refactoring affects replication, and I don't think we should do it right now.
fixes: #4269
problem: OnCompletion is called after the JournalStreamer has been removed