Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: close socket to prevent onCompletion call after the journal stre… #4270

Merged
merged 1 commit into from
Dec 8, 2024

Conversation

BorysTheDev
Copy link
Contributor

@BorysTheDev BorysTheDev commented Dec 6, 2024

fixes: #4269

problem: OnCompletion is called after the JournalStreamer has been removed

@BorysTheDev BorysTheDev requested a review from kostasrim December 6, 2024 10:49
kostasrim
kostasrim previously approved these changes Dec 6, 2024
@@ -81,6 +81,10 @@ class OutgoingMigration::SliceSlotMigration : private ProtocolClient {
}

void Cancel() {
// Close socket for clean disconnect.
if (Sock()->IsOpen()) {
std::ignore = Sock()->Shutdown(SHUT_RDWR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we close the socket when we cancel below ? Also you can use the CI to reproduce instead of merge + wait.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the stack trace is:

30001➜    @           0xc2cb6e         32  absl::lts_20240116::InlinedVector<>::empty()
30001➜    @           0xc2c47c         32  dfly::PendingBuf::Empty()::{lambda()#1}::operator()<>()
30001➜    @           0xc2c4ae         48  __gnu_cxx::__ops::_Iter_negate<>::operator()<>()
30001➜    @           0xc2ba19        112  std::__find_if<>()
30001➜    @           0xc2ab38        128  std::__find_if_not<>()
30001➜    @           0xc29a72        128  std::find_if_not<>()
30001➜    @           0xc28e96        160  std::all_of<>()
30001➜    @           0xc28848        112  dfly::PendingBuf::Empty()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this stack trace shows that object was removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JournalStreamer is removed before Socket

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we call WaitForInflightToComplete to wait for all the callbacks to return from JournalStreamer::Cancel().
If we have OnCompletion being called, it means WaitForInflightToComplete was not called, or something else is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WaitForInflightToComplete isn't called if the context was canceled.

@BorysTheDev BorysTheDev force-pushed the journal_streamer_crash_during_teardown branch 2 times, most recently from 9591100 to 9d90e5e Compare December 6, 2024 12:02
@BorysTheDev BorysTheDev force-pushed the journal_streamer_crash_during_teardown branch from 9d90e5e to d8138bb Compare December 6, 2024 12:45
@@ -81,7 +81,8 @@ class OutgoingMigration::SliceSlotMigration : private ProtocolClient {
}

void Cancel() {
cntx_.Cancel();
// Close socket for clean disconnect.
CloseSocket();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think streamer_.Cancel(); should work as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that we do the same for replica. And streamer_.Cancel(); doesn't work without close socket because we need to cancel OnCompletion callback

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why we are not WaitForInflightToComplete inside JournalStreamer::Cancel if cntx is cancelled?

void JournalStreamer::Cancel() { VLOG(1) << "JournalStreamer::Cancel"; waker_.notifyAll(); journal_->UnregisterOnChange(journal_cb_id_); if (!cntx_->IsCancelled()) { WaitForInflightToComplete(); } }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we can have different situations

  1. we have an error and cancel by error
  2. we cancel for our reason
  3. everything works good
  4. was canceled by an error in another fiber

These different scenarios have different behavior and WaitForInflightToComplete can wait up to 2 minutes (connection timeout) without any result so if we cancel the connection by our reason there is no sense to wait so long period of time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view we should refactor JournalStreamer, but such refactoring affect replication and I don't think that we should do it right now

@BorysTheDev BorysTheDev requested a review from adiholden December 8, 2024 09:13
@BorysTheDev BorysTheDev merged commit 32ad00b into main Dec 8, 2024
9 checks passed
@BorysTheDev BorysTheDev deleted the journal_streamer_crash_during_teardown branch December 8, 2024 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hard crash on test_migration_timeout_on_sync
4 participants