
Fix network error propagation #133

Merged
merged 1 commit on Dec 17, 2020
Conversation

hannahhoward
Collaborator

Goals

I set out to make the TestNetworkDisconnect integration test reliable, and in the process identified several race conditions and issues related to network disconnects.

Implementation

Testing and debugging identified 4 distinct problems, whose fixes are contained in this PR:

  1. When a MessageQueue or PeerResponseSender is shut down due to a disconnect, it may still hold outgoing messages that won't otherwise be processed -- since those messages will never be sent, we now propagate network errors for them (see the first sketch after this list).
  2. When we implemented Permit multiple data subscriptions per original topic #128, we introduced a bug: when a topic on a data subscriber was closed, its mappings were retained. If the same topic was later reopened (say, because a new message queue was instantiated), the old subscribers would receive its messages. We now clear the mappings on close.
  3. When the PeerResponseSender is shut down, we shut down its Publisher (which republishes events from the MessageQueue up to the calling party, the QueryExecutor). However, we may have previously added messages to the MessageQueue whose notifications will arrive later, and we still need to republish those events to the calling party. Now the PeerResponseSender carries a wait group: every time we send a message to the message queue, we add to the wait group, and when we receive the last notification for a message (Send or Error), we call Done. When we shut down the PeerResponseSender, we call Wait on the wait group before shutting down the publisher (second sketch below).
  4. In the query executor, we were caching a PeerResponseSender for the entire request execution. This is not safe, because a disconnect in the middle of a request means the cached PeerResponseSender may have shut down. Now we always get the PeerResponseSender on demand through the PeerResponseManager. As an additional optimization to avoid a slowdown from this change: previously SenderForPeer, which is a get-or-create operation, always took a full write lock; now it takes a read lock first to check for an existing PeerResponseSender and, if one exists, returns it. This avoids taking a write lock in the vast majority of cases (third sketch below).
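
Below are three short Go sketches of the ideas above. They use simplified, illustrative names rather than the actual go-graphsync types and signatures. First, the idea in (1): on shutdown, any message still queued gets a network error instead of being dropped silently.

```go
package main

import (
	"errors"
	"fmt"
)

// errNetworkDisconnected stands in for whatever error the real queue reports.
var errNetworkDisconnected = errors.New("network disconnected")

// queuedMessage pairs a pending outgoing message with the callback that
// should be told about its fate (sent, or failed). Names are illustrative.
type queuedMessage struct {
	id     int
	notify func(error)
}

// messageQueue is a stand-in for a per-peer outgoing message queue.
type messageQueue struct {
	outgoing []queuedMessage
}

// shutdown is called on disconnect: every message still sitting in the
// queue will never be sent, so we propagate a network error for each one
// instead of silently dropping it.
func (mq *messageQueue) shutdown() {
	for _, msg := range mq.outgoing {
		msg.notify(errNetworkDisconnected)
	}
	mq.outgoing = nil
}

func main() {
	mq := &messageQueue{outgoing: []queuedMessage{
		{id: 1, notify: func(err error) { fmt.Println("message 1:", err) }},
		{id: 2, notify: func(err error) { fmt.Println("message 2:", err) }},
	}}
	mq.shutdown() // both callers now observe the disconnect error
}
```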

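Second, the wait-group bookkeeping from (3): Add when a message is handed to the queue, Done when its final notification arrives, Wait before tearing down the publisher. Again, the field and method names are simplified stand-ins for the real PeerResponseSender.

```go
package main

import (
	"fmt"
	"sync"
)

// peerResponseSender is a simplified stand-in: it forwards messages to a
// message queue and republishes their final outcome to subscribers.
type peerResponseSender struct {
	inFlight sync.WaitGroup
	publish  func(result string)
}

// sendMessage adds to the wait group before handing the message off; the
// returned notifier is called exactly once with the final outcome.
func (prs *peerResponseSender) sendMessage(msg string) func(result string) {
	prs.inFlight.Add(1)
	return func(result string) {
		prs.publish(msg + ": " + result) // republish to the calling party
		prs.inFlight.Done()              // last notification for this message
	}
}

// shutdown waits for every in-flight message to receive its final
// notification before tearing down the publisher, so no events are lost.
func (prs *peerResponseSender) shutdown() {
	prs.inFlight.Wait()
	fmt.Println("all notifications republished; closing publisher")
}

func main() {
	prs := &peerResponseSender{publish: func(r string) { fmt.Println(r) }}
	done := prs.sendMessage("request-1")
	go done("error: network disconnected") // final notification arrives late
	prs.shutdown()                         // blocks until the notification lands
}
```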
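Third, the read-lock-first get-or-create from (4). This is the standard double-checked RWMutex pattern; the real SenderForPeer works with peer IDs and the actual PeerResponseSender construction, so treat this as a sketch of the locking shape only.

```go
package main

import (
	"fmt"
	"sync"
)

type peerResponseSender struct{ peer string }

// peerResponseManager hands out one sender per peer, creating it lazily.
type peerResponseManager struct {
	lk      sync.RWMutex
	senders map[string]*peerResponseSender
}

// SenderForPeer checks under a read lock first, which suffices in the vast
// majority of calls; only on a miss does it take the write lock, re-checking
// under it in case another goroutine created the sender in the meantime.
func (prm *peerResponseManager) SenderForPeer(p string) *peerResponseSender {
	prm.lk.RLock()
	sender, ok := prm.senders[p]
	prm.lk.RUnlock()
	if ok {
		return sender
	}

	prm.lk.Lock()
	defer prm.lk.Unlock()
	if sender, ok := prm.senders[p]; ok {
		return sender
	}
	sender = &peerResponseSender{peer: p}
	prm.senders[p] = sender
	return sender
}

func main() {
	prm := &peerResponseManager{senders: make(map[string]*peerResponseSender)}
	fmt.Println(prm.SenderForPeer("peer-A") == prm.SenderForPeer("peer-A")) // true
}
```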
Collaborator

@dirkmc dirkmc left a comment


LGTM 👍

fix various issues causing network errors not to propagate in many cases