
Fix network error propagation #133

Merged
merged 1 commit on Dec 17, 2020
Conversation

hannahhoward
Collaborator

Goals

I set out to make the TestNetworkDisconnect integration test reliable, and in the process identified several race conditions and issues related to network disconnects.

Implementation

Testing and debugging identified 4 distinct problems, whose fixes are contained in this PR:

  1. When a MessageQueue or PeerResponseSender is shut down due to a disconnect, it may still hold outgoing messages that won't otherwise be processed -- since those messages will never be sent, we now propagate network errors for them (see the first sketch after this list).
  2. When we implemented Permit multiple data subscriptions per original topic #128, we introduced a bug: when a topic on a data subscriber was closed, its mappings were retained. If the same topic was later reopened (say, because a new message queue was instantiated), the old subscribers would receive its messages. We now clear the mappings on close.
  3. When the PeerResponseSender is shut down, we shut down its Publisher (which republishes events from the MessageQueue up to the calling party, the QueryExecutor). However, we may have previously added messages to the MessageQueue whose notifications will arrive later, and we still need to republish those events to the calling party. Now the PeerResponseSender carries a wait group: every time we send a message to the message queue, we add to the wait group, and when we receive the last notification for a message (Send or Error), we call Done. When we shut down the PeerResponseSender, we call Wait on the wait group before shutting down the publisher (second sketch below).
  4. In the query executor, we were caching a PeerResponseSender for the entire request execution. This is not safe, because a disconnect in the middle of a request means the cached PeerResponseSender may have shut down. Now we always get the PeerResponseSender on demand through the PeerResponseManager. As an additional optimization to avoid a slowdown from this change: previously SenderForPeer, which is a get-or-create operation, always took a full write lock; now it takes a read lock first to check for an existing PeerResponseSender and, if one exists, returns it. This avoids taking a write lock in the vast majority of cases (third sketch below).
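
Below are three short Go sketches of the ideas above. They use simplified, illustrative names rather than the actual go-graphsync types and signatures. First, the idea in (1): on shutdown, any message still queued gets a network error instead of being dropped silently.

```go
package main

import (
	"errors"
	"fmt"
)

// errNetworkDisconnected stands in for whatever error the real queue reports.
var errNetworkDisconnected = errors.New("network disconnected")

// queuedMessage pairs a pending outgoing message with the callback that
// should be told about its fate (sent, or failed). Names are illustrative.
type queuedMessage struct {
	id     int
	notify func(error)
}

// messageQueue is a stand-in for a per-peer outgoing message queue.
type messageQueue struct {
	outgoing []queuedMessage
}

// shutdown is called on disconnect: every message still sitting in the
// queue will never be sent, so we propagate a network error for each one
// instead of silently dropping it.
func (mq *messageQueue) shutdown() {
	for _, msg := range mq.outgoing {
		msg.notify(errNetworkDisconnected)
	}
	mq.outgoing = nil
}

func main() {
	mq := &messageQueue{outgoing: []queuedMessage{
		{id: 1, notify: func(err error) { fmt.Println("message 1:", err) }},
		{id: 2, notify: func(err error) { fmt.Println("message 2:", err) }},
	}}
	mq.shutdown() // both callers now observe the disconnect error
}
```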

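Second, the wait-group bookkeeping from (3): Add when a message is handed to the queue, Done when its final notification arrives, Wait before tearing down the publisher. Again, the field and method names are simplified stand-ins for the real PeerResponseSender.

```go
package main

import (
	"fmt"
	"sync"
)

// peerResponseSender is a simplified stand-in: it forwards messages to a
// message queue and republishes their final outcome to subscribers.
type peerResponseSender struct {
	inFlight sync.WaitGroup
	publish  func(result string)
}

// sendMessage adds to the wait group before handing the message off; the
// returned notifier is called exactly once with the final outcome.
func (prs *peerResponseSender) sendMessage(msg string) func(result string) {
	prs.inFlight.Add(1)
	return func(result string) {
		prs.publish(msg + ": " + result) // republish to the calling party
		prs.inFlight.Done()              // last notification for this message
	}
}

// shutdown waits for every in-flight message to receive its final
// notification before tearing down the publisher, so no events are lost.
func (prs *peerResponseSender) shutdown() {
	prs.inFlight.Wait()
	fmt.Println("all notifications republished; closing publisher")
}

func main() {
	prs := &peerResponseSender{publish: func(r string) { fmt.Println(r) }}
	done := prs.sendMessage("request-1")
	go done("error: network disconnected") // final notification arrives late
	prs.shutdown()                         // blocks until the notification lands
}
```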
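Third, the read-lock-first get-or-create from (4). This is the standard double-checked RWMutex pattern; the real SenderForPeer works with peer IDs and the actual PeerResponseSender construction, so treat this as a sketch of the locking shape only.

```go
package main

import (
	"fmt"
	"sync"
)

type peerResponseSender struct{ peer string }

// peerResponseManager hands out one sender per peer, creating it lazily.
type peerResponseManager struct {
	lk      sync.RWMutex
	senders map[string]*peerResponseSender
}

// SenderForPeer checks under a read lock first, which suffices in the vast
// majority of calls; only on a miss does it take the write lock, re-checking
// under it in case another goroutine created the sender in the meantime.
func (prm *peerResponseManager) SenderForPeer(p string) *peerResponseSender {
	prm.lk.RLock()
	sender, ok := prm.senders[p]
	prm.lk.RUnlock()
	if ok {
		return sender
	}

	prm.lk.Lock()
	defer prm.lk.Unlock()
	if sender, ok := prm.senders[p]; ok {
		return sender
	}
	sender = &peerResponseSender{peer: p}
	prm.senders[p] = sender
	return sender
}

func main() {
	prm := &peerResponseManager{senders: make(map[string]*peerResponseSender)}
	fmt.Println(prm.SenderForPeer("peer-A") == prm.SenderForPeer("peer-A")) // true
}
```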
Collaborator

@dirkmc dirkmc left a comment


LGTM 👍

fix various issues causing network errors not to propagate in many cases