Better shorter timeouts on Graphsync Requests #3460
Conversation
Force-pushed from 2551a43 to d91a6e6
Codecov Report

@@           Coverage Diff            @@
##           master   #3460    +/- ##
=======================================
+ Coverage      44%     44%    +<1%
=======================================
  Files         239     242      +3
  Lines       15411   15461     +50
=======================================
+ Hits         6859    6908     +49
+ Misses       7582    7572     -10
- Partials      970     981     +11
net/graphsync_fetcher.go (Outdated)

    // Timeout for a single graphsync request getting "stuck"
    // -- if no more responses are received for a period greater than this,
    // we will assume the request has hung-up and cancel it
    unresponsiveTimeout = 10 * time.Second
I'm curious -- do we have information on the upper bound of the delay we would expect with high probability from a peer with no network issues? My intuition is that we want to set this as low as we can reasonably get away with before we start killing productive connections. My uninformed intuition is also that 10 seconds is probably higher than we need and I'd love to know if this is wrong and 10 seconds is already pushing the limit.
The short version is I honestly don't know :( I would probably prefer to wait on dropping it lower until we are at least making two requests in parallel. Then, if we get a false positive and wrongly judge a request unresponsive, we can at least still pick up from the other request.
another option would be to track actual latency and use a multiple of the average latency between progress steps
defaultProgressTimeout
> another option would be to track actual latency
Not sure how it fits into priorities, and I'm guessing this comes after other fires are put out, but I love the idea of making this decision using observations. I'm imagining gathering a distribution of latencies over different data (single blocks, long chains), separating connections into "healthy" and "unhealthy", and taking the cutoff point to be some threshold (for example, the point below which 95% of "healthy connection" latencies fall).
Great start and thanks for the thorough testing, but the direct dependency on real time passing is a no-go. We're trying to root out the last of those soon.
net/graphsync_fetcher.go (Outdated)

    // Timeout for a single graphsync request getting "stuck"
    // -- if no more responses are received for a period greater than this,
    // we will assume the request has hung-up and cancel it
    unresponsiveTimeout = 10 * time.Second
defaultProgressTimeout
net/graphsync_fetcher_test.go (Outdated)

    ts, err := fetcher.FetchTipSets(ctx, final.Key(), pid0, done)
    mgs.verifyReceivedRequestCount(7)
    mgs.verifyExpectations()
    require.Errorf(t, err, "Failed fetching tipset: %s", final.Key().String())
I don't think this does what you expect it to. It makes no assertion about the content of the error message; you need to check that more explicitly.
I need to be using ErrorEqual I think :)
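For reference, the distinction the reviewer is pointing at (testify's helper is actually named `require.EqualError`): in `require.Errorf(t, err, format, args...)` the format string is only the message printed when the assertion *fails*, while `require.EqualError(t, err, want)` compares the error's message itself. The stdlib-only sketch below illustrates the two checks with a hypothetical stand-in for the fetcher call:

```go
package main

import (
	"errors"
	"fmt"
)

// fetchTipSets is a hypothetical stand-in for the fetcher call under test.
func fetchTipSets() error {
	return errors.New("fetching failed: request timed out")
}

func main() {
	err := fetchTipSets()

	// What require.Errorf(t, err, "Failed fetching tipset: %s", key)
	// actually asserts: only that err is non-nil. The format arguments
	// become the test-failure message, not an expected error string.
	fmt.Println(err != nil) // prints: true

	// What require.EqualError(t, err, want) asserts: the error message
	// itself matches the expectation.
	want := "fetching failed: request timed out"
	fmt.Println(err.Error() == want) // prints: true
}
```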
Force-pushed from a52c005 to 54b0c11
@anorth this should have issues resolved now
Thanks for all this work.
I have to admit, the tests and mock behaviour are now pretty opaque. It may all be necessary complexity, but I think it's worth another think about how they could be constructed to be more direct.
net/graphsync_fetcher.go

    // NewGraphSyncFetcher returns a GraphsyncFetcher wired up to the input Graphsync exchange and
    // attached local blockservice for reloading blocks in memory once they are returned
    func NewGraphSyncFetcher(ctx context.Context, exchange GraphExchange, blockstore bstore.Blockstore,
    -	bv consensus.SyntaxValidator, pt graphsyncFallbackPeerTracker) *GraphSyncFetcher {
    +	bv consensus.SyntaxValidator, systemClock clock.Clock, pt graphsyncFallbackPeerTracker) *GraphSyncFetcher {
Annoying package/var name collision. How about clk for the variable?
net/graphsync_fetcher.go

        return anyError
    }

    func (gsf *GraphSyncFetcher) consumeResponse(requestChan <-chan graphsync.ResponseProgress, errChan <-chan error, cancelFunc func()) error {
Please briefly describe this method's intent in a comment.
Force-pushed from 620525b to 47ef7e4
Force-pushed from a068dd4 to 94b8abe
Force-pushed from 47ef7e4 to 5cd0ffc
Converts the fetcher timeout to an unresponsiveness check: if no data is received for 10 seconds, the request is considered failed. Also provides variadic options to the fetcher to override the unresponsiveness timeout.
The graphsync fetcher now uses a clock as a dependency rather than relying on time directly.
Fix error tests to verify that error messages match what is expected
Force-pushed from 5cd0ffc to 6edb0ae
Goals
When a peer becomes unresponsive to Graphsync Fetcher requests, fail more quickly and move on to other peers
Implementation
Unresponsiveness timeout on graphsync requests
fix #3371