Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better shorter timeouts on Graphsync Requests #3460

Merged
merged 3 commits into from
Sep 26, 2019

Conversation

hannahhoward
Copy link
Contributor

Goals

When a peer becomes unresponsive to Graphsync Fetcher requests, fail more quickly and move on to other peers

Implementation

  • Timeout for a graphsync request is now based on unresponsiveness, rather than total time. IOW -- if a request stop sending data for a specific time, cancel it.
  • Because the timeout is based on unresponsiveness rather than total time, use a shorter timeout than before (10 seconds)
  • Also provides variadic options struct to the fetcher to override the
    unresponsiveness timeout
  • Add tests of timeout behavior to verify behavior when various requests simply hangup with no further response

fix #3371

@hannahhoward hannahhoward force-pushed the feat/graphsync-timeout-improvement branch from 2551a43 to d91a6e6 Compare September 18, 2019 21:23
@codecov-io
Copy link

codecov-io commented Sep 18, 2019

Codecov Report

Merging #3460 into master will increase coverage by <1%.
The diff coverage is 88%.

@@           Coverage Diff           @@
##           master   #3460    +/-   ##
=======================================
+ Coverage      44%     44%   +<1%     
=======================================
  Files         239     242     +3     
  Lines       15411   15461    +50     
=======================================
+ Hits         6859    6908    +49     
+ Misses       7582    7572    -10     
- Partials      970     981    +11

// Timeout for a single graphsync request getting "stuck"
// -- if no more responses are received for a period greater than this,
// we will assume the request has hung-up and cancel it
unresponsiveTimeout = 10 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious -- do we have information on the upper bound of the delay we would expect with high probability from a peer with no network issues? My intuition is that we want to set this as low as we can reasonably get away with before we start killing productive connections. My uninformed intuition is also that 10 seconds is probably higher than we need and I'd love to know if this is wrong and 10 seconds is already pushing the limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short version is I honestly don't know :( I would probably prefer to wait on dropping it lower till we are at least requesting two requests in parallel. Then if we get a false positive on believing ourselves unresponsive we can at least still pickup from the other request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option would be to track actual latency and use a multiple of the average latency between progress steps

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultProgressTimeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option would be to track actual latency

Not sure how it fits into priorities and I'm guessing this comes after other fires are put out but I love the idea of making this decision using observations. I'm imagining gathering a distribution of latencies over different data (single block, long chains) separating into "healthy" and "unhealthy" connections and taking the cutoff point to be some threshold (95% of latencies grouped into "healthy connections" for example).

net/graphsync_fetcher.go Outdated Show resolved Hide resolved
Copy link
Member

@anorth anorth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start and thanks for the thorough testing, but the direct dependency on real time passing is a no-go. We're trying to root out the last of those soon.

net/graphsync_fetcher.go Outdated Show resolved Hide resolved
// Timeout for a single graphsync request getting "stuck"
// -- if no more responses are received for a period greater than this,
// we will assume the request has hung-up and cancel it
unresponsiveTimeout = 10 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultProgressTimeout

net/graphsync_fetcher.go Outdated Show resolved Hide resolved
net/graphsync_fetcher.go Outdated Show resolved Hide resolved
ts, err := fetcher.FetchTipSets(ctx, final.Key(), pid0, done)
mgs.verifyReceivedRequestCount(7)
mgs.verifyExpectations()
require.Errorf(t, err, "Failed fetching tipset: %s", final.Key().String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this does what you expect it to. I makes no assertion about the content of the error message. You need to check that more explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to be using ErrorEqual I think :)

net/graphsync_fetcher_test.go Outdated Show resolved Hide resolved
@hannahhoward hannahhoward force-pushed the feat/graphsync-timeout-improvement branch 3 times, most recently from a52c005 to 54b0c11 Compare September 24, 2019 01:12
@hannahhoward hannahhoward changed the base branch from master to feat/timer-mocks September 24, 2019 01:13
@hannahhoward hannahhoward requested a review from anorth September 24, 2019 01:47
@hannahhoward
Copy link
Contributor Author

@anorth this should have issues resolved now

Copy link
Member

@anorth anorth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all this work.

I have to admit, the tests and mock behaviour are now pretty opaque. It may all be necessary complexity, but I think it's worth another think about how they could be constructed to be more direct.

}

// NewGraphSyncFetcher returns a GraphsyncFetcher wired up to the input Graphsync exchange and
// attached local blockservice for reloading blocks in memory once they are returned
func NewGraphSyncFetcher(ctx context.Context, exchange GraphExchange, blockstore bstore.Blockstore,
bv consensus.SyntaxValidator, pt graphsyncFallbackPeerTracker) *GraphSyncFetcher {
bv consensus.SyntaxValidator, systemClock clock.Clock, pt graphsyncFallbackPeerTracker) *GraphSyncFetcher {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annoying package/var name collision. How about clk for the variable?

return anyError
}

func (gsf *GraphSyncFetcher) consumeResponse(requestChan <-chan graphsync.ResponseProgress, errChan <-chan error, cancelFunc func()) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please briefly describe this method's intent in a comment.

@hannahhoward hannahhoward force-pushed the feat/graphsync-timeout-improvement branch 2 times, most recently from 620525b to 47ef7e4 Compare September 24, 2019 20:41
@hannahhoward hannahhoward force-pushed the feat/graphsync-timeout-improvement branch from 47ef7e4 to 5cd0ffc Compare September 26, 2019 18:46
converts fetcher timeout to an unresponsiveness check -- if no data is received for 10 seconds,
consider the request failed. also provides variadic options to the fetcher to override the
unresponsiveness timeout
The graphsync fetcher now uses a clock as a dependency rather than relying on time directly.
Fix error tests to verify that error messages match what is expected
@hannahhoward hannahhoward force-pushed the feat/graphsync-timeout-improvement branch from 5cd0ffc to 6edb0ae Compare September 26, 2019 18:57
@hannahhoward hannahhoward changed the base branch from feat/timer-mocks to master September 26, 2019 18:58
@hannahhoward hannahhoward merged commit 42deded into master Sep 26, 2019
@zl03jsj zl03jsj deleted the feat/graphsync-timeout-improvement branch July 14, 2022 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fetching timeout update
5 participants