Better shorter timeouts on Graphsync Requests #3460

hannahhoward · 2019-09-18T21:16:09Z

Goals

When a peer becomes unresponsive to Graphsync Fetcher requests, fail more quickly and move on to other peers

Implementation

The timeout for a Graphsync request is now based on unresponsiveness rather than total time. In other words, if a request stops sending data for a specified period, cancel it.
Because the timeout is based on unresponsiveness rather than total time, use a shorter timeout than before (10 seconds).
Also provide variadic options to the fetcher to override the unresponsiveness timeout.
Add tests of timeout behavior to verify what happens when various requests simply hang up with no further response.
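
The unresponsiveness check described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `consumeWithIdleTimeout`, `errUnresponsive`, and the `int` progress channel are hypothetical stand-ins for the fetcher's real types. The timer is re-armed each time progress arrives, so only a silent gap longer than the timeout cancels the request.

```go
package main

import (
	"errors"
	"time"
)

// errUnresponsive is a hypothetical sentinel error used only in this sketch.
var errUnresponsive = errors.New("request timed out waiting for progress")

// consumeWithIdleTimeout drains progress until the channel is closed.
// If no progress item arrives within idleTimeout, it calls cancel and
// returns errUnresponsive. It reports how many items were received.
func consumeWithIdleTimeout(progress <-chan int, idleTimeout time.Duration, cancel func()) (int, error) {
	timer := time.NewTimer(idleTimeout)
	defer timer.Stop()

	received := 0
	for {
		select {
		case _, ok := <-progress:
			if !ok {
				return received, nil // channel closed: request finished normally
			}
			received++
			// Re-arm the timer: stop it, drain a pending tick if any, then reset.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(idleTimeout)
		case <-timer.C:
			// Nothing arrived for a full idleTimeout: give up on this request.
			cancel()
			return received, errUnresponsive
		}
	}
}
```

With a 10 second idle timeout, a request that keeps delivering data every few seconds runs to completion however long it takes in total, while one that goes silent is cancelled about 10 seconds after its last response.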

fix #3371

codecov-io · 2019-09-18T21:24:13Z

Codecov Report

Merging #3460 into master will increase coverage by <1%.
The diff coverage is 88%.

@@           Coverage Diff           @@
##           master   #3460    +/-   ##
=======================================
+ Coverage      44%     44%   +<1%     
=======================================
  Files         239     242     +3     
  Lines       15411   15461    +50     
=======================================
+ Hits         6859    6908    +49     
+ Misses       7582    7572    -10     
- Partials      970     981    +11

ZenGround0 · 2019-09-18T22:43:08Z

net/graphsync_fetcher.go

+	// Timeout for a single graphsync request getting "stuck"
+	// -- if no more responses are received for a period greater than this,
+	// we will assume the request has hung-up and cancel it
+	unresponsiveTimeout = 10 * time.Second


I'm curious -- do we have information on the upper bound of the delay we would expect with high probability from a peer with no network issues? My intuition is that we want to set this as low as we can reasonably get away with before we start killing productive connections. My uninformed intuition is also that 10 seconds is probably higher than we need and I'd love to know if this is wrong and 10 seconds is already pushing the limit.

The short version is I honestly don't know :( I would prefer to wait on dropping it lower until we are making at least two requests in parallel. Then, if we get a false positive and wrongly decide a peer is unresponsive, we can at least still pick up from the other request.

another option would be to track actual latency and use a multiple of the average latency between progress steps
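
A rough sketch of that idea, assuming the fetcher records the gap between successive progress steps. The `latencyTracker` type, its methods, and the multiplier are hypothetical; nothing like this exists in the PR, and the sketch is not safe for concurrent use.

```go
package main

import "time"

// latencyTracker accumulates the observed gaps between progress steps.
// Hypothetical type for illustration only.
type latencyTracker struct {
	total time.Duration
	count int
}

// observe records one gap between two consecutive progress steps.
func (lt *latencyTracker) observe(gap time.Duration) {
	lt.total += gap
	lt.count++
}

// timeout returns multiple times the average observed gap, or fallback
// when nothing has been observed yet.
func (lt *latencyTracker) timeout(multiple int, fallback time.Duration) time.Duration {
	if lt.count == 0 {
		return fallback
	}
	average := lt.total / time.Duration(lt.count)
	return time.Duration(multiple) * average
}
```

A plain average reacts slowly to a change in network conditions, so a real implementation might prefer a moving window or a lower bound on the result; both are left out here for brevity.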

defaultProgressTimeout

another option would be to track actual latency

Not sure how this fits into priorities, and I'm guessing it comes after other fires are put out, but I love the idea of making this decision using observations. I'm imagining gathering a distribution of latencies over different data (single block, long chains), separating them into "healthy" and "unhealthy" connections, and taking the cutoff point to be some threshold (for example, the value below which 95% of the latencies from "healthy" connections fall).

anorth

Great start and thanks for the thorough testing, but the direct dependency on real time passing is a no-go. We're trying to root out the last of those soon.
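
To illustrate the concern about real time, here is a minimal sketch of a timeout that depends on an injected clock, so a test can advance time by hand instead of sleeping. The `Clock` interface, `manualClock`, and `waitForProgress` are hypothetical and are not the API of the project's `clock` package.

```go
package main

import (
	"sync"
	"time"
)

// Clock is the only time dependency the waiting code needs.
type Clock interface {
	After(d time.Duration) <-chan time.Time
}

// realClock delegates to the standard library and is used in production.
type realClock struct{}

func (realClock) After(d time.Duration) <-chan time.Time { return time.After(d) }

// waiter is one pending After call on a manualClock.
type waiter struct {
	deadline time.Time
	ch       chan time.Time
}

// manualClock only moves when Advance is called, which keeps tests deterministic.
type manualClock struct {
	mu      sync.Mutex
	now     time.Time
	waiters []waiter
}

func (c *manualClock) After(d time.Duration) <-chan time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch := make(chan time.Time, 1) // buffered so Advance never blocks on send
	c.waiters = append(c.waiters, waiter{deadline: c.now.Add(d), ch: ch})
	return ch
}

// Advance moves the clock forward and fires every waiter whose deadline has passed.
func (c *manualClock) Advance(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.now = c.now.Add(d)
	remaining := c.waiters[:0]
	for _, w := range c.waiters {
		if !w.deadline.After(c.now) {
			w.ch <- c.now
		} else {
			remaining = append(remaining, w)
		}
	}
	c.waiters = remaining
}

// waitForProgress reports whether progress arrived before the timeout elapsed.
func waitForProgress(clk Clock, progress <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-progress:
		return true
	case <-clk.After(timeout):
		return false
	}
}
```

A test can register a 10 second wait, call `Advance(9 * time.Second)` and see nothing fire, then advance one more second and see the timeout trigger, all without any real delay.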

anorth · 2019-09-19T04:36:08Z

net/graphsync_fetcher.go

+	// Timeout for a single graphsync request getting "stuck"
+	// -- if no more responses are received for a period greater than this,
+	// we will assume the request has hung-up and cancel it
+	unresponsiveTimeout = 10 * time.Second


defaultProgressTimeout

anorth · 2019-09-19T04:48:06Z

net/graphsync_fetcher_test.go

+		ts, err := fetcher.FetchTipSets(ctx, final.Key(), pid0, done)
+		mgs.verifyReceivedRequestCount(7)
+		mgs.verifyExpectations()
+		require.Errorf(t, err, "Failed fetching tipset: %s", final.Key().String())


I don't think this does what you expect it to. It makes no assertion about the content of the error message. You need to check that more explicitly.

I need to be using ErrorEqual I think :)
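
The distinction raised here can be shown with the standard library alone. In testify, `require.Errorf(t, err, msg, args...)` only asserts that `err` is non-nil; the format string is just the message printed if that assertion fails. The assertion that compares the error text is `require.EqualError(t, err, expected)`. The helpers below (`isError`, `hasMessage`, `fetchTipSet`) are hypothetical names that mimic those two checks without the dependency.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnrelated stands in for any failure other than the one the test expects.
var errUnrelated = errors.New("connection refused")

// fetchTipSet is a hypothetical fetch that always fails with the expected message.
func fetchTipSet(key string) error {
	return fmt.Errorf("Failed fetching tipset: %s", key)
}

// isError mirrors a non-nil check: any error at all satisfies it.
func isError(err error) bool {
	return err != nil
}

// hasMessage mirrors an exact message comparison on the error text.
func hasMessage(err error, want string) bool {
	return err != nil && err.Error() == want
}
```

Both `fetchTipSet("abc")` and `errUnrelated` satisfy `isError`, but only the former satisfies `hasMessage(err, "Failed fetching tipset: abc")`, which is why the weaker check can pass on the wrong failure.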

hannahhoward · 2019-09-24T01:47:44Z

@anorth this should have issues resolved now

anorth

Thanks for all this work.

I have to admit, the tests and mock behaviour are now pretty opaque. It may all be necessary complexity, but I think it's worth another think about how they could be constructed to be more direct.

anorth · 2019-09-24T04:44:03Z

net/graphsync_fetcher.go

 }

 // NewGraphSyncFetcher returns a GraphsyncFetcher wired up to the input Graphsync exchange and
 // attached local blockservice for reloading blocks in memory once they are returned
 func NewGraphSyncFetcher(ctx context.Context, exchange GraphExchange, blockstore bstore.Blockstore,
-	bv consensus.SyntaxValidator, pt graphsyncFallbackPeerTracker) *GraphSyncFetcher {
+	bv consensus.SyntaxValidator, systemClock clock.Clock, pt graphsyncFallbackPeerTracker) *GraphSyncFetcher {


Annoying package/var name collision. How about clk for the variable?
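
A small sketch of a constructor shaped along these lines: the clock dependency is held in a variable named `clk`, and the timeout is overridable through variadic options. Everything here is hypothetical (`Clock`, `Fetcher`, `Option`, `WithProgressTimeout`); it is not the signature of `NewGraphSyncFetcher`, and `defaultProgressTimeout` simply borrows the name suggested earlier in the review.

```go
package main

import "time"

// Clock is a hypothetical stand-in for the project's clock.Clock interface.
type Clock interface {
	Now() time.Time
}

// systemClock reads the real wall clock.
type systemClock struct{}

func (systemClock) Now() time.Time { return time.Now() }

// defaultProgressTimeout is used when no option overrides it.
const defaultProgressTimeout = 10 * time.Second

// Fetcher holds its clock in a field named clk, which avoids shadowing
// a package named clock inside the constructor and methods.
type Fetcher struct {
	clk             Clock
	progressTimeout time.Duration
}

// Option mutates a Fetcher during construction.
type Option func(*Fetcher)

// WithProgressTimeout overrides the default unresponsiveness timeout.
func WithProgressTimeout(d time.Duration) Option {
	return func(f *Fetcher) { f.progressTimeout = d }
}

// NewFetcher applies the defaults first and then each option in order.
func NewFetcher(clk Clock, opts ...Option) *Fetcher {
	f := &Fetcher{clk: clk, progressTimeout: defaultProgressTimeout}
	for _, opt := range opts {
		opt(f)
	}
	return f
}
```

Existing callers keep working with `NewFetcher(clk)`, while a test can pass `WithProgressTimeout` to shorten the wait.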

anorth · 2019-09-24T04:47:41Z

net/graphsync_fetcher.go

+	return anyError
+}
+
+func (gsf *GraphSyncFetcher) consumeResponse(requestChan <-chan graphsync.ResponseProgress, errChan <-chan error, cancelFunc func()) error {


Please briefly describe this method's intent in a comment.

hannahhoward force-pushed the feat/graphsync-timeout-improvement branch from 2551a43 to d91a6e6 Compare September 18, 2019 21:23

hannahhoward requested review from anorth, ZenGround0 and frrist September 18, 2019 21:23

ZenGround0 reviewed Sep 18, 2019

frrist approved these changes Sep 18, 2019

anorth requested changes Sep 19, 2019

ZenGround0 reviewed Sep 19, 2019

hannahhoward mentioned this pull request Sep 19, 2019

feat(clock): add timer functionality #3468

Merged

hannahhoward force-pushed the feat/graphsync-timeout-improvement branch 3 times, most recently from a52c005 to 54b0c11 Compare September 24, 2019 01:12

hannahhoward changed the base branch from master to feat/timer-mocks September 24, 2019 01:13

hannahhoward requested a review from anorth September 24, 2019 01:47

anorth approved these changes Sep 24, 2019

hannahhoward force-pushed the feat/graphsync-timeout-improvement branch 2 times, most recently from 620525b to 47ef7e4 Compare September 24, 2019 20:41

hannahhoward mentioned this pull request Sep 25, 2019

Add 0.5.6 release notes #3490

Merged

hannahhoward force-pushed the feat/timer-mocks branch from a068dd4 to 94b8abe Compare September 26, 2019 18:45

hannahhoward force-pushed the feat/graphsync-timeout-improvement branch from 47ef7e4 to 5cd0ffc Compare September 26, 2019 18:46

hannahhoward added 3 commits September 26, 2019 11:57

feat(net): better timeout

fa21f38

converts fetcher timeout to an unresponsiveness check -- if no data is received for 10 seconds, consider the request failed. also provides variadic options to the fetcher to override the unresponsiveness timeout

feat(net): switch to mocked out clock

0bc37cc

The graphsync fetcher now uses a clock as a dependency rather than relying on time directly.

fix(net): correctly test errors

6edb0ae

Fix error tests to verify that error messages match what is expected

hannahhoward force-pushed the feat/graphsync-timeout-improvement branch from 5cd0ffc to 6edb0ae Compare September 26, 2019 18:57

hannahhoward changed the base branch from feat/timer-mocks to master September 26, 2019 18:58

hannahhoward merged commit 42deded into master Sep 26, 2019

zl03jsj deleted the feat/graphsync-timeout-improvement branch July 14, 2022 09:49
