Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: retry server busy #3758

Merged
merged 2 commits into from
May 11, 2020
Merged

rpc: retry server busy #3758

merged 2 commits into from
May 11, 2020

Conversation

oncilla
Copy link
Contributor

@oncilla oncilla commented May 8, 2020

Add RPC retry with exponential back-off in case of quic server busy
error.
Reduce synchronous part of accept loop.

With the bump to quic-go v0.15.5 (#3732) the quic library now enforces a
max accept queue length. With our approach of every RPC is a new quic
connection, and some work being done synchronously in the accept loop
this bites on CI heavily.

With the exponential back-off and immediately spawning a go routine, we
can improve failure rate significantly.

But this should not be the final solution. Applications need to be
able to deal with requests failing. Currently, we only do one shot
testing, which might be too strict.

For investigation results, see the buildkite builds associated with
the draft PR #3750.


This change is Reviewable

@oncilla oncilla requested review from lukedirtwalker and scrye May 8, 2020 13:29
@oncilla oncilla mentioned this pull request May 8, 2020
oncilla added 2 commits May 8, 2020 17:28
Add RPC retry with exponential back-off in case of quic `server busy`
error.
Reduce synchronous part of accept loop.

With the bump to quic-go v0.15.5 (scionproto#3732) the quic library now enforces a
max accept queue length. With our approach of every RPC is a new quic
connection, and some work being done synchronously in the accept loop
this bites on CI heavily.

With the exponential back-off and immediately spawning a go routine, we
can improve failure rate significantly.

But this should not be the final solution. Applications need to be
able to deal with requests failing. Currently, we only do one shot
testing, which might be too strict.

For investigation results, see the buildkite builds associated with
the draft PR scionproto#3750.
Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @scrye)

Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @scrye)

@oncilla oncilla merged commit 778d62d into scionproto:master May 11, 2020
@oncilla oncilla deleted the pub-fix-quic branch May 11, 2020 07:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants