rpc accept loop: added backoff on logging #4974

Merged on Dec 13, 2018 (7 commits)
Changes from 3 commits
21 changes: 15 additions & 6 deletions nomad/rpc.go
@@ -84,6 +84,7 @@ type RPCContext struct {
// listen is used to listen for incoming RPC connections
func (r *rpcHandler) listen(ctx context.Context) {
defer close(r.listenerCh)
var tempDelay time.Duration
Member

I think this needs a more descriptive name: acceptLoopDelay or something along those lines.

for {
select {
case <-ctx.Done():
@@ -99,15 +100,23 @@ func (r *rpcHandler) listen(ctx context.Context) {
return
}

select {
case <-ctx.Done():
return
default:
if tempDelay == 0 {
Member

I think this code is just complex enough to warrant its own r.handleAcceptErr(err) func. It looks to me like we're safe to stick the tempDelay into the struct (although I think it should be renamed to be more descriptive) and do the sleep in the new func.

Contributor Author

@nickethier, the return on r.shutdown and the continue (if we keep it) have to stay inside the if err block, so only the sleep logic can go into handleAcceptErr. However, the r.acceptLoopDelay reset has to stay outside of the if block so the counter is cleared when err == nil, and I would be worried about the logic for acceptLoopDelay living in two places.

For example:
https://gist.github.com/cgbaker/ab90f40bcbdd97c28c057e768f39bfc9

tempDelay = 5 * time.Millisecond
Member

Let's use a const, e.g.:

	baseAcceptLoopDelay = 5 * time.Millisecond

} else {
tempDelay *= 2
}

r.logger.Error("failed to accept RPC conn", "error", err)
maxDelay := 5 * time.Second
Member

const:

maxAcceptLoopDelay = 5 * time.Second

if ne, ok := err.(net.Error); ok && ne.Temporary() {
maxDelay = 1 * time.Second
Member

const

maxAcceptLoopTemporaryDelay = 5 * time.Second

}
if tempDelay > maxDelay {
tempDelay = maxDelay
}
Member

So the open question is: What to do for non-Temporary() errors?

We handle Temporary above fine, but we tight loop on permanent errors which seems bad. I think we exit, but that requires trusting net.Error#Temporary a lot.

Member

Looking through the net code, it looks like a non-temporary failure is fatal in all the cases I've traced.

@schmichael Would it be possible/a good idea to restart the listener? If we error at net.Listen then it's pretty safe to say we're hosed and should shut down.

Member

> @schmichael Would it be possible/good idea to restart the listener? If we error at net.Listen then it's pretty safe to say we're hosed and should shut down.

Possible? Yes, although it's not necessary for this PR or to fix the underlying issue.
Good idea? In theory yes, in practice I don't think it matters.

Ideally on permanent errors we don't spin at all and wait for a SIGHUP to hopefully change things and create new, valid listeners. In reality I'm not sure anything SIGHUP changes can impact permanent errors. They seem to mostly deal with conditions that are programming errors like trying to Accept on a non-socket FD, invalid FD, negative listen backlog, etc.

The only permanent error I even think is possible to encounter in Nomad is EINVAL because we've closed the listening socket on shutdown. We already check for the shutdown condition, so we'll never need to rely on the error check in that case.

Since as far as I know permanent errors are unreachable, I'm not too worried about how we handle them. It should never matter, but we should be conservative in our approach in case it does.

r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay)
time.Sleep(tempDelay)
continue
}
tempDelay = 0

go r.handleConn(ctx, conn, &RPCContext{Conn: conn})
metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1)
8 changes: 7 additions & 1 deletion vendor/github.com/hashicorp/memberlist/Makefile

85 changes: 8 additions & 77 deletions vendor/github.com/hashicorp/memberlist/README.md

49 changes: 36 additions & 13 deletions vendor/github.com/hashicorp/memberlist/config.go

2 changes: 1 addition & 1 deletion vendor/github.com/hashicorp/memberlist/delegate.go
