rpc accept loop: added backoff on logging #4974

Merged Dec 13, 2018 (7 commits). Viewing changes from 1 commit.
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 14 additions & 1 deletion nomad/rpc.go
@@ -84,6 +84,7 @@ type RPCContext struct {
// listen is used to listen for incoming RPC connections
func (r *rpcHandler) listen(ctx context.Context) {
	defer close(r.listenerCh)
	var tempDelay time.Duration

Member:
I think this needs a more descriptive name: acceptLoopDelay or something along those lines.

	for {
		select {
		case <-ctx.Done():
@@ -105,9 +106,21 @@ func (r *rpcHandler) listen(ctx context.Context) {
		default:
		}

			r.logger.Error("failed to accept RPC conn", "error", err) // removed line: replaced by the logging call below
			if ne, ok := err.(net.Error); ok && ne.Temporary() {
				if tempDelay == 0 {
					tempDelay = 5 * time.Millisecond
				} else {
					tempDelay *= 2
				}
				if max := 1 * time.Second; tempDelay > max {
					tempDelay = max
				}
				r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay)
				time.Sleep(tempDelay)

Member:
Technically we could also exit if the context is canceled, but with a max timeout of 1s it's not a huge deal. Something like:

select {
case <-ctx.Done():
case <-time.After(tempDelay):
}

Contributor Author:
good call, will do.

Member:
So the open question is: what to do for non-Temporary() errors?

We handle Temporary errors fine above, but we tight-loop on permanent errors, which seems bad. I think we should exit, but that requires trusting net.Error#Temporary a lot.
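
For illustration, here is a standalone sketch of that exit-on-permanent option: back off on temporary errors (the same 5ms-doubled-up-to-1s shape that net/http's Server.Serve uses, which this diff borrows) and exit the loop on anything else. This is not Nomad's code; acceptLoop, handle, and loopDelay are made-up names.

package accept

import (
	"context"
	"log"
	"net"
	"time"
)

// acceptLoop backs off on temporary Accept errors and exits on permanent ones.
func acceptLoop(ctx context.Context, ln net.Listener, handle func(net.Conn)) {
	var loopDelay time.Duration
	for {
		conn, err := ln.Accept()
		if err != nil {
			// Shutdown requested: leave quietly.
			if ctx.Err() != nil {
				return
			}
			if ne, ok := err.(net.Error); ok && ne.Temporary() {
				// Exponential backoff: 5ms, doubling up to a 1s cap.
				if loopDelay == 0 {
					loopDelay = 5 * time.Millisecond
				} else {
					loopDelay *= 2
				}
				if max := 1 * time.Second; loopDelay > max {
					loopDelay = max
				}
				log.Printf("temporary accept error: %v (retrying in %s)", err, loopDelay)
				// Wait out the delay, but wake early on shutdown.
				select {
				case <-ctx.Done():
					return
				case <-time.After(loopDelay):
				}
				continue
			}
			// Permanent error: trusting net.Error.Temporary, exit rather than tight-loop.
			log.Printf("permanent accept error: %v; exiting accept loop", err)
			return
		}
		loopDelay = 0
		go handle(conn)
	}
}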

Member:
Looking through the net code, it looks like a non-temporary failure is fatal in all the cases I've traced.

@schmichael Would it be possible/a good idea to restart the listener? If we error at net.Listen then it's pretty safe to say we're hosed and should shut down.
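
For concreteness, a rough sketch of that restart idea follows; it is hypothetical, not something this PR implements, and restartableAccept, addr, and handle are made-up names (imports as in the sketch above).

// restartableAccept recreates the listener after a permanent Accept error.
// If net.Listen itself fails, we are likely hosed and the caller should
// shut down. (A real implementation would also close ln on ctx cancellation.)
func restartableAccept(ctx context.Context, addr string, handle func(net.Conn)) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			if ctx.Err() != nil {
				ln.Close()
				return ctx.Err()
			}
			if ne, ok := err.(net.Error); ok && ne.Temporary() {
				continue // a real loop would back off here, as in the diff
			}
			// Permanent error: tear down and rebuild the listener.
			ln.Close()
			if ln, err = net.Listen("tcp", addr); err != nil {
				return err
			}
			continue
		}
		go handle(conn)
	}
}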

Member (schmichael):

> @schmichael Would it be possible/a good idea to restart the listener? If we error at net.Listen then it's pretty safe to say we're hosed and should shut down.

Possible? Yes, although it's not necessary for this PR or to fix the underlying issue.
Good idea? In theory yes, in practice I don't think it matters.

Ideally on permanent errors we don't spin at all and wait for a SIGHUP to hopefully change things and create new, valid listeners. In reality I'm not sure anything SIGHUP changes can impact permanent errors. They seem to mostly deal with conditions that are programming errors like trying to Accept on a non-socket FD, invalid FD, negative listen backlog, etc.

The only permanent error I even think is possible to encounter in Nomad is EINVAL because we've closed the listening socket on shutdown. We already check for the shutdown condition, so we'll never need to rely on the error check in that case.

Since as far as I know permanent errors are unreachable, I'm not too worried about how we handle them. It should never matter, but we should be conservative in our approach in case it does.
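
One conservative reading of that, sketched below (not necessarily what this PR ultimately merged): back off on every accept error, temporary or not, so even an unexpected permanent error produces at most about one log line per second instead of a hot loop. The helper name handleAcceptErr is made up here, and acceptLoopDelay follows the rename suggested earlier in this review.

// handleAcceptErr is a sketch of a conservative helper: it backs off on
// every accept error and reports whether the accept loop should keep going.
// (r.logger and ctx are as in the diff above; the names here are invented.)
func (r *rpcHandler) handleAcceptErr(ctx context.Context, err error, delay *time.Duration) bool {
	if *delay == 0 {
		*delay = 5 * time.Millisecond
	} else {
		*delay *= 2
	}
	if max := 1 * time.Second; *delay > max {
		*delay = max
	}
	r.logger.Error("failed to accept RPC conn", "error", err, "delay", *delay)
	select {
	case <-ctx.Done():
		return false // shutting down
	case <-time.After(*delay):
		return true
	}
}

The error branch of the accept loop would then collapse to: if !r.handleAcceptErr(ctx, err, &acceptLoopDelay) { return }, followed by continue.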

				continue
			}
			tempDelay = 0

		go r.handleConn(ctx, conn, &RPCContext{Conn: conn})
		metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1)