Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc accept loop: added backoff on logging #4974

Merged
merged 7 commits into from
Dec 13, 2018
Merged

rpc accept loop: added backoff on logging #4974

merged 7 commits into from
Dec 13, 2018

Conversation

cgbaker
Copy link
Contributor

@cgbaker cgbaker commented Dec 7, 2018

see NMD-1173 or #3686 (comment)

adds throttling to error logging on the Accept() loop in the RPC listener, to prevent from overloading the disk in case some problem (out of file handles, out of memory) causes a tight loop

this change is also needed in memberlist

nomad/rpc.go Outdated
tempDelay = max
}
r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay)
time.Sleep(tempDelay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we could also exit if the context is canceled, but with a max timeout of 1s it's not a huge deal.

select {
case <-ctx.Done():
case <-time.After(tempDelay):
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, will do.

nomad/rpc.go Outdated
}
r.logger.Error("failed to accept RPC conn", "error", err, "delay", tempDelay)
time.Sleep(tempDelay)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the open question is: What to do for non-Temporary() errors?

We handle Temporary above fine, but we tight loop on permanent errors which seems bad. I think we exit, but that requires trusting net.Error#Temporary a lot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the net code. It looks like a non temporary failure is fatal in all the cases I'll traced.

@schmichael Would it be possible/good idea to restart the listener? If we error at net.Listen then its pretty safe to say we're hosed and should shutdown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schmichael Would it be possible/good idea to restart the listener? If we error at net.Listen then its pretty safe to say we're hosed and should shutdown.

Possible? Yes, although it's not necessary for this PR or to fix the underlying issue.
Good idea? In theory yes, in practice I don't think it matters.

Ideally on permanent errors we don't spin at all and wait for a SIGHUP to hopefully change things and create new, valid listeners. In reality I'm not sure anything SIGHUP changes can impact permanent errors. They seem to mostly deal with conditions that are programming errors like trying to Accept on a non-socket FD, invalid FD, negative listen backlog, etc.

The only permanent error I even think is possible to encounter in Nomad is EINVAL because we've closed the listening socket on shutdown. We already check for the shutdown condition, so we'll never need to rely on the error check in that case.

Since as far as I know permanent errors are unreachable, I'm not too worried about how we handle them. It should never matter, but we should be conservative in our approach in case it does.

Copy link
Member

@nickethier nickethier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @cgbaker!

I've added a few comments to define some consts to get rid of the magic numbers.

nomad/rpc.go Outdated
case <-ctx.Done():
return
default:
if tempDelay == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is just complex enough to warrant it's own r.handleAcceptErr(err) func. It looks to me we're safe to stick the tempDelay into the struct (although I think it should be renamed to be more descriptive), and do the sleep in the new func.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickethier , the return on r.shutdown and the continue (if we keep it) have to stay inside if the if err block, so only the sleep logic can go into handleAcceptErr. however, the r.acceptLoopDelay has to stay outside of the if block to reset the counter if err == nil, and i would be worried about the logic for acceptLoopDelay living in two places.

for example:
https://gist.github.com/cgbaker/ab90f40bcbdd97c28c057e768f39bfc9

nomad/rpc.go Outdated
@@ -84,6 +84,7 @@ type RPCContext struct {
// listen is used to listen for incoming RPC connections
func (r *rpcHandler) listen(ctx context.Context) {
defer close(r.listenerCh)
var tempDelay time.Duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be a more descriptive name. acceptLoopDelay or something along those lines

nomad/rpc.go Outdated
return
default:
if tempDelay == 0 {
tempDelay = 5 * time.Millisecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets use a const, ex:

	baseAcceptLoopDelay = 5 * time.Millisecond

nomad/rpc.go Outdated
}

r.logger.Error("failed to accept RPC conn", "error", err)
maxDelay := 5 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const:

maxAcceptLoopDelay = 5 * time.Second

nomad/rpc.go Outdated
r.logger.Error("failed to accept RPC conn", "error", err)
maxDelay := 5 * time.Second
if ne, ok := err.(net.Error); ok && ne.Temporary() {
maxDelay = 1 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

maxAcceptLoopTemporaryDelay = 5 * time.Second

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks great. Sorry for the long road to get here. 😅

My comments are just Go idioms and style, so fixup and merge as you see fit! No blockers.

nomad/rpc.go Outdated

go r.handleConn(ctx, conn, &RPCContext{Conn: conn})
metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1)
}
}

// Sleep to avoid spamming the log, with a maximum delay according to whether or not the error is temporary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go has an idiom of starting method comments with the name of the method. No big deal.

Suggested change
// Sleep to avoid spamming the log, with a maximum delay according to whether or not the error is temporary
// handleAcceptErr sleeps to avoid spamming the log, with a maximum delay
// according to whether or not the error is temporary.

nomad/rpc.go Outdated

go r.handleConn(ctx, conn, &RPCContext{Conn: conn})
metrics.IncrCounter([]string{"nomad", "rpc", "accept_conn"}, 1)
}
}

// Sleep to avoid spamming the log, with a maximum delay according to whether or not the error is temporary
func (r *rpcHandler) handleAcceptErr(err error, ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go has an idiom of making ctx the first argument in method calls.

nomad/rpc.go Outdated
func (r *rpcHandler) handleAcceptErr(err error, ctx context.Context) {
const baseAcceptLoopDelay = 5 * time.Millisecond
const maxAcceptLoopDelay = 5 * time.Second
const maxAcceptLoopDelayTemporaryError = 1 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can shorten the names since they're locals: baseDelay, maxDelay, maxDelayTempError

No need to change.

nomad/rpc.go Outdated
gologger *golog.Logger
logger log.Logger
gologger *golog.Logger
acceptLoopDelay time.Duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define this within listen() and pass it to the error handler method unless you put it here for testing.

@cgbaker cgbaker merged commit 5bcb24b into master Dec 13, 2018
@hashicorp hashicorp deleted a comment from schmichael Dec 13, 2018
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants