
RPC retry logic in nomad client's rpc.go is incorrect #9265

Closed
benbuzbee opened this issue Nov 4, 2020 · 4 comments
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/client, type/bug


benbuzbee commented Nov 4, 2020

Background

Nomad exposes HTTP APIs on the servers, and also on the clients. However, most answers must eventually come from a server, usually the leader. So, if one makes an HTTP request to a client, one can expect the request to be forwarded to a server over RPC.


A Nomad API client may connect to one of the Nomad Agent Clients to make its request. If a server becomes unavailable while the request is being processed, the Nomad Agent Client contains retry logic that attempts to hide the availability problem from the API client.
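
For concreteness, here is a minimal sketch (the address is made up) of an API client pointed at a Nomad client agent rather than a server; any read it performs is proxied by that agent to a server over RPC:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Point the API client at a *client* agent's HTTP address, not a server.
	cfg := api.DefaultConfig()
	cfg.Address = "http://nomad-client.example.internal:4646"

	c, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// The answer comes from a server (usually the leader); the client agent
	// only forwards the request over its RPC connection.
	nodes, _, err := c.Nodes().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cluster has %d nodes\n", len(nodes))
}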

At Cloudflare, we run Nomad on many thousands of machines and rely heavily on blocking queries, which involve persistent connections and therefore a high likelihood that a node will restart while a TCP connection is outstanding.

In doing so we've found several flaws in how the Nomad Agent Client attempts to retry during server availability issues. We have fixed them and have been running the fixes on our nodes for a few weeks with success.

We would like to contribute the fixes back, but we have not gotten much traction with some of the Consul library changes and need help.

Credit to Pierre Cauchois for running most of this down.

Current code

Most of that magic is in client/rpc.go. Here is a snippet with some annotations:

// If this server had an issue we should make a note so we don't try it again, it may be broken
c.servers.NotifyFailedServer(server)
// Don't retry very fatal things, or things with side-effects like writes
if retry := canRetry(args, rpcErr, firstCheck, c.config.RPCHoldTimeout); !retry { 
    return rpcErr
}

// Wait a bit to avoid thundering herd when servers fail
jitter := lib.RandomStagger(c.config.RPCHoldTimeout / structs.JitterFraction)
select {
case <-time.After(jitter):
    // Try again
    goto TRY
case <-c.shutdownCh:
}
return rpcErr

Problems

When a server becomes unavailable mid-request, API clients see errors like these instead of the request being retried:

Unexpected response code: 500 (rpc error: EOF)
Unexpected response code: 500 (rpc error: rpc error: EOF)

Why didn't the nomad client retry for us, and why are there 2 different types of errors?

Almost certainly because canRetry returned false in the above code, so let's take a quick look at what it looked like before we got our hands on it:

// canRetry returns true if the given situation is safe for a retry.
func canRetry(args interface{}, err error) bool {
	// No leader errors are always safe to retry since no state could have
	// been changed.
	if structs.IsErrNoLeader(err) {
		return true
	}

	// Reads are safe to retry for stream errors, such as if a server was
	// being shut down.
	info, ok := args.(structs.RPCInfo)
	if ok && info.IsRead() && lib.IsErrEOF(err) {
		return true
	}

	return false
}

For simplicity, let's state these facts:

  • The error is not a no-leader error (structs.IsErrNoLeader returns false); it is an EOF
  • The request is a read-only request, so info.IsRead returns true

IsErrEOF and Wrapped Errors

The first problem we found was that lib.IsErrEOF returned false even though the error was in fact an EOF error. There are two reasons for this in Nomad, plus a bonus problem in Consul.

// IsErrEOF returns true if we get an EOF error from the socket itself, or
// an EOF equivalent error from yamux.
func IsErrEOF(err error) bool {
	if err == io.EOF {
		return true
	}

	errStr := err.Error()
	if strings.Contains(errStr, yamuxStreamClosed) ||
		strings.Contains(errStr, yamuxSessionShutdown) {
		return true
	}

	return false
}

We have not had any luck getting the fixes merged into Consul, but they are very straightforward.

err == io.EOF is wrong - Fix Not Merged

err == io.EOF is not enough to solve the problem here, because Nomad returns the following wrapped error (from PR 8632 to Nomad):

conn, sc, err := p.getRPCClient(region, addr, version)
if err != nil {
	return fmt.Errorf("rpc error: %w", err)
}

In this case, err == io.EOF is false, but errors.Is(err, io.EOF) is true. The fix is to change the conditional to use errors.Is.
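
A tiny standalone snippet makes the distinction concrete:

package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// Mirrors the wrapping added in PR 8632: the EOF sentinel is preserved in
	// the error chain, but plain equality no longer holds.
	wrapped := fmt.Errorf("rpc error: %w", io.EOF)

	fmt.Println(wrapped == io.EOF)          // false
	fmt.Println(errors.Is(wrapped, io.EOF)) // true
}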

The error is a non-wrapped string in Consul - Fix Not Merged

Related: Consul has very similar code to Nomad, but without the PR that wrapped the error. In Consul, neither == nor errors.Is will work. The fix is to port the Nomad wrapping fix.

Even with wrapping, some errors are strings - Fix Not Merged

errors.Is solves Unexpected response code: 500 (rpc error: EOF) - this is the error we get when the server that the client is connected to disconnects - a single EOF.

However, consider the case where we are connected to a server that forwards the request to ANOTHER server: what happens if the OTHER server disconnects? Instead of the client getting an EOF, the FIRST server gets an EOF and wraps it, sending it back to the client as a STRING.


At this point all wrapping has been lost, eaten by the type-destroying mess that is net/rpc error handling (rpc.ServerError is just a string type). The fix is:

var serverError rpc.ServerError
if errors.As(err, &serverError) {
    return strings.HasSuffix(err.Error(), fmt.Sprintf(": %s", io.EOF.Error()))
}
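
Putting the pieces together, a sketch of a combined check (the helper name is ours, not the upstream code): errors.Is catches a wrapped io.EOF from the server we are connected to, and the suffix match catches an EOF that a forwarding server flattened into an rpc.ServerError string.

package main

import (
	"errors"
	"fmt"
	"io"
	"net/rpc"
	"strings"
)

// isRemoteEOF is a hypothetical combined helper, for illustration only.
func isRemoteEOF(err error) bool {
	if errors.Is(err, io.EOF) {
		return true
	}
	var serverError rpc.ServerError
	if errors.As(err, &serverError) {
		return strings.HasSuffix(err.Error(), fmt.Sprintf(": %s", io.EOF.Error()))
	}
	return false
}

func main() {
	wrapped := fmt.Errorf("rpc error: %w", io.EOF) // the server we talk to disconnected
	flattened := rpc.ServerError("rpc error: EOF") // the next hop disconnected; type lost
	fmt.Println(isRemoteEOF(wrapped), isRemoteEOF(flattened)) // true true
}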

Long-poll requests and timeouts - Nomad Fix Merged - Consul Fix Not Merged

The next problem we had was with blocking requests. This is a feature of both the Nomad and Consul APIs where an HTTP request can be made with an ?index query parameter to specify that the client is only interested in updates after a particular raft index. If the state is currently prior to the provided raft index, the HTTP request will block until the first of:

  1. The state changes and its index becomes >= the requested index
  2. The MaxQueryTime (5m by default) specified in the HTTP request is reached
  3. 10m passes
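
To make the shape of these requests concrete, here is a minimal blocking query via the Go API (the job name and the default agent address are assumptions); WaitIndex and WaitTime correspond to the ?index and ?wait query parameters on the HTTP API:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	c, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Block until the job's allocations change past index 1234, or the wait
	// time elapses, whichever comes first.
	q := &api.QueryOptions{
		WaitIndex: 1234,
		WaitTime:  5 * time.Minute,
	}
	allocs, meta, err := c.Jobs().Allocations("example", false, q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d allocations at index %d\n", len(allocs), meta.LastIndex)
}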

The retry logic does not appear to have accounted for these types of requests at all; let's take a look at the code:

// We can wait a bit and retry!
if time.Since(firstCheck) < c.config.RPCHoldTimeout {
    jitter := lib.RandomStagger(c.config.RPCHoldTimeout / structs.JitterFraction)
    select {
    case <-time.After(jitter):
        goto TRY
    case <-c.shutdownCh:
    }
}

This indicates that it should not retry if more than c.config.RPCHoldTimeout (5s) has elapsed. This is a provision to keep from retrying forever, but instead only as long as it might reasonably take for a new server leader to be elected so that a new, healthy connection chain to a leader can be established.

Unfortunately, in the case of a long-poll request, this condition is almost always false. If we have made a request with a maximum query time of 5 minutes (the default) then, unless our index was already up to date when we made it, far more than the 5-second RPCHoldTimeout will almost certainly have elapsed by the time an error occurs, and so a retry will not happen.

The first fix for this is to take into account the MaxQueryTime for long-poll requests and allow them to be retried for at least that long.
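
A sketch of the idea (the helper name is ours; the merged patch may differ in detail): the retry window becomes the hold timeout plus the query's own wait time, so long polls get at least their MaxQueryTime to retry in.

package main

import (
	"fmt"
	"time"
)

// retryWindow is illustrative only: plain RPCs keep the short RPCHoldTimeout
// window, while blocking queries get their MaxQueryTime added on top.
func retryWindow(holdTimeout, maxQueryTime time.Duration) time.Duration {
	if maxQueryTime > 0 {
		return holdTimeout + maxQueryTime
	}
	return holdTimeout
}

func main() {
	hold := 5 * time.Second
	fmt.Println(retryWindow(hold, 0))             // plain RPC: 5s to retry
	fmt.Println(retryWindow(hold, 5*time.Minute)) // blocking query: 5m5s to retry
}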

Long Poll Bug - Nomad fix not merged - Consul fix is the same, waiting...

The fix which we merged into Nomad for the above issue is a good one - without it retries cannot possibly work, but it does have a flaw.

Consider:

  1. A long poll is made with a timeout of 5 minutes
  2. A retryable error occurs at the 4 minute mark
  3. We make a new request to the server

What you would expect is that the second request, #3, has a timeout of 1 minute. Unfortunately, that is not what happens. The request is made again from the top, with a timeout of 5 minutes just like the first, and so the entire request could take as much as 9 minutes, even though the client asked for 5 minutes.

What is even worse is that if another retryable error occurs at the 7-minute mark, the code will not retry, because it will see that 7 minutes have elapsed against a 5-minute timeout, and so the client sees an EOF. What the client should have seen instead was a timeout at the 5-minute mark.

The correct fix for this is clear: step 3 should make a request to the server with a timeout of 1 minute (the original timeout, minus the time already elapsed). However, how to implement that is not so straightforward.

Because the RPC helpers bury the request time in interfaces and re-infer defaults on the server, there is no easy way for the RPC helper to change the request time to reflect the elapsed time.

One way to do that is with reflect. This is an inelegant solution that fixes the bug with a hammer. To fix it correctly, the client needs a way to tell the server the remaining timeout in a way that is not invisible and type-lost to the RPC function. However, the complexity and risk of such a change do not seem appropriate.
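
For illustration, a heavy-handed sketch of that reflect approach, using stand-in types (QueryOptions here is a stub standing in for structs.QueryOptions, which the real request structs embed):

package main

import (
	"fmt"
	"reflect"
	"time"
)

// QueryOptions is a stub; only the field we care about is reproduced here.
type QueryOptions struct {
	MaxQueryTime time.Duration
}

type exampleRequest struct {
	QueryOptions
	JobID string
}

// shrinkMaxQueryTime is the "hammer": find a MaxQueryTime field on the request
// (including fields promoted from embedded structs) and reduce it by the time
// already spent, so a retried long poll cannot overshoot the caller's deadline.
func shrinkMaxQueryTime(args interface{}, elapsed time.Duration) {
	v := reflect.ValueOf(args)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return
	}
	f := v.Elem().FieldByName("MaxQueryTime")
	if !f.IsValid() || !f.CanSet() || f.Kind() != reflect.Int64 {
		return
	}
	remaining := time.Duration(f.Int()) - elapsed
	if remaining < 0 {
		remaining = 0
	}
	f.SetInt(int64(remaining))
}

func main() {
	req := &exampleRequest{QueryOptions: QueryOptions{MaxQueryTime: 5 * time.Minute}, JobID: "example"}
	shrinkMaxQueryTime(req, 4*time.Minute) // a retry at the 4-minute mark
	fmt.Println(req.MaxQueryTime)          // 1m0s left for the retried request
}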


tgross commented Nov 5, 2020

Hi @benbuzbee! Thanks for this detailed context. For us on the Nomad team, I'm taking away that:

@tgross tgross added the stage/accepted label Nov 5, 2020

joel0 commented Nov 5, 2020

Re: #9266 equivalent for Consul. I have written the code for our internal build of Consul. I've been holding off on making a PR until I see what happens with the Nomad PR. We will be glad to upstream the Consul change.


tgross commented Dec 17, 2020

#8921 and #9266 shipped in 1.0.0, so closing this issue. Thanks again!

@tgross tgross closed this as completed Dec 17, 2020
@tgross tgross added this to the 1.0 milestone Dec 17, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022