
RPC retry logic in nomad client's rpc.go is incorrect #9265

Closed
benbuzbee opened this issue Nov 4, 2020 · 4 comments
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/client, type/bug


benbuzbee commented Nov 4, 2020

Background

Nomad exposes HTTP APIs on the servers, and also on the clients. However, most answers must eventually come from a server, usually the leader. So, if one makes an HTTP request to a client, one can expect the request to be forwarded to a server over RPC.


A Nomad API client may connect to one of the Nomad Agent Clients to make its request. If a server becomes unavailable while the request is being processed, the Nomad Agent Client contains retry logic that attempts to hide the availability problem from the API client.
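
For concreteness, here is a minimal sketch (the address is made up) of an API client pointed at a Nomad client agent rather than a server; any read it performs is proxied by that agent to a server over RPC:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Point the API client at a *client* agent's HTTP address, not a server.
	cfg := api.DefaultConfig()
	cfg.Address = "http://nomad-client.example.internal:4646"

	c, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// The answer comes from a server (usually the leader); the client agent
	// only forwards the request over its RPC connection.
	nodes, _, err := c.Nodes().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cluster has %d nodes\n", len(nodes))
}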

At Cloudflare, we run Nomad on many thousands of machines and rely heavily on blocking queries, which involve persistent connections and therefore a high likelihood that a node will restart while a TCP connection is outstanding.

In doing so we've found several flaws in how the Nomad Agent Client attempts to retry during server availability issues. We have fixed them and have been running the fixes on our nodes for a few weeks with success.

We would like to contribute the fixes back, but we have not gotten much traction with some of the Consul library changes and need help.

Credit to Pierre Cauchois for running most of this down.

Current code

Most of that magic is in client/rpc.go. Here is a snippet with some annotations:

// If this server had an issue we should make a note so we don't try it again, it may be broken
c.servers.NotifyFailedServer(server)
// Don't retry very fatal things, or things with side-effects like writes
if retry := canRetry(args, rpcErr, firstCheck, c.config.RPCHoldTimeout); !retry { 
    return rpcErr
}

// Wait a bit to avoid thundering herd when servers fail
jitter := lib.RandomStagger(c.config.RPCHoldTimeout / structs.JitterFraction)
select {
case <-time.After(jitter):
    // Try again
    goto TRY
case <-c.shutdownCh:
}
return rpcErr

Problems

When a server becomes unavailable mid-request, API clients see errors like these instead of the request being retried:

Unexpected response code: 500 (rpc error: EOF)
Unexpected response code: 500 (rpc error: rpc error: EOF)

Why didn't the nomad client retry for us, and why are there 2 different types of errors?

Almost certainly because canRetry returned false in the above code, so let's take a quick look at what it looked like before we got our hands on it:

// canRetry returns true if the given situation is safe for a retry.
func canRetry(args interface{}, err error) bool {
	// No leader errors are always safe to retry since no state could have
	// been changed.
	if structs.IsErrNoLeader(err) {
		return true
	}

	// Reads are safe to retry for stream errors, such as if a server was
	// being shut down.
	info, ok := args.(structs.RPCInfo)
	if ok && info.IsRead() && lib.IsErrEOF(err) {
		return true
	}

	return false
}

For simplicity, let's state these facts:

  • The error is not a no-leader error (structs.IsErrNoLeader returns false); it is an EOF
  • The request is a read-only request, so info.IsRead returns true

IsErrEOF and Wrapped Errors

The first problem we found was that lib.IsErrEOF returned false even though the error was in fact an EOF error. There are two reasons for this in Nomad, plus a bonus problem in Consul.

// IsErrEOF returns true if we get an EOF error from the socket itself, or
// an EOF equivalent error from yamux.
func IsErrEOF(err error) bool {
	if err == io.EOF {
		return true
	}

	errStr := err.Error()
	if strings.Contains(errStr, yamuxStreamClosed) ||
		strings.Contains(errStr, yamuxSessionShutdown) {
		return true
	}

	return false
}

We have not had any luck getting the fixes merged into Consul, but they are very straightforward.

err == io.EOF is wrong - Fix Not Merged

err == io.EOF is not enough to solve the problem here, because Nomad returns the following wrapped error (from PR 8632 to Nomad):

conn, sc, err := p.getRPCClient(region, addr, version)
if err != nil {
	return fmt.Errorf("rpc error: %w", err)
}

In this case, err == io.EOF is false, but errors.Is(err, io.EOF) is true. The fix is to change the conditional to use errors.Is.
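
A tiny standalone snippet makes the distinction concrete:

package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// Mirrors the wrapping added in PR 8632: the EOF sentinel is preserved in
	// the error chain, but plain equality no longer holds.
	wrapped := fmt.Errorf("rpc error: %w", io.EOF)

	fmt.Println(wrapped == io.EOF)          // false
	fmt.Println(errors.Is(wrapped, io.EOF)) // true
}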

The error is a non-wrapped string in Consul - Fix Not Merged

Related: Consul has very similar code to Nomad, but without the PR that wrapped the error. In Consul, neither == nor errors.Is will work. The fix is to port the Nomad wrapping fix.

Even with wrapping, some errors are strings - Fix Not Merged

errors.Is solves Unexpected response code: 500 (rpc error: EOF) - this is the error we get when the server that the client is connected to disconnects - a single EOF.

However, consider the case where we are connected to a server that forwards the request to ANOTHER server: what happens if the OTHER server disconnects? Instead of the client getting an EOF, the FIRST server gets an EOF and wraps it, sending it back to the client as a STRING.


At this point all wrapping has been lost, eaten by the type-destroying mess that is net/rpc error handling (rpc.ServerError is just a string type). The fix is:

var serverError rpc.ServerError
if errors.As(err, &serverError) {
    return strings.HasSuffix(err.Error(), fmt.Sprintf(": %s", io.EOF.Error()))
}
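
Putting the pieces together, a sketch of a combined check (the helper name is ours, not the upstream code): errors.Is catches a wrapped io.EOF from the server we are connected to, and the suffix match catches an EOF that a forwarding server flattened into an rpc.ServerError string.

package main

import (
	"errors"
	"fmt"
	"io"
	"net/rpc"
	"strings"
)

// isRemoteEOF is a hypothetical combined helper, for illustration only.
func isRemoteEOF(err error) bool {
	if errors.Is(err, io.EOF) {
		return true
	}
	var serverError rpc.ServerError
	if errors.As(err, &serverError) {
		return strings.HasSuffix(err.Error(), fmt.Sprintf(": %s", io.EOF.Error()))
	}
	return false
}

func main() {
	wrapped := fmt.Errorf("rpc error: %w", io.EOF) // the server we talk to disconnected
	flattened := rpc.ServerError("rpc error: EOF") // the next hop disconnected; type lost
	fmt.Println(isRemoteEOF(wrapped), isRemoteEOF(flattened)) // true true
}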

Long-poll requests and timeouts - Nomad Fix Merged - Consul Fix Not Merged

The next problem we had was with blocking requests. This is a feature of both the Nomad and Consul APIs where an HTTP request can be made with an ?index query parameter to specify that the client is only interested in updates after a particular raft index. If the state is currently prior to the provided raft index, the HTTP request will block until the first of:

  1. The state changes and its index becomes >= the requested index
  2. The MaxQueryTime (5m by default) specified in the HTTP request is reached
  3. 10m passes
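
To make the shape of these requests concrete, here is a minimal blocking query via the Go API (the job name and the default agent address are assumptions); WaitIndex and WaitTime correspond to the ?index and ?wait query parameters on the HTTP API:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	c, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Block until the job's allocations change past index 1234, or the wait
	// time elapses, whichever comes first.
	q := &api.QueryOptions{
		WaitIndex: 1234,
		WaitTime:  5 * time.Minute,
	}
	allocs, meta, err := c.Jobs().Allocations("example", false, q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d allocations at index %d\n", len(allocs), meta.LastIndex)
}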

The retry logic does not appear to have accounted for these types of requests at all; let's take a look at the code:

// We can wait a bit and retry!
if time.Since(firstCheck) < c.config.RPCHoldTimeout {
    jitter := lib.RandomStagger(c.config.RPCHoldTimeout / structs.JitterFraction)
    select {
    case <-time.After(jitter):
        goto TRY
    case <-c.shutdownCh:
    }
}

This indicates that it should not retry if more than c.config.RPCHoldTimeout (5s) has elapsed. This is a provision to keep from retrying forever, but instead only as long as it might reasonably take for a new server leader to be elected so that a new, healthy connection chain to a leader can be established.

Unfortunately, in the case of a long-poll request, this condition is almost always false. If we have made a request with a maximum query time of 5 minutes (the default) then, unless our index was already up to date when we made it, far more than the 5-second RPCHoldTimeout will almost certainly have elapsed by the time an error occurs, and so a retry will not happen.

The first fix for this is to take into account the MaxQueryTime for long-poll requests and allow them to be retried for at least that long.
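
A sketch of the idea (the helper name is ours; the merged patch may differ in detail): the retry window becomes the hold timeout plus the query's own wait time, so long polls get at least their MaxQueryTime to retry in.

package main

import (
	"fmt"
	"time"
)

// retryWindow is illustrative only: plain RPCs keep the short RPCHoldTimeout
// window, while blocking queries get their MaxQueryTime added on top.
func retryWindow(holdTimeout, maxQueryTime time.Duration) time.Duration {
	if maxQueryTime > 0 {
		return holdTimeout + maxQueryTime
	}
	return holdTimeout
}

func main() {
	hold := 5 * time.Second
	fmt.Println(retryWindow(hold, 0))             // plain RPC: 5s to retry
	fmt.Println(retryWindow(hold, 5*time.Minute)) // blocking query: 5m5s to retry
}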

Long Poll Bug - Nomad fix not merged - Consul fix is the same, waiting...

The fix which we merged into Nomad for the above issue is a good one - without it retries cannot possibly work, but it does have a flaw.

Consider:

  1. A long poll is made with a timeout of 5 minutes
  2. A retryable error occurs at the 4 minute mark
  3. We make a new request to the server

What you would expect is that the second request, #3, has a timeout of 1 minute. Unfortunately, that is not what happens. The request is made again from the top, with a timeout of 5 minutes just like the first, and so the entire request could take as much as 9 minutes, even though the client asked for 5 minutes.

What is even worse is that if another retryable error occurs at the 7-minute mark, the code will not retry, because it will see that 7 minutes have elapsed against a 5-minute timeout, and so the client sees an EOF. What the client should have seen instead was a timeout at the 5-minute mark.

The correct fix for this is clear: step 3 should make a request to the server with a timeout of 1 minute (the original timeout, minus the time already elapsed). However, how to implement that is not so straightforward.

Because the RPC helpers bury the request time in interfaces and re-infer defaults on the server, there is no easy way for the RPC helper to change the request time to reflect the elapsed time.

One way to do that is with reflect. This is an inelegant solution that fixes the bug with a hammer. To fix it correctly, the client needs a way to tell the server the remaining timeout in a way that is not invisible and type-lost to the RPC function. However, the complexity and risk of such a change do not seem appropriate.
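
For illustration, a heavy-handed sketch of that reflect approach, using stand-in types (QueryOptions here is a stub standing in for structs.QueryOptions, which the real request structs embed):

package main

import (
	"fmt"
	"reflect"
	"time"
)

// QueryOptions is a stub; only the field we care about is reproduced here.
type QueryOptions struct {
	MaxQueryTime time.Duration
}

type exampleRequest struct {
	QueryOptions
	JobID string
}

// shrinkMaxQueryTime is the "hammer": find a MaxQueryTime field on the request
// (including fields promoted from embedded structs) and reduce it by the time
// already spent, so a retried long poll cannot overshoot the caller's deadline.
func shrinkMaxQueryTime(args interface{}, elapsed time.Duration) {
	v := reflect.ValueOf(args)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return
	}
	f := v.Elem().FieldByName("MaxQueryTime")
	if !f.IsValid() || !f.CanSet() || f.Kind() != reflect.Int64 {
		return
	}
	remaining := time.Duration(f.Int()) - elapsed
	if remaining < 0 {
		remaining = 0
	}
	f.SetInt(int64(remaining))
}

func main() {
	req := &exampleRequest{QueryOptions: QueryOptions{MaxQueryTime: 5 * time.Minute}, JobID: "example"}
	shrinkMaxQueryTime(req, 4*time.Minute) // a retry at the 4-minute mark
	fmt.Println(req.MaxQueryTime)          // 1m0s left for the retried request
}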


tgross commented Nov 5, 2020

Hi @benbuzbee! Thanks for this detailed context. For us on the Nomad team, I'm taking away that:

@tgross tgross added the stage/accepted label Nov 5, 2020

joel0 commented Nov 5, 2020

Re: #9266 equivalent for Consul. I have written the code for our internal build of Consul. I've been holding off on making a PR until I see what happens with the Nomad PR. We will be glad to upstream the Consul change.


tgross commented Dec 17, 2020

#8921 and #9266 shipped in 1.0.0, so closing this issue. Thanks again!

@tgross tgross closed this as completed Dec 17, 2020
@tgross tgross added this to the 1.0 milestone Dec 17, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022