Handle FSM.Apply errors in raftApply #9991

dnephin · 2021-04-09T16:32:19Z

Previously we were inconsistently checking the response for errors. This
PR moves the response-is-error check into raftApply, so that all callers
can look at only the error response, instead of having to know that
errors could come from two places.

This should expose a few more errors that were previously hidden because
some calls to raftApply were ignoring the response return value.

Also handle errors more consistently. In some cases we would log the
error before returning it. This can be very confusing because it can
result in the same error being logged multiple times. Instead return
a wrapped error.
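
The wrap-instead-of-log point can be sketched roughly as follows. This is an illustration only, not code from this PR: `fakeApply`, `upsertToken`, and `errApplyFailed` are hypothetical names standing in for a raftApply call and its caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errApplyFailed is a hypothetical sentinel standing in for an error
// returned by raftApply.
var errApplyFailed = errors.New("raft apply failed")

// fakeApply is a hypothetical stand-in for raftApply that always fails.
func fakeApply() error {
	return errApplyFailed
}

// upsertToken adds context with %w and returns the error. It does not log,
// so the error is reported once, by whichever caller finally handles it.
func upsertToken(id string) error {
	if err := fakeApply(); err != nil {
		return fmt.Errorf("failed to apply token upsert for %q: %w", id, err)
	}
	return nil
}

func main() {
	err := upsertToken("example")
	fmt.Println(err)
	// The wrapped error still exposes the original cause to errors.Is.
	fmt.Println(errors.Is(err, errApplyFailed))
}
```

Because the cause is wrapped rather than flattened into a log line, the caller at the top of the stack can log it once with full context.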

Also fixes a bug with canRetry and chunking errors. Previously it would never
have retried on those errors because it was checking the wrong argument.

dnephin · 2021-04-09T16:35:34Z

agent/consul/acl_endpoint.go

 	// Purge the identity from the cache to prevent using the previous definition of the identity
 	a.srv.acls.cache.RemoveIdentity(tokenSecretCacheID(token.SecretID))

-	if respErr, ok := resp.(error); ok {
-		return respErr
-	}


This is one of a few places where we were treating these errors differently. As far as I can tell, if the FSM.Apply returns an error, we shouldn't consider the operation a success, so I suspect this new behaviour is actually more correct.

We keep the identity in the cache if the FSM.Apply errors, because the operation did not complete.

There are a bunch more of these cases in this file.
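
A simplified sketch of the caller-side behaviour described in this comment, assuming raftApply now reports FSM.Apply failures through its error return. The `cache` type, `apply`, and `deleteToken` are hypothetical stand-ins, not the real ACL endpoint code.

```go
package main

import (
	"errors"
	"fmt"
)

// cache is a hypothetical stand-in for the ACL identity cache.
type cache map[string]bool

// apply is a hypothetical stand-in for raftApply after this change: an
// FSM.Apply failure arrives as the error return value.
func apply(fail bool) (interface{}, error) {
	if fail {
		return nil, errors.New("fsm apply failed")
	}
	return true, nil
}

// deleteToken purges the cached identity only after apply succeeds.
func deleteToken(c cache, id string, fail bool) error {
	if _, err := apply(fail); err != nil {
		// The operation did not complete, so the cache entry is kept.
		return fmt.Errorf("failed to apply token delete request: %w", err)
	}
	delete(c, id)
	return nil
}

func main() {
	c := cache{"token-1": true}
	fmt.Println(deleteToken(c, "token-1", true), c["token-1"])
	fmt.Println(deleteToken(c, "token-1", false), c["token-1"])
}
```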

dnephin · 2021-04-09T16:41:35Z

agent/consul/rpc.go

+// raftApplyMsgpack encodes the msg using msgpack and calls raft.Apply. See
+// raftApplyWithEncoder.
+// Deprecated: use raftApplyMsgpack
 func (s *Server) raftApply(t structs.MessageType, msg interface{}) (interface{}, error) {
 	return s.raftApplyMsgpack(t, msg)
 }

-// raftApplyMsgpack will msgpack encode the request and then run it through raft,
-// then return the FSM response along with any errors.
+// raftApplyMsgpack encodes the msg using msgpack and calls raft.Apply. See
+// raftApplyWithEncoder.
 func (s *Server) raftApplyMsgpack(t structs.MessageType, msg interface{}) (interface{}, error) {
 	return s.raftApplyWithEncoder(t, msg, structs.Encode)
 }

-// raftApplyProtobuf will protobuf encode the request and then run it through raft,
-// then return the FSM response along with any errors.
+// raftApplyProtobuf encodes the msg using protobuf and calls raft.Apply. See
+// raftApplyWithEncoder.
 func (s *Server) raftApplyProtobuf(t structs.MessageType, msg interface{}) (interface{}, error) {
 	return s.raftApplyWithEncoder(t, msg, structs.EncodeProtoInterface)
 }

-// raftApplyWithEncoder is used to encode a message, run it through raft,
-// and return the FSM response along with any errors. Unlike raftApply this
-// takes the encoder to use as an argument.
-func (s *Server) raftApplyWithEncoder(t structs.MessageType, msg interface{}, encoder raftEncoder) (interface{}, error) {
+// raftApplyWithEncoder encodes a message, and then calls raft.Apply with the
+// encoded message. Returns the FSM response along with any errors. If the
+// FSM.Apply response is an error it will be returned as the error return
+// value with a nil response.


I updated the docs here and deprecated the old raftApply.

dnephin · 2021-04-09T16:41:58Z

agent/consul/rpc.go

+			return nil, ErrChunkingResubmit
 		}
 		// We expect that this conversion should always work
 		chunkedSuccess, ok := resp.(raftchunking.ChunkingSuccess)
 		if !ok {
 			return nil, errors.New("unknown type of response back from chunking FSM")
 		}
-		// Return the inner wrapped response
-		return chunkedSuccess.Response, nil
+		resp = chunkedSuccess.Response
 	}

+	if err, ok := resp.(error); ok {
+		return nil, err
+	}


This is the core of the change.
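
For readers without the full file open, the shape of the new check can be sketched in isolation. This is a simplified illustration under assumptions: `chunkingSuccess` is a hypothetical stand-in for raftchunking.ChunkingSuccess, and `unwrapResponse` and `errFSM` are names invented for the example rather than functions in rpc.go.

```go
package main

import (
	"errors"
	"fmt"
)

// chunkingSuccess is a hypothetical stand-in for raftchunking.ChunkingSuccess.
type chunkingSuccess struct {
	Response interface{}
}

// errFSM is a hypothetical error returned by an FSM.Apply implementation.
var errFSM = errors.New("fsm rejected the request")

// unwrapResponse mirrors the order used in the diff: unwrap a chunked
// response first, then promote an error response to the error return value.
func unwrapResponse(resp interface{}, chunked bool) (interface{}, error) {
	if chunked {
		cs, ok := resp.(chunkingSuccess)
		if !ok {
			return nil, errors.New("unknown type of response back from chunking FSM")
		}
		resp = cs.Response
	}
	if err, ok := resp.(error); ok {
		return nil, err
	}
	return resp, nil
}

func main() {
	fmt.Println(unwrapResponse(true, false))
	fmt.Println(unwrapResponse(errFSM, false))
	fmt.Println(unwrapResponse(chunkingSuccess{Response: errFSM}, true))
}
```

Assigning the inner response back to `resp` instead of returning it directly is what lets the single error check cover both the chunked and the non-chunked path.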

dnephin · 2021-04-09T16:42:48Z

agent/consul/session_endpoint.go

@@ -160,10 +160,6 @@ func (s *Session) Apply(args *structs.SessionRequest, reply *string) error {
 		s.srv.clearSessionTimer(args.Session.ID)
 	}

-	if respErr, ok := resp.(error); ok {
-		return respErr
-	}


Another case where we handled the errors differently.

dnephin · 2021-04-09T16:58:21Z

agent/consul/rpc.go

@@ -789,17 +796,19 @@ func (s *Server) raftApplyWithEncoder(t structs.MessageType, msg interface{}, en
 			// apply function. Downstream client code expects to see any error
 			// from the FSM (as opposed to the apply itself) and decide whether
 			// it can retry in the future's response.
-			return ErrChunkingResubmit, nil
+			return nil, ErrChunkingResubmit


Hmm, I need to look at this return nil, ErrChunkingResubmit part of the change. The handling of this in canRetry is strange.

Ya, the handling of ErrChunkingResubmit seems completely broken. It's only ever referenced in one place here: https://github.com/hashicorp/consul/blob/master/agent/consul/rpc.go#L530-L539

But looking at the callers of canRetry, there's no way that an error is ever going to be passed as the args argument:

consul/agent/consul/client.go, line 289 in a02245b:

if retry := canRetry(args, rpcErr); !retry {

Here args is going to be the Request struct for the RPC.

consul/agent/consul/rpc.go, line 607 in a02245b:

if rpcErr != nil && canRetry(info, rpcErr) {

Here info is of type RPCInfo, not an error.

So it looks like this was always broken.

I pushed another commit which should fix this problem.
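
To make the bug concrete, here is a reduced sketch of the old and new checks side by side. It covers only the chunking branch and is not the real canRetry, which has other conditions. `errChunkingResubmit`, `canRetryOld`, and `canRetryNew` are hypothetical names for the example, and the error text is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errChunkingResubmit is a hypothetical stand-in for ErrChunkingResubmit.
var errChunkingResubmit = errors.New("please resubmit call for rechunking")

// canRetryOld reproduces the broken check: it type-asserts the request
// info, which callers never populate with an error, so it never matches.
func canRetryOld(info interface{}, err error) bool {
	intErr, ok := info.(error)
	return ok && strings.Contains(intErr.Error(), errChunkingResubmit.Error())
}

// canRetryNew inspects the err argument, which is where the resubmit error
// arrives once raftApply returns it as the error value.
func canRetryNew(info interface{}, err error) bool {
	return err != nil && strings.Contains(err.Error(), errChunkingResubmit.Error())
}

func main() {
	// info is a hypothetical request struct, as a real caller would pass.
	info := struct{ Datacenter string }{Datacenter: "dc1"}
	fmt.Println(canRetryOld(info, errChunkingResubmit))
	fmt.Println(canRetryNew(info, errChunkingResubmit))
}
```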

dnephin · 2021-04-15T15:41:05Z

agent/consul/rpc.go

-	// If we are chunking and it doesn't seem to have completed, try again
-	intErr, ok := args.(error)
-	if ok && strings.Contains(intErr.Error(), ErrChunkingResubmit.Error()) {
+	// If we are chunking and it doesn't seem to have completed, try again.
+	if err != nil && strings.Contains(err.Error(), ErrChunkingResubmit.Error()) {


This is the bug fix for the second bug I noticed.

Is there a test case that exercises this ErrChunkingResubmit flow to verify this is a-ok?

There must not be, because it was previously broken. I'm not sure how chunking works, but I'll see if I can write a test for it.

I'll work on a test in a follow-up. I think there is more cleanup to do around canRetry to prevent similar bugs in the future.

rboyer

LGTM (with small nonblocking comments)

Previously canRetry attempted to retrieve ErrChunkingResubmit from args; however, no caller ever passed an error as args. With the change to raftApply that moves this error to the error return value, it is now possible to receive it through the err argument. This commit updates canRetry to check for ErrChunkingResubmit in err.

hashicorp-ci · 2021-04-20T17:59:20Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/354044.

hashicorp-ci · 2021-04-20T18:01:16Z

🍒✅ Cherry pick of commit ce6bf56 onto release/1.10.x succeeded!

Handle FSM.Apply errors in raftApply

The signature of the `raftApply` function requires that the caller unwrap the first returned value (the response from `FSM.Apply`) to see if it's an error. This puts the burden on the caller to remember to check two different places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error as the error value. Similar work was done in Consul in hashicorp/consul#9991. This eliminates some boilerplate and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which makes it tempting to remove the first return value entirely), there are a few that return `bool` for some reason and Variables relies on the response value for proper CAS checking.

Co-authored-by: Tim Gross <tgross@hashicorp.com>

dnephin added the theme/internal-cleanup and pr/no-changelog labels Apr 9, 2021

dnephin requested a review from a team April 9, 2021 17:34

mikemorris added this to the 1.10.0 milestone Apr 15, 2021

rboyer approved these changes Apr 19, 2021

dnephin added 2 commits April 20, 2021 13:29

dnephin force-pushed the dnephin/handle-raft-apply-errors branch from 7191919 to 87cd3fc Compare April 20, 2021 17:29

dnephin merged commit ce6bf56 into master Apr 20, 2021

dnephin deleted the dnephin/handle-raft-apply-errors branch April 20, 2021 17:58

dnephin added the backport/1.10 label Apr 20, 2021

samsalisbury pushed a commit that referenced this pull request Apr 20, 2021

Merge pull request #9991 from hashicorp/dnephin/handle-raft-apply-errors

b26938e

Handle FSM.Apply errors in raftApply

dnephin mentioned this pull request Apr 20, 2021

rpc: some cleanup of canRetry and ForwardRPC #10075

Merged

dnephin mentioned this pull request Nov 26, 2021

ca: clean up unnecessary raft.Apply response checking #11673

Merged

Amier3 mentioned this pull request Jun 13, 2022

Changed deprecated method calls. #13424

Closed

tgross mentioned this pull request Mar 1, 2023

handle FSM.Apply errors in raftApply hashicorp/nomad#16287

Merged
