Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle FSM.Apply errors in raftApply #9991

Merged
merged 2 commits into from
Apr 20, 2021
Merged

Conversation

dnephin
Copy link
Contributor

@dnephin dnephin commented Apr 9, 2021

Fixes #9970

Previously we were inconsistently checking the response for errors. This
PR moves the response-is-error check into raftApply, so that all callers
can look at only the error response, instead of having to know that
errors could come from two places.

This should expose a few more errors that were previously hidden because
some calls to raftApply were ignoring the response return value.

Also handle errors more consistently. In some cases we would log the
error before returning it. This can be very confusing because it can
result in the same error being logged multiple times. Instead return
a wrapped error.

Also fixes a bug with canRetry and chunking errors. Previously it never would
have retried on those errors because it was looking at the wrong arg.

@dnephin dnephin added theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization pr/no-changelog PR does not need a corresponding .changelog entry labels Apr 9, 2021
Comment on lines 734 to -742
// Purge the identity from the cache to prevent using the previous definition of the identity
a.srv.acls.cache.RemoveIdentity(tokenSecretCacheID(token.SecretID))

if respErr, ok := resp.(error); ok {
return respErr
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of a few places where we were treating these errors differently. As far as I can tell, if the FSM.Apply returns an error, we shouldn't consider the operation a success, so I suspect this new behaviour is actually more correct.

We keep the identity in the cache if the FSM.Apply errors, because the operation did not complete.

There are a bunch more of these cases in this file.

Comment on lines +733 to +754
// raftApplyMsgpack encodes the msg using msgpack and calls raft.Apply. See
// raftApplyWithEncoder.
// Deprecated: use raftApplyMsgpack
func (s *Server) raftApply(t structs.MessageType, msg interface{}) (interface{}, error) {
return s.raftApplyMsgpack(t, msg)
}

// raftApplyMsgpack will msgpack encode the request and then run it through raft,
// then return the FSM response along with any errors.
// raftApplyMsgpack encodes the msg using msgpack and calls raft.Apply. See
// raftApplyWithEncoder.
func (s *Server) raftApplyMsgpack(t structs.MessageType, msg interface{}) (interface{}, error) {
return s.raftApplyWithEncoder(t, msg, structs.Encode)
}

// raftApplyProtobuf will protobuf encode the request and then run it through raft,
// then return the FSM response along with any errors.
// raftApplyProtobuf encodes the msg using protobuf and calls raft.Apply. See
// raftApplyWithEncoder.
func (s *Server) raftApplyProtobuf(t structs.MessageType, msg interface{}) (interface{}, error) {
return s.raftApplyWithEncoder(t, msg, structs.EncodeProtoInterface)
}

// raftApplyWithEncoder is used to encode a message, run it through raft,
// and return the FSM response along with any errors. Unlike raftApply this
// takes the encoder to use as an argument.
func (s *Server) raftApplyWithEncoder(t structs.MessageType, msg interface{}, encoder raftEncoder) (interface{}, error) {
// raftApplyWithEncoder encodes a message, and then calls raft.Apply with the
// encoded message. Returns the FSM response along with any errors. If the
// FSM.Apply response is an error it will be returned as the error return
// value with a nil response.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the docs here and deprecated the old raftApply.

Comment on lines +799 to +810
return nil, ErrChunkingResubmit
}
// We expect that this conversion should always work
chunkedSuccess, ok := resp.(raftchunking.ChunkingSuccess)
if !ok {
return nil, errors.New("unknown type of response back from chunking FSM")
}
// Return the inner wrapped response
return chunkedSuccess.Response, nil
resp = chunkedSuccess.Response
}

if err, ok := resp.(error); ok {
return nil, err
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core of the change.

@@ -160,10 +160,6 @@ func (s *Session) Apply(args *structs.SessionRequest, reply *string) error {
s.srv.clearSessionTimer(args.Session.ID)
}

if respErr, ok := resp.(error); ok {
return respErr
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another case where we handled the errors differently.

@@ -789,17 +796,19 @@ func (s *Server) raftApplyWithEncoder(t structs.MessageType, msg interface{}, en
// apply function. Downstream client code expects to see any error
// from the FSM (as opposed to the apply itself) and decide whether
// it can retry in the future's response.
return ErrChunkingResubmit, nil
return nil, ErrChunkingResubmit
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this return nil, ErrChunkingResubmit part of the change I need to look at. The handling of this in canRetry is strange.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, the handling of ErrChunkingResubmit seems completely broken. It's only ever referenced in one place here: https://github.com/hashicorp/consul/blob/master/agent/consul/rpc.go#L530-L539

But looking at the callers to canRetry, there's no way that an error is ever going to be passed as the args argument:

So it looks like this was always broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed another commit which should fix this problem.

@vercel vercel bot temporarily deployed to Preview – consul-ui-staging April 9, 2021 17:34 Inactive
@vercel vercel bot temporarily deployed to Preview – consul April 9, 2021 17:34 Inactive
@dnephin dnephin requested a review from a team April 9, 2021 17:34
Comment on lines -537 to +539
// If we are chunking and it doesn't seem to have completed, try again
intErr, ok := args.(error)
if ok && strings.Contains(intErr.Error(), ErrChunkingResubmit.Error()) {
// If we are chunking and it doesn't seem to have completed, try again.
if err != nil && strings.Contains(err.Error(), ErrChunkingResubmit.Error()) {
Copy link
Contributor Author

@dnephin dnephin Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug fix for the second bug I noticed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test case that exercises this ErrChunkingResubmit flow to verify this is a-ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must not be, because it was previously broken. I'm not sure how chunking works, I'll see if I can write a test for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll work on a test in a follow up. I think there is more cleanup to do around canRetry to prevent similar bugs in the future.

@mikemorris mikemorris added this to the 1.10.0 milestone Apr 15, 2021
Copy link
Member

@rboyer rboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (with small nonblocking comments)

Previously we were inconsistently checking the response for errors. This
PR moves the response-is-error check into raftApply, so that all callers
can look at only the error response, instead of having to know that
errors could come from two places.

This should expose a few more errors that were previously hidden because
in some calls to raftApply we were ignoring the response return value.

Also handle errors more consistently. In some cases we would log the
error before returning it. This can be very confusing because it can
result in the same error being logged multiple times. Instead return
a wrapped error.
Previously canRetry was attempting to retrieve this error from args, however there was never
any callers that would pass an error to args.

With the change to raftApply to move this error to the error return value, it is now possible
to receive this error from the err argument.

This commit updates canRetry to check for ErrChunkingResubmit in err.
@dnephin dnephin force-pushed the dnephin/handle-raft-apply-errors branch from 7191919 to 87cd3fc Compare April 20, 2021 17:29
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging April 20, 2021 17:29 Inactive
@vercel vercel bot temporarily deployed to Preview – consul April 20, 2021 17:29 Inactive
@dnephin dnephin merged commit ce6bf56 into master Apr 20, 2021
@dnephin dnephin deleted the dnephin/handle-raft-apply-errors branch April 20, 2021 17:58
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/354044.

@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/354070.

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit ce6bf56 onto release/1.10.x succeeded!

samsalisbury pushed a commit that referenced this pull request Apr 20, 2021
@Amier3 Amier3 mentioned this pull request Jun 13, 2022
3 tasks
tgross added a commit to hashicorp/nomad that referenced this pull request Mar 2, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
tgross added a commit to hashicorp/nomad that referenced this pull request Mar 2, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
tgross added a commit to hashicorp/nomad that referenced this pull request Mar 2, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
tgross added a commit to hashicorp/nomad that referenced this pull request Mar 2, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
tgross added a commit to hashicorp/nomad that referenced this pull request Mar 2, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
tgross added a commit to hashicorp/nomad that referenced this pull request Mar 2, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
philrenaud pushed a commit to hashicorp/nomad that referenced this pull request Mar 14, 2023
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.

Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
hashicorp/consul#9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:

* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
  discovery or CSI plugin states

Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-changelog PR does not need a corresponding .changelog entry theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle FSM.Apply errors in Server.raftApply to avoid missing the error
4 participants