Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes accidental state store updates from output-side fixups. #3867

Merged
merged 3 commits into from
Feb 7, 2018

Conversation

slackpad
Copy link
Contributor

@slackpad slackpad commented Feb 7, 2018

While looking into the follow up PR #3866 we realized there was something strange going on (see this comment):

While I was testing I noticed that node-level checks also have churn now as well, so I'm going to expand the unit test coverage and try to nail all the combinations so we know we've got a complete fix for this.

It turns out that this was a super subtle manifestation of some output-side fixups attempting to avoid null in JSON output for empty slices. Here's an example of some offending code - https://github.com/hashicorp/consul/blob/v1.0.4/agent/agent_endpoint.go#L172-L178. Since the slice is of pointers, this code is actually updating the underlying service objects in a racy way, when all it wants to do is fixup the output JSON.

Here's an even more insidious example - https://github.com/hashicorp/consul/blob/v1.0.4/agent/health_endpoint.go#L45-L49. For a client agent this is OK, since it's messing with a copy of an object that is received from the RPC call, which is a throwaway thing. If you hit this endpoint from a server, though, the in-memory RPC handler doesn't do a copy as a performance optimization, so you end up getting pointers back into the state store. Since this is immutable, that's totally fine, but it's not fine to update those objects.

Because we are subtly corrupting state, but in a kind of an equivalent way (subbing a make([]string, 0) for a nil slice), things mostly work OK, except in the anti-entropy syncing code which ends up in https://github.com/hashicorp/consul/blob/v1.0.4/agent/structs/structs.go#L514, which doesn't consider those things equal.

By auditing for code like this and properly copying these objects before modifying, we can avoid this corruption, which is what this PR does. By fixing it at what we think is the root cause, I think we can close #3866, which I think was working around the issue.

@slackpad slackpad added the type/bug Feature does not function as expected label Feb 7, 2018
@slackpad slackpad added this to the 1.0.5 milestone Feb 7, 2018
@slackpad
Copy link
Contributor Author

slackpad commented Feb 7, 2018

Without this fix, looking at the health endpoint from a server to see if there's churn could cause the state store to get modified in a way that causes churn!

@42wim
Copy link
Contributor

42wim commented Feb 7, 2018

I updated my test-cluster with your patch and no unnecessary service/check updates anymore.

@slackpad
Copy link
Contributor Author

slackpad commented Feb 7, 2018 via email

args := &structs.ServiceSpecificRequest{
Datacenter: "dc1",
ServiceName: "redis",
}
var before structs.IndexedHealthChecks
if err := a.RPC("Health.ServiceChecks", args, &before); err != nil {
var before structs.IndexedCheckServiceNodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the test have to change from calling Health.ServiceChecks to Health.ServiceNodes to illustrate this problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem wasn't uncovered by this test since you actually have to hit specific HTTP API endpoints to corrupt the data, but I did add a node-level check as part of trying to chase this down, so I changed this RPC endpoint to one that returns the service's check + the node-level checks to make this test more robust. It was when I could see this test passing but run this in an integrated fashion and watching /v1/health/service/ and see churn that I knew something super weird was happening!

if c.ServiceTags == nil {
c.ServiceTags = make([]string, 0)
clone := *c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what issue you are referring to with this comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You fixed it in c2a59f1, around line 150.

Did a sweep of 84d6ac2
and checked them all.
@slackpad
Copy link
Contributor Author

slackpad commented Feb 7, 2018

Linking to #2940 where these issues were introduced.

Copy link
Contributor

@preetapan preetapan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants