-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes accidental state store updates from output-side fixups. #3867
Conversation
Without this fix, looking at the health endpoint from a server to see if there's churn could cause the state store to get modified in a way that causes churn! |
I updated my test-cluster with your patch and no unnecessary service/check updates anymore. |
Thank you for running an integrated check on this - it was a weird one!
…On Feb 7, 2018 5:30 AM, "Wim" ***@***.***> wrote:
I updated my test-cluster with your patch and no unnecessary service/check
updates anymore.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3867 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AApG5X8Jxnl_bor4UnSknXtgqanHaFSsks5tSaWSgaJpZM4R8IzP>
.
|
args := &structs.ServiceSpecificRequest{ | ||
Datacenter: "dc1", | ||
ServiceName: "redis", | ||
} | ||
var before structs.IndexedHealthChecks | ||
if err := a.RPC("Health.ServiceChecks", args, &before); err != nil { | ||
var before structs.IndexedCheckServiceNodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the test have to change from calling Health.ServiceChecks to Health.ServiceNodes to illustrate this problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem wasn't uncovered by this test since you actually have to hit specific HTTP API endpoints to corrupt the data, but I did add a node-level check as part of trying to chase this down, so I changed this RPC endpoint to one that returns the service's check + the node-level checks to make this test more robust. It was when I could see this test passing but run this in an integrated fashion and watching /v1/health/service/ and see churn that I knew something super weird was happening!
if c.ServiceTags == nil { | ||
c.ServiceTags = make([]string, 0) | ||
clone := *c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what issue you are referring to with this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You fixed it in c2a59f1, around line 150.
Did a sweep of 84d6ac2 and checked them all.
Linking to #2940 where these issues were introduced. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now
While looking into the follow up PR #3866 we realized there was something strange going on (see this comment):
It turns out that this was a super subtle manifestation of some output-side fixups attempting to avoid
null
in JSON output for empty slices. Here's an example of some offending code - https://github.com/hashicorp/consul/blob/v1.0.4/agent/agent_endpoint.go#L172-L178. Since the slice is of pointers, this code is actually updating the underlying service objects in a racy way, when all it wants to do is fixup the output JSON.Here's an even more insidious example - https://github.com/hashicorp/consul/blob/v1.0.4/agent/health_endpoint.go#L45-L49. For a client agent this is OK, since it's messing with a copy of an object that is received from the RPC call, which is a throwaway thing. If you hit this endpoint from a server, though, the in-memory RPC handler doesn't do a copy as a performance optimization, so you end up getting pointers back into the state store. Since this is immutable, that's totally fine, but it's not fine to update those objects.
Because we are subtly corrupting state, but in a kind of an equivalent way (subbing a
make([]string, 0)
for anil
slice), things mostly work OK, except in the anti-entropy syncing code which ends up in https://github.com/hashicorp/consul/blob/v1.0.4/agent/structs/structs.go#L514, which doesn't consider those things equal.By auditing for code like this and properly copying these objects before modifying, we can avoid this corruption, which is what this PR does. By fixing it at what we think is the root cause, I think we can close #3866, which I think was working around the issue.