
Fix some data races #10396

Merged: 8 commits from dnephin/fix-more-data-races into main, Jul 16, 2021
Conversation

dnephin (Contributor) commented Jun 14, 2021:

Related to #8329
Fixes #9458

These commits are extracted from #10341. There's still a lot more work required to properly solve #9457 (ex: hashicorp/memberlist#238), so I'd like to get some of the data race fixes merged to prevent them from rotting, and so that contributors can run the race detector locally with much less noise.

@dnephin added the theme/internal-cleanup and theme/testing labels on Jun 14, 2021
@dnephin requested a review from a team on June 14, 2021
@github-actions bot added the pr/dependencies and theme/internals labels and removed the theme/testing label on Jun 14, 2021
```go
ID:          "node-wr",
Name:        "node-wr",
Description: "node-wr",
Rules:       `node_prefix "" { policy = "write"}`,
Syntax:      acl.SyntaxCurrent,
Datacenters: []string{"dc1"},
RaftIndex:   structs.RaftIndex{CreateIndex: 1, ModifyIndex: 2},
}, nil
}
p.SetHash(false)
```
dnephin (Contributor, Author) commented:

I'm wondering about this fix. Why does the ACL resolver need to call SetHash? Shouldn't that be done when the policy is created, rather than when it is read?

Contributor replied:

We do set the hash on creation in the PolicySet endpoint: https://github.com/hashicorp/consul/blob/main/agent/consul/acl_endpoint.go#L1204

testPolicyForID is used by a test resolver delegate, so I think it's necessary to do it on reads here because there is no write beyond the policy existing in this function. It resolves these hard coded policies as if they had been created by some other process. (But maybe I'm misunderstanding your question)

dnephin (Contributor, Author) commented Jul 15, 2021:

Thanks! I think that suggests that this fix is reasonable (because it better matches a production flow), which is reassuring.

I think my broader question is about the call to SetHash that caused the data race: the call in resolveWithCache (here).

That seems like a very strange place to be calling SetHash. From my reading of the code, this method is called in read flows, which should already have a hash set from when the policy was written. And if not, why would SetHash be called here and not in a more core place (like the state store or RPC handler)? My thinking is that if this SetHash call is misplaced, it has the potential to cause or hide bugs, either now or in the future. I guess maybe it was done defensively because we are about to read the hash, but sometimes this kind of defense can actually introduce other problems.

I think answering that question isn't a blocker for this PR, but it may be worth exploring further to see if we can rationalize where SetHash is called.

Contributor replied:

Yeah, that's a good point about resolveWithCache. I'm not sure why it's done there. At first I thought Hash could have been a new field, where policies before some version don't have it, but the field has existed since the New ACLs PR.

It also doesn't seem to be used for ACL compatibility, since the function to extract policies from embedded tokens sets the hash itself, with `policy.SetHash(true)`. The same is true of the other synthetic policy extractors, which each call `policy.SetHash(true)` as well.

My sense is that it was there for tests like this one that don't set the hash in advance. I don't see a non-test use for it.

Agreed that it doesn't need to be handled in this PR.

@dnephin added the pr/no-changelog label on Jun 14, 2021
@dnephin force-pushed the dnephin/fix-more-data-races branch from dbf327a to 71719d2 on July 5, 2021
@dnephin force-pushed the dnephin/fix-more-data-races branch from 71719d2 to 3e9b693 on July 5, 2021
Some global variables are patched to shorter values in these tests, but the goroutines that read them can outlive the test because nothing waited for them to exit.

This commit adds a Wait() method to the routine manager so that tests can wait for the goroutines to exit. This prevents the data race, because the 'reset to original value' can happen after all other goroutines have stopped.
The LogOutput io.Writer used by TestAgent must allow concurrent reads and writes, and a
bytes.Buffer does not allow this. The bytes.Buffer must be wrapped with a lock to make this safe.
The dnsConfig pulled from the atomic.Value is a pointer, so modifying it in place
creates a data race. Use the exported ReloadConfig interface instead.
The test was modifying a pointer to a struct that had been passed to another goroutine. Instead, create a new struct to modify:

```
WARNING: DATA RACE
Write at 0x00c01407c3c0 by goroutine 832:
  github.com/hashicorp/consul/agent.TestServiceManager_PersistService_API()
      /home/daniel/pers/code/consul/agent/service_manager_test.go:446 +0x1d86
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

Previous read at 0x00c01407c3c0 by goroutine 938:
  reflect.typedmemmove()
      /usr/lib/go/src/runtime/mbarrier.go:177 +0x0
  reflect.Value.Set()
      /usr/lib/go/src/reflect/value.go:1569 +0x13b
  github.com/mitchellh/copystructure.(*walker).Primitive()
      /home/daniel/go/pkg/mod/github.com/mitchellh/copystructure@v1.0.0/copystructure.go:289 +0x190
  github.com/mitchellh/reflectwalk.walkPrimitive()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:252 +0x31b
  github.com/mitchellh/reflectwalk.walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:179 +0x24d
  github.com/mitchellh/reflectwalk.walkStruct()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:386 +0x4ec
  github.com/mitchellh/reflectwalk.walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:188 +0x656
  github.com/mitchellh/reflectwalk.walkStruct()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:386 +0x4ec
  github.com/mitchellh/reflectwalk.walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:188 +0x656
  github.com/mitchellh/reflectwalk.Walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/reflectwalk@v1.0.1/reflectwalk.go:92 +0x164
  github.com/mitchellh/copystructure.Config.Copy()
      /home/daniel/go/pkg/mod/github.com/mitchellh/copystructure@v1.0.0/copystructure.go:69 +0xe7
  github.com/mitchellh/copystructure.Copy()
      /home/daniel/go/pkg/mod/github.com/mitchellh/copystructure@v1.0.0/copystructure.go:13 +0x84
  github.com/hashicorp/consul/agent.mergeServiceConfig()
      /home/daniel/pers/code/consul/agent/service_manager.go:362 +0x56
  github.com/hashicorp/consul/agent.(*serviceConfigWatch).handleUpdate()
      /home/daniel/pers/code/consul/agent/service_manager.go:279 +0x250
  github.com/hashicorp/consul/agent.(*serviceConfigWatch).runWatch()
      /home/daniel/pers/code/consul/agent/service_manager.go:246 +0x2d4

Goroutine 832 (running) created at:
  testing.(*T).Run()
      /usr/lib/go/src/testing/testing.go:1238 +0x5d7
  testing.runTests.func1()
      /usr/lib/go/src/testing/testing.go:1511 +0xa6
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
  testing.runTests()
      /usr/lib/go/src/testing/testing.go:1509 +0x612
  testing.(*M).Run()
      /usr/lib/go/src/testing/testing.go:1417 +0x3b3
  main.main()
      _testmain.go:1181 +0x236

Goroutine 938 (running) created at:
  github.com/hashicorp/consul/agent.(*serviceConfigWatch).start()
      /home/daniel/pers/code/consul/agent/service_manager.go:223 +0x4e4
  github.com/hashicorp/consul/agent.(*ServiceManager).AddService()
      /home/daniel/pers/code/consul/agent/service_manager.go:98 +0x344
  github.com/hashicorp/consul/agent.(*Agent).addServiceLocked()
      /home/daniel/pers/code/consul/agent/agent.go:1942 +0x2e4
  github.com/hashicorp/consul/agent.(*Agent).AddService()
      /home/daniel/pers/code/consul/agent/agent.go:1929 +0x337
  github.com/hashicorp/consul/agent.TestServiceManager_PersistService_API()
      /home/daniel/pers/code/consul/agent/service_manager_test.go:400 +0x17c4
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

```
To pick up data race fixes.

By setting the hash when we create the policy:

```
WARNING: DATA RACE
Read at 0x00c0028b4b10 by goroutine 1182:
  github.com/hashicorp/consul/agent/structs.(*ACLPolicy).SetHash()
      /home/daniel/pers/code/consul/agent/structs/acl.go:701 +0x40d
  github.com/hashicorp/consul/agent/structs.ACLPolicies.resolveWithCache()
      /home/daniel/pers/code/consul/agent/structs/acl.go:779 +0xfe
  github.com/hashicorp/consul/agent/structs.ACLPolicies.Compile()
      /home/daniel/pers/code/consul/agent/structs/acl.go:809 +0xf1
  github.com/hashicorp/consul/agent/consul.(*ACLResolver).ResolveTokenToIdentityAndAuthorizer()
      /home/daniel/pers/code/consul/agent/consul/acl.go:1226 +0x6ef
  github.com/hashicorp/consul/agent/consul.resolveTokenAsync()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:66 +0x5c

Previous write at 0x00c0028b4b10 by goroutine 1509:
  github.com/hashicorp/consul/agent/structs.(*ACLPolicy).SetHash()
      /home/daniel/pers/code/consul/agent/structs/acl.go:730 +0x3a8
  github.com/hashicorp/consul/agent/structs.ACLPolicies.resolveWithCache()
      /home/daniel/pers/code/consul/agent/structs/acl.go:779 +0xfe
  github.com/hashicorp/consul/agent/structs.ACLPolicies.Compile()
      /home/daniel/pers/code/consul/agent/structs/acl.go:809 +0xf1
  github.com/hashicorp/consul/agent/consul.(*ACLResolver).ResolveTokenToIdentityAndAuthorizer()
      /home/daniel/pers/code/consul/agent/consul/acl.go:1226 +0x6ef
  github.com/hashicorp/consul/agent/consul.resolveTokenAsync()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:66 +0x5c

Goroutine 1182 (running) created at:
  github.com/hashicorp/consul/agent/consul.TestACLResolver_Client.func4()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:1669 +0x459
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

Goroutine 1509 (running) created at:
  github.com/hashicorp/consul/agent/consul.TestACLResolver_Client.func4()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:1668 +0x415
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
```
@dnephin force-pushed the dnephin/fix-more-data-races branch from 3e9b693 to fa47c04 on July 14, 2021
```diff
@@ -319,14 +319,16 @@ func testPolicyForID(policyID string) (bool, *structs.ACLPolicy, error) {
 		RaftIndex: structs.RaftIndex{CreateIndex: 1, ModifyIndex: 2},
 	}, nil
 case "acl-wr":
-	return true, &structs.ACLPolicy{
+	p := &structs.ACLPolicy{
```
Contributor commented:

How come this change was only needed for a subset of the cases?

dnephin (Contributor, Author) commented Jul 15, 2021:

I believe it is because only the "Concurrent-Token-Resolve" test case has a race on SetHash, and only these two policies are used by that test case.

I can add SetHash to the other cases as well, if we think that would better imitate production.

Contributor replied:

I think adding it to all of them would be nice for that reason. Would also avoid that confusion when someone looks at this again later.

dnephin (Contributor, Author) replied:

Sounds good! Added those in the latest commit.

dnephin (Contributor, Author) left a comment:

Thank you for reviewing this PR! Those are excellent questions; I've replied inline.


freddygv (Contributor) left a review:

I have one request (about calling SetHash for all of the test policies), but will approve the PR anyway.

A previous commit used SetHash on two of the cases to fix a data race. This commit applies
that change to all cases. Using SetHash in this test helper should ensure that the
test helper behaves closer to production.
dnephin (Contributor, Author) commented Jul 16, 2021:

The one test failure (TestConnectCAConfig_GetSetForceNoCrossSigning) is a known flake and passed locally, so I'm going to merge.

@dnephin merged commit 499250c into main on Jul 16, 2021
@dnephin deleted the dnephin/fix-more-data-races branch on July 16, 2021
hc-github-team-consul-core (Collaborator) commented:

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/412138.

Successfully merging this pull request may close these issues.

data race: TestLeader_Vault_PrimaryCA_IntermediateRenew modifies a global without a lock
3 participants