
connect: agent leaf cert caching improvements #5091

Merged
merged 7 commits into from
Jan 10, 2019
Conversation


@banks banks commented Dec 13, 2018

This PR contains a re-written CA leaf cache implementation that fixes several issues with the previous one. It also contains a small change to the agent/cache package that allows the new implementation to be cleaner. It's one PR to minimise sprawl of dependent PRs/branches and because the diffs are not too huge.

This is the first PR in a series related to agent caching in general. Others planned will fix several other known bugs and implement more complete rate limiting for CSR requests.

Cache State

One problem with the current leaf cache is that it needs to know the current cert so it can manage expiry correctly. This was implemented by it holding its own cache of the actual certs, from which it returned pointers that are stored in the agent/cache.

This has the following issues:

  • The internal cert cache had to become subtly aware of the outer cache behaviour. For example, we didn't initially shard by token, which would have leaked certs once revocation-by-token is available. This was fixed in a28e4a3, but it's an example of why the pattern was problematic.
  • Arguably worse, the internal cache had no link to the outer cache's TTL mechanism, which means certificates would never be freed from memory. Typically certs for the same service would replace existing ones, but that assumes the same token for the lifetime. In the following realistic cases we would "leak" the cert (and key) in memory until agent restart:
    • a service stops on an agent - its cert is left in memory. For high-churn or dynamically named services this could grow significantly
    • a service remains consistently named but rotates its token every day. After 100 days there are now 100 cache entries that will never be freed.

The above issues are solved by simply passing the current cache value, if any, into each Fetch call. In addition, cache types may need to store additional state that is not part of the cache result but is used to maintain correct behaviour between calls; this is necessary for the other feature added below. To support this, an opaque State is added to each cache entry.

Now the cache type can delegate all storage to the cache implementation either in the Result if it's just the last result that's needed, or using the new State field for anything else. Both of these are cleaned up by the cache TTL and any future improvements we make to bound agent cache size.
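To make the shape of this concrete, here is a minimal sketch of how a cache type can consume the previous result instead of keeping its own private cert cache. The type and field names below are illustrative stand-ins loosely modeled on the agent/cache package, not the exact upstream API.

```go
package main

import "fmt"

// FetchResult is a sketch of a cache result: Value is what callers see,
// State is opaque, type-specific data kept alongside the entry. Both are
// owned by the cache and reaped by its TTL.
type FetchResult struct {
	Value interface{}
	State interface{}
	Index uint64
}

// FetchOptions carries the previous result (if any) into the next Fetch,
// so the cache type no longer needs internal storage that the TTL can't see.
type FetchOptions struct {
	MinIndex   uint64
	LastResult *FetchResult
}

// fetchLeaf sketches a cache type consulting its previous result, e.g. to
// decide whether the existing cert still needs renewing.
func fetchLeaf(opts FetchOptions) FetchResult {
	if opts.LastResult != nil {
		fmt.Println("have previous cert, index", opts.LastResult.Index)
	}
	return FetchResult{Value: "cert-pem", State: "rotation-state", Index: opts.MinIndex + 1}
}

func main() {
	first := fetchLeaf(FetchOptions{})
	second := fetchLeaf(FetchOptions{MinIndex: first.Index, LastResult: &first})
	fmt.Println(second.Index)
}
```

Because all state lives in the entry itself, removing the entry on TTL expiry frees the cert and key too, which is exactly what the old two-cache design could not do.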

This change is made in isolation to the cache package in 718cb2c

Updated Leaf Implementation

This solves several issues with the current CA leaf cache:

  • Cache leaking noted above
  • Connect: Leaf certs are re-created unnecessarily due to race #4479
  • The current implementation will re-generate all certificates on every CA config change even if the signing key didn't change. For example if a new CA is added and made active, every cert will be updated (fine), but then when the old CA is finally removed (no change in active CA) all certs would be updated a second time.

It also has a few other improvements:

  • Some jitter is applied after observing a root change to smear the initial influx of CSRs to the servers over 20 seconds (even if the current blocking query times out before then). This is not sufficient alone to solve thundering herds in large clusters, but it provides a baseline of protection for servers, which can then use rate limiting (mostly built, coming in an upcoming PR) to avoid being overwhelmed. We might make this configurable eventually, but I'd rather not for now: the UX of asking operators to figure out the "right" amount of smearing based on their server size, number of service instances, etc. is kinda gross. This is just a baseline, and rate limiting, which is simpler to reason about, will control for larger deployments.
  • The shared root watcher is a little more efficient while not being significantly more complex to reason about (and solves a few subtle bugs in the old one). Mostly thanks to re-using the Notify mechanism that was added since the original implementation.
  • The logic for when to renew certs before their expiry is much more sophisticated. It used to be hard-coded to an hour before expiry, which was not ideal for a few reasons:
    • we allow total cert lifetime to be as low as 1 hour, but if you use that in practice you get a new cert every 10 mins since certs are always considered "too close to expiring".
    • we don't want lots of services that started together after an outage, or that fell into sync because of a server outage, to remain in sync and hammer the servers all at the same time every few days, so a broad jitter is preferable.

Misc Extras

A lot of the tests needed to be changed to generate more accurate mock data, like certs that actually have the right info in the PEM, since we now depend on that.

There are a couple of technically unnecessary tweaks in here that I made as I went, like populating a bunch more of the TestCA fields correctly and making sure the test serial numbers are actually representable in our CARoot struct, which only has a uint64 for the serial number field.

Fixes #4479

@banks banks requested review from mitchellh and a team December 13, 2018 14:05
@banks banks added this to the 1.4.1 milestone Dec 13, 2018
// been other Fetch attempts that resulted in an error in the mean time. These
// are not explicitly represented currently. We could add that if needed; this
// was just simpler for now.
LastResult *FetchResult
Contributor

I would make it clear in the doc comment here: this is a pointer... can I modify any fields? Should I be very careful to definitely NOT modify any fields? etc.

Member Author

Good call, yeah it should be treated as read only.

Member Author

Actually, this comment was important - it turns out that even though the FetchResult is a new struct each time, because it's using interface{} for Value and State, it's actually really hard to avoid modifying the state directly in the cache entry during a Fetch!

While that isn't thread-unsafe, because we guarantee only one Fetch per entry concurrently, it is really error-prone: you might update the state and then hit an error, and/or not update the index that goes with it, which is super unintuitive.

Even if you return a struct or other value type and not a pointer as the state, if it contains any pointers things go bad.

So I'll update this wording here, but also the CA implementation in this PR currently does use a pointer which means we inadvertently update the cache entry directly.
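The aliasing hazard being described can be shown in a few lines. The field names here (fetchState, forceExpireAfter) are illustrative; the point is that when State is an interface{} holding a pointer, a Fetch that type-asserts and mutates it has already changed the live cache entry, regardless of whether it later commits a new result.

```go
package main

import "fmt"

// fetchState is an illustrative per-entry state struct held behind an
// interface{} in the cache entry.
type fetchState struct{ forceExpireAfter int }

type result struct {
	State interface{}
}

func main() {
	// The cache entry holds a pointer inside an interface{} State.
	entry := result{State: &fetchState{forceExpireAfter: 10}}

	// A Fetch that type-asserts the pointer and mutates it modifies the
	// cache entry in place, even if the Fetch subsequently errors out and
	// never returns a new result to commit.
	s := entry.State.(*fetchState)
	s.forceExpireAfter = 99

	fmt.Println(entry.State.(*fetchState).forceExpireAfter) // prints 99: the entry changed underneath us
}
```

This is why returning a value type doesn't fully help either: if the struct contains any pointers (slices, maps, nested structs by pointer), copies still share the pointed-to data.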


banks commented Dec 13, 2018

There is another bug I think this fixes from the old implementation (although I realise I have not added an explicit test for this and should):

If a root rotation triggers a renewal but the CSR RPC errors, we will NOT clear the state.forceExpireAfter, which means all future Fetch attempts will keep retrying until they get a new cert.

Previously, if the CSR RPC failed, the current blocking query would exit, but a subsequent fetch would not notice that its cert needed renewing until it hard-expired.


banks commented Dec 20, 2018

Test failure on this looks legit in CI although it passed locally: TestAgentConnectCALeafCert_goodNotLocal


mkeeler commented Jan 7, 2019

Without looking at the code at all: should this fix #4462 or #4463?

If not should those fixes be rolled into here as well?


mkeeler commented Jan 7, 2019

The remaining Travis failure is legit too. I think TestSanitize just needs one of the new configuration items added into the expected JSON ("ConnectTestCALeafRootChangeSpread"), as it's being defaulted to "0s".


@mkeeler mkeeler left a comment


This is great. Just the one test fix needed. I especially liked all the very in-depth explanations in comments.


banks commented Jan 7, 2019

Without looking at the code at all: should this fix #4462 or #4463

It certainly doesn't fix #4463, and that is out of scope: it needs new tables in the state store, changes in the CA provider, etc., while this is an agent-only change.

#4462 I think is not fixed either, because it's due to a deeper issue in the way we do background refresh in the cache. I have other issues to look at in the near future related to the cache (and potentially refactoring/removing how background refresh works) that I think will be a better place to look into that.

banks added 7 commits January 7, 2019 21:17
…pes can safely store additional data that is eventually expired.
…sive testing for the Root change jitter across blocking requests, test concurrent fetches for different leaves interact nicely with rootsWatcher.

banks commented Jan 8, 2019

@mkeeler all green!


mkeeler commented Jan 8, 2019

@banks merge away

@banks banks merged commit 0638e09 into master Jan 10, 2019
@banks banks deleted the leaf-cache branch January 10, 2019 12:46
Successfully merging this pull request may close these issues.

Connect: Leaf certs are re-created unnecessarily due to race