
connect: agent leaf cert caching improvements #5091

Merged
merged 7 commits into from
Jan 10, 2019
Conversation


@banks banks commented Dec 13, 2018

This PR contains a re-written CA leaf cache implementation that fixes several issues with the previous one. It also contains a small change to the agent/cache package that allows the new implementation to be cleaner. It's one PR to minimise sprawl of dependent PRs/branches and because the diffs are not too huge.

This is the first PR in a series related to agent caching in general. Others planned will fix several other known bugs and implement more complete rate limiting for CSR requests.

Cache State

One problem with the current leaf cache is that it needs to know the current cert so it can manage expiry correctly. This was implemented by it holding its own cache of the actual certs, from which it returned pointers that are stored in the agent/cache.

This has the following issues:

  • The internal cert cache had to become subtly aware of the outer cache behaviour. For example, we didn't initially shard by token, which would have leaked certs once revocation-by-token is available. This was fixed in a28e4a3, but it's an example of why the pattern was problematic.
  • Arguably worse, the internal cache had no link to the outer cache's TTL mechanism, which means certificates would never be freed from memory. Typically certs for the same service would replace existing ones, but that assumes the same token for the lifetime. In the following realistic cases we would "leak" the cert (and key) in memory until agent restart:
    • a service stops on an agent - its cert is left in memory. For high-churn or dynamically named services this could grow significantly
    • a service remains consistently named but rotates its token every day. After 100 days there are now 100 cache entries that will never be freed.

The above issues are solved by simply passing the current cache value, if any, into each Fetch call. In addition, cache types may need to store additional state that is not part of the cache result but is used to maintain correct behaviour between calls; this is necessary for the other feature added below. To support this, an opaque State is added to each cache entry.

Now the cache type can delegate all storage to the cache implementation either in the Result if it's just the last result that's needed, or using the new State field for anything else. Both of these are cleaned up by the cache TTL and any future improvements we make to bound agent cache size.
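To make the shape of this concrete, here is a minimal sketch of how a cache type can consume the previous result instead of keeping its own private cert cache. The type and field names below are illustrative stand-ins loosely modeled on the agent/cache package, not the exact upstream API.

```go
package main

import "fmt"

// FetchResult is a sketch of a cache result: Value is what callers see,
// State is opaque, type-specific data kept alongside the entry. Both are
// owned by the cache and reaped by its TTL.
type FetchResult struct {
	Value interface{}
	State interface{}
	Index uint64
}

// FetchOptions carries the previous result (if any) into the next Fetch,
// so the cache type no longer needs internal storage that the TTL can't see.
type FetchOptions struct {
	MinIndex   uint64
	LastResult *FetchResult
}

// fetchLeaf sketches a cache type consulting its previous result, e.g. to
// decide whether the existing cert still needs renewing.
func fetchLeaf(opts FetchOptions) FetchResult {
	if opts.LastResult != nil {
		fmt.Println("have previous cert, index", opts.LastResult.Index)
	}
	return FetchResult{Value: "cert-pem", State: "rotation-state", Index: opts.MinIndex + 1}
}

func main() {
	first := fetchLeaf(FetchOptions{})
	second := fetchLeaf(FetchOptions{MinIndex: first.Index, LastResult: &first})
	fmt.Println(second.Index)
}
```

Because all state lives in the entry itself, removing the entry on TTL expiry frees the cert and key too, which is exactly what the old two-cache design could not do.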

This change is made in isolation to the cache package in 718cb2c

Updated Leaf Implementation

This solves several issues with the current CA leaf cache:

  • Cache leaking noted above
  • Connect: Leaf certs are re-created unnecessarily due to race #4479
  • The current implementation will re-generate all certificates on every CA config change even if the signing key didn't change. For example if a new CA is added and made active, every cert will be updated (fine), but then when the old CA is finally removed (no change in active CA) all certs would be updated a second time.

It also has a few other improvements:

  • Some jitter is applied after observing a root change to smear the initial influx of CSRs to the servers over 20 seconds (even if the current blocking query times out before then). This is not sufficient alone to solve thundering herds in large clusters, but it provides a baseline of protection for servers, which can then use rate limiting (mostly built, coming in an upcoming PR) to avoid being overwhelmed. We might make this configurable eventually, but I'd rather not for now: the UX of asking operators to figure out the "right" amount of smearing based on their server size, number of service instances, etc. is kinda gross. This is just a baseline, and rate limiting, which is simpler to reason about, will control for larger deployments.
  • The shared root watcher is a little more efficient while not being significantly more complex to reason about (and solves a few subtle bugs in the old one). Mostly thanks to re-using the Notify mechanism that was added since the original implementation.
  • The logic for when to renew certs before their expiry is much more sophisticated. It used to be hard-coded to an hour before expiry, which was not ideal for a few reasons:
    • we allow total cert lifetime to be as low as 1 hour, but if you use that in practice you get a new cert every 10 mins since certs are always considered "too close to expiring".
    • we don't want lots of services that started together after an outage, or that fell into sync because of a server outage, to remain in sync and hammer the servers all at the same time every few days, so a broad jitter is preferable.

Misc Extras

A lot of the tests needed to be changed to generate more accurate mock data, like certs that actually have the right info in the PEM, since we now depend on that.

There are a couple of technically unnecessary tweaks in here that I made as I went, like populating a bunch more of the TestCA fields correctly and making sure the test serial numbers are actually representable in our CARoot struct, which only has a uint64 for the serial number field.

Fixes #4479

@banks banks requested review from mitchellh and a team December 13, 2018 14:05
@banks banks added this to the 1.4.1 milestone Dec 13, 2018
// been other Fetch attempts that resulted in an error in the mean time. These
// are not explicitly represented currently. We could add that if needed; this
// was just simpler for now.
LastResult *FetchResult
Contributor

I would make it clear in the doc comment here: this is a pointer... can I modify any fields? Should I be very careful to definitely NOT modify any fields? etc.

Member Author

Good call, yeah it should be treated as read only.

Member Author

Actually, this comment was important - it turns out that even though the FetchResult is a new struct each time, because it's using interface{} for Value and State, it's actually really hard to avoid modifying the state directly in the cache entry during a Fetch!

While that isn't thread-unsafe, because we guarantee only one Fetch per entry concurrently, it is really error-prone: you might update the state and then hit an error, and/or not update the index that goes with it, which is super unintuitive.

Even if you return a struct or other value type and not a pointer as the state, if it contains any pointers things go bad.

So I'll update this wording here, but also the CA implementation in this PR currently does use a pointer which means we inadvertently update the cache entry directly.
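The aliasing hazard being described can be shown in a few lines. The field names here (fetchState, forceExpireAfter) are illustrative; the point is that when State is an interface{} holding a pointer, a Fetch that type-asserts and mutates it has already changed the live cache entry, regardless of whether it later commits a new result.

```go
package main

import "fmt"

// fetchState is an illustrative per-entry state struct held behind an
// interface{} in the cache entry.
type fetchState struct{ forceExpireAfter int }

type result struct {
	State interface{}
}

func main() {
	// The cache entry holds a pointer inside an interface{} State.
	entry := result{State: &fetchState{forceExpireAfter: 10}}

	// A Fetch that type-asserts the pointer and mutates it modifies the
	// cache entry in place, even if the Fetch subsequently errors out and
	// never returns a new result to commit.
	s := entry.State.(*fetchState)
	s.forceExpireAfter = 99

	fmt.Println(entry.State.(*fetchState).forceExpireAfter) // prints 99: the entry changed underneath us
}
```

This is why returning a value type doesn't fully help either: if the struct contains any pointers (slices, maps, nested structs by pointer), copies still share the pointed-to data.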


banks commented Dec 13, 2018

There is another bug I think this fixes from the old implementation (although I realise I have not added an explicit test for this and should):

If a root rotation triggers a renewal but the CSR RPC errors, we will NOT clear the state.forceExpireAfter, which means all future Fetch attempts will keep retrying until they get a new cert.

Previously, if the CSR RPC failed, the current blocking query would exit, but a subsequent fetch would not notice that its cert needed renewing until it hard-expired.


banks commented Dec 20, 2018

Test failure on this looks legit in CI although it passed locally: TestAgentConnectCALeafCert_goodNotLocal


mkeeler commented Jan 7, 2019

Without looking at the code at all: should this fix #4462 or #4463?

If not should those fixes be rolled into here as well?


mkeeler commented Jan 7, 2019

The remaining Travis failure is legit too. I think TestSanitize just needs one of the new configuration items added into the expected JSON ("ConnectTestCALeafRootChangeSpread"), as it's being defaulted to "0s".


@mkeeler mkeeler left a comment


This is great. Just the one test fix needed. I especially liked all the very in-depth explanations in comments.


banks commented Jan 7, 2019

Without looking at the code at all: should this fix #4462 or #4463

It certainly doesn't fix #4463, and that is out of scope: it needs new tables in the state store, changes in the CA provider, etc., while this is an agent-only change.

#4462 I think is not fixed either, because it's due to a deeper issue in the way we do background refresh in the cache. I have other issues to look at in the near future related to the cache (and potentially refactoring/removing how background refresh works) that I think will be a better place to look into that.

banks added 7 commits January 7, 2019 21:17
…pes can safely store additional data that is eventually expired.
…sive testing for the Root change jitter across blocking requests, test concurrent fetches for different leaves interact nicely with rootsWatcher.

banks commented Jan 8, 2019

@mkeeler all green!


mkeeler commented Jan 8, 2019

@banks merge away

@banks banks merged commit 0638e09 into master Jan 10, 2019
@banks banks deleted the leaf-cache branch January 10, 2019 12:46
Successfully merging this pull request may close these issues.

Connect: Leaf certs are re-created unnecessarily due to race