connect: agent leaf cert caching improvements #5091
// been other Fetch attempts that resulted in an error in the mean time. These
// are not explicitly represented currently. We could add that if needed this
// was just simpler for now.
LastResult *FetchResult
I would make it clear in the doc comment here: this is a pointer... can I modify any fields? Should I be very careful to definitely NOT modify any fields? etc.
Good call, yeah it should be treated as read only.
Actually, this comment was important - it turns out that even though the FetchResult is a new struct each time, because it's using interface{} for Value and State, it's actually really hard to avoid modifying the state directly in the cache entry during a Fetch!
While that isn't thread-unsafe because we guarantee only one Fetch per entry concurrently, it is really error prone since you might update the state and then hit an error and/or not update the index that goes with it which is super unintuitive.
Even if you return a struct or other value type and not a pointer as the state, if it contains any pointers things go bad.
So I'll update this wording here, but also the CA implementation in this PR currently does use a pointer which means we inadvertently update the cache entry directly.
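To make the aliasing problem above concrete, here is a minimal Go sketch. The FetchResult type is a simplified stand-in with only the fields named in this thread, and leafState and both fetch functions are hypothetical names invented for illustration; this is not the code from the PR.

```go
package main

import "fmt"

// FetchResult is a simplified stand-in for the cache's result type; only the
// fields discussed in this thread are modelled.
type FetchResult struct {
	Value interface{}
	State interface{}
	Index uint64
}

// leafState is a hypothetical per-entry state used only for illustration.
type leafState struct {
	Failures int
}

// fetchWithPointerState mutates the state it was handed and then fails.
// Because State holds a pointer, the cache entry has already changed by the
// time the error is returned.
func fetchWithPointerState(last *FetchResult) (FetchResult, error) {
	st := last.State.(*leafState) // same pointer the cache entry holds
	st.Failures++
	return FetchResult{}, fmt.Errorf("CSR RPC failed")
}

// fetchWithValueState works on a copy, so a failed Fetch leaves the cache
// entry exactly as it was.
func fetchWithValueState(last *FetchResult) (FetchResult, error) {
	st := last.State.(leafState) // asserting to a value type yields a copy
	st.Failures++
	return FetchResult{}, fmt.Errorf("CSR RPC failed")
}

func main() {
	ptrEntry := &FetchResult{State: &leafState{}}
	_, _ = fetchWithPointerState(ptrEntry)
	fmt.Println("pointer state after failed fetch:", ptrEntry.State.(*leafState).Failures)

	valEntry := &FetchResult{State: leafState{}}
	_, _ = fetchWithValueState(valEntry)
	fmt.Println("value state after failed fetch:", valEntry.State.(leafState).Failures)
}
```

The value-type variant only protects the entry while the state struct contains no pointers, maps, or slices; any of those would still be shared with the cache entry, as noted above.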
There is another bug I think this fixes from the old implementation (although I realise I have not added an explicit test for this and should): If a root rotation triggers a renewal but the CSR RPC errors, we will NOT clear the Previously if the CSR RPC failed, the current blocking query would exit but then a subsequent fetch would not notice that its cert needed renewing until it hard expired.
Test failure on this looks legit in CI although it passed locally:
The remaining travis failure is legit too. I think TestSanitize just needs to get one of the new configuration items added into the expected JSON ("ConnectTestCALeafRootChangeSpread") as it's being defaulted to "0s"
This is great. Just the one test fix needed. I especially liked all the very in-depth explanations in comments.
Certainly doesn't fix #4463 and that is out of scope - it needs new tables in state store in CA provider etc while this is an agent-only change. #4462 I think is not fixed either because it's due to a deeper issue in the way we do background-refresh in the cache. I have other issues to look at in near future related to cache (and potentially refactoring/removing how background refresh works) that I think will be a better place to look into that.
…pes can safely store additional data that is eventually expired.
…sive testing for the Root change jitter across blocking requests, test concurrent fetches for different leaves interact nicely with rootsWatcher.
…state to use a non-pointer state.
… are deterministic again!
@mkeeler all green!
@banks merge away
This PR contains a re-written CA leaf cache implementation that fixes several issues with the previous one. It also contains a small change to the agent/cache package that allows the new implementation to be cleaner. It's one PR to minimise sprawl of dependent PRs/branches and because the diffs are not too huge.
This is the first PR in a series related to agent caching in general. Others planned will fix several other known bugs and implement more complete rate limiting for CSR requests.
Cache State
One problem with the current leaf cache is that it needs to know the current cert so it can manage expiry correctly. This was implemented by it holding its own cache of the actual certs, from which it returned pointers that are stored in the agent/cache.
This has the following issues:
The above issues are solved by simply passing the current cache value, if any, into each Fetch call. In addition to this though, cache types may need to store additional state that is not part of the cache result but can be used to maintain correct behaviour between calls. This is necessary for the other feature added below. To support this, an opaque State is added to each cache entry.
Now the cache type can delegate all storage to the cache implementation either in the Result if it's just the last result that's needed, or using the new State field for anything else. Both of these are cleaned up by the cache TTL and any future improvements we make to bound agent cache size.
This change is made in isolation to the cache package in 718cb2c
Updated Leaf Implementation
This solves several issues with the current CA leaf cache:
It also has a few other improvements:
Notify mechanism that was added since the original implementation.
Misc Extras
A lot of the tests needed to be changed to generate more accurate mock data like certs that actually have the right info in PEM since we now depend on that.
There are a couple of technically unnecessary tweaks in here that I made as I went, like populating a bunch more of the TestCA fields correctly and making sure the test serial numbers are actually representable in our CARoot struct, which only has a uint64 for the serial number field.
Fixes #4479