[WIP] Prototype autoscaling by object count #1187
Conversation
Force-pushed from a8fcfa4 to 864cb02
Memory usage is not solely determined by the object count. For instance, 1k objects of the same Kubernetes Kind may consume less memory than 100 objects of various Kinds. Additionally, the number of unmanaged cluster objects may also contribute to memory consumption. I feel like it may not be very accurate to have a linear relation between the resource defaults and the object count.
All of that is of course true. But tracking object count is a lot easier than tracking memory and CPU usage. This is partly why I didn't make the steps more granular/smaller, and why I left some headroom above the current test usage. Realistically, autoscaling based on actual usage is problematic for multiple reasons.
Autoscaling based on some other metric, like resources under management, allows us to sidestep a lot of the issues with VPA, because the object count is available before syncing (after parsing), and after syncing the object count in the inventory is easily accessible, even if the count from the source has been reduced. So we can use that to add more resources before a sync and keep the resources high until after the sync and the inventory update. While the changes are less granular and less ideal, this allows much faster scale up and scale down with significantly less disruption, no single point of failure that affects other controllers, and faster sync latency overall.
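Something like this sketch (the status fields mirror the ones this PR adds; `scalingObjectCount` is a made-up helper, not the actual code):

```go
// Sketch: pick the object count that drives autoscaling decisions.
package main

import "fmt"

type SyncStatus struct {
	SourceObjectCount int // .status.source.objectCount, known after parsing, before syncing
	SyncObjectCount   int // .status.sync.objectCount, from the inventory ResourceGroup
}

// scalingObjectCount returns the greater of the source and sync counts, so
// resources scale up before a large sync and stay high until the inventory
// has actually shrunk.
func scalingObjectCount(s SyncStatus) int {
	if s.SourceObjectCount > s.SyncObjectCount {
		return s.SourceObjectCount
	}
	return s.SyncObjectCount
}

func main() {
	s := SyncStatus{SourceObjectCount: 1200, SyncObjectCount: 800}
	fmt.Println(scalingObjectCount(s)) // 1200: scale up before the sync completes
}
```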
Here's a shot of the usage and limits with this prototype while running. You can see that the 256Mi reconciler memory is being rounded up to 416Mi, because of the other containers in the pod and Autopilot rounding up. So I might lower the reconciler default even more, so it only rounds up to 256Mi. However, this will require reducing the CPU even more as well, which might impact sync latency.
Here's a new test called TestStressManyRootSyncs that deploys the same 1,000 Deployments, but uses 10 RootSyncs to do it, 100 each. You can see that with the autoscaling, these RootSyncs each get a 436MiB limit and use 180MiB at most. So with the custom autoscaling, each RootSync reconciler is saving 256MiB+ of memory, about 1/3 of the previous default. It would be more, but Autopilot is rounding up a lot. It looks like we could go even lower, maybe another 256MiB after rounding, at least with only ~1k objects on the cluster. The story might be different with 10k+, but maybe it's ok to require resource overrides on large clusters...
For comparison, here's the same number of Deployments but more reconcilers (20x100). Interestingly, the peak memory use is basically the same as the 10x200 test, even with half the objects per reconciler. It also peaks on the initial deploy, not just the teardown. The drop after the apply is also more pronounced, possibly due to memory pressure triggering Go's garbage collection more aggressively. And you can see that the reconcilers that peaked later peaked higher, probably because their watch cache was being populated by more objects on the cluster, even though all the RootSyncs were created around the same time.
Here's 20x200 (4,000 Deployments total). With this configuration, some of the reconcilers are reaching the new lower 436Mi memory limit, but it still completed. From this we can clearly see that the total number of resources being watched is a significant contributor to memory use: compared to the 10x200 test, almost double the amount of memory is being used by each reconciler. If this trend continues, I expect that adding more Deployments to the cluster will cause the reconcilers to be OOMKilled, even though they're nowhere near the 1k object trigger that bumps up the memory. Edit: Here's another 20x200 run that didn't hit the limits.
At 30x200 the test fails to complete. The reconcilers are repeatedly OOMKilled before they can finish syncing. It seems like maybe autoscaling based purely on the number of objects being managed by the current reconciler isn't going to work on larger clusters (5k+ objects in watched resource types). |
So I figured out how to modify the remediator to filter watches by label, and added a filter that should only watch resources with the correct label. Here's another 20x200 with the new filter:
Turns out, the label selector I used for the previous test was filtering out all objects, but the test still passed because remediation wasn't required for the test to pass. It looks like there aren't any existing labels that can be used for selection, because the only pre-existing label uses the same value for all reconcilers. So instead I added a new label, or actually, copied an existing annotation to also be a label. Now the memory mostly scales with the number of objects being synced, rather than also with the number of objects on the cluster. As you can see, they both use about the same amount of memory for each reconciler: less than 150MiB.
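For reference, filtering a watch by label looks roughly like this with client-go (the label key/value here are illustrative, not the actual remediator code):

```go
// Sketch: a label-filtered watch with client-go.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Only objects carrying this reconciler's label are watched, so the
	// watch cache no longer scales with every object on the cluster.
	w, err := client.AppsV1().Deployments("").Watch(context.Background(), metav1.ListOptions{
		LabelSelector: "config.k8s.io/owning-inventory=example-inventory-id",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Println(event.Type)
	}
}
```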
Crunched some of those numbers (with the watch filtering) and got a pretty flat trend line. These numbers are with a very simple Deployment spec replicated many times, so they may not match actual workloads, which will have both bigger (e.g. CRDs) and smaller (e.g. Services) objects. But it seems like a decent place to start.
Force-pushed from f3a469d to cd8b14b
The new custom autoscaling logic is a bit more complicated. It uses 1Mi for every 10 objects, on top of a 128Mi base, then rounds down to 256Mi increments, with a 256Mi minimum. It calculates the memory first, then derives the CPU from the memory, keeping the 1:1 ratio required by Autopilot. That said, since this is just for the first container, the resources may still be rounded up by Autopilot on all but the latest GKE v1.29+ clusters (created no earlier than v1.26). For now, the requests & limits are the same, but we could change this in the future for new clusters, with a cluster version check. However, I'm not sure how to check that the cluster is using cgroups v2.
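Roughly, the math looks like this (a sketch with assumed names, not the actual reconciler-manager code):

```go
// Sketch of the autoscaling math described above.
package main

import "fmt"

const (
	baseMiB = 128 // fixed base overhead per reconciler
	stepMiB = 256 // round down to 256Mi increments
	minMiB  = 256 // minimum memory
)

// reconcilerMemoryMiB: 1Mi per 10 objects on top of a 128Mi base, rounded
// down to 256Mi increments, with a 256Mi minimum.
func reconcilerMemoryMiB(objectCount int) int {
	mem := baseMiB + objectCount/10
	mem = (mem / stepMiB) * stepMiB
	if mem < minMiB {
		mem = minMiB
	}
	return mem
}

// reconcilerCPUMilli derives CPU from memory at roughly 1 vCPU per 1GiB
// (e.g. 256Mi -> 250m), matching the ratio Autopilot requires.
func reconcilerCPUMilli(memMiB int) int {
	return memMiB * 1000 / 1024
}

func main() {
	for _, n := range []int{100, 1000, 2000, 5000} {
		mem := reconcilerMemoryMiB(n)
		fmt.Printf("%d objects -> %dMi memory, %dm CPU\n", n, mem, reconcilerCPUMilli(mem))
	}
}
```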
Looks like I missed a critical constraint. Apparently GKE Autopilot has a minimum memory per Pod of 512 MiB, for older clusters that don't support bursting. So I won't be able to get it down to 256MiB unless it's a new cluster on rapid channel. |
Force-pushed from 5265978 to 7c937ca
Interesting data point here. With the watch filtering and custom autoscaling, a 672MiB limit still isn't enough to prevent OOMKills. One difference between TestStressManyDeployments and TestStressManyRootSyncs is that TestStressManyDeployments uses a single RootSync. The OOMKill is pretty obvious from the usage drop, but the limit drop is surprising. I think it's a bug. I think I also found a related bug that was causing the limit drop.
Ahh. ok. design problem...
Correction: the actual problem is that the periodic sync status update often updates the status before any inventory events have been received. Normally this is fine, because the inventory size is cached in the applier. But if the reconciler was OOMKilled, it would have restarted and cleared the cache. So the next time it tries to apply, it would wipe out the previously synced commit and object count from the status. I think that means we either need to: A) skip sync status updates until the cache has been repopulated, or B) initialize the cache from the persisted RSync status on startup. A is slightly more efficient, but more complicated to coordinate. B is simpler to implement.
It required a lot of changes, but I managed to get the sync status to persist between reconciler restarts (except errors). To get it to work I refactored the Applier and Updater so the Updater is the one caching sync status (errors, object count, commit). I added another event handling layer so the Applier could punt these to the Updater without needing to keep its own record of errors or object count. Then I added an initialization stage to the Updater so that when it runs the first time it pulls the current status from the RSync. Unfortunately, errors can't be parsed after being formatted, so we can't keep those, but that's mostly ok because the errors were already being invalidated before each Applier.Apply attempt anyway. When I got that working, I found that my test was still being OOMKilled, even though the memory usage was always below the limit. So I had to increase the memory per object to make sure the node had enough memory. The new memory calculation doubles the memory per object; the CPU calculation is the same as before, derived from the memory at the same ratio.
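The caching pattern, sketched with simplified made-up types (not the actual Applier/Updater code):

```go
// Sketch: the Applier reports events to a handler, the Updater caches the
// sync status behind a lock, and on first run the Updater seeds the cache
// from the persisted RSync status so a restart doesn't wipe the previously
// synced commit and object count.
package main

import (
	"fmt"
	"sync"
)

// SyncStatus is the subset of RSync status the Updater caches.
type SyncStatus struct {
	Commit      string
	ObjectCount int
}

// ApplierEvent is what the Applier punts to the Updater instead of keeping
// its own record of object counts.
type ApplierEvent struct {
	Commit        string
	InventorySize int
}

type Updater struct {
	mux         sync.RWMutex
	initialized bool
	status      SyncStatus
}

// Init seeds the cache from the persisted RSync status. Errors are excluded
// because they can't be parsed back into typed errors.
func (u *Updater) Init(persisted SyncStatus) {
	u.mux.Lock()
	defer u.mux.Unlock()
	if !u.initialized {
		u.status = persisted
		u.initialized = true
	}
}

// HandleEvent updates the cached status from Applier events.
func (u *Updater) HandleEvent(e ApplierEvent) {
	u.mux.Lock()
	defer u.mux.Unlock()
	u.status = SyncStatus{Commit: e.Commit, ObjectCount: e.InventorySize}
}

// Status returns a copy of the cached status for the periodic status update,
// valid even before any inventory events have been received.
func (u *Updater) Status() SyncStatus {
	u.mux.RLock()
	defer u.mux.RUnlock()
	return u.status
}

func main() {
	u := &Updater{}
	u.Init(SyncStatus{Commit: "abc123", ObjectCount: 1000}) // from RSync status
	fmt.Printf("%+v\n", u.Status())                         // survives a restart
}
```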
So I modified one of the tests to deploy incrementally more Deployments, pruning them all after each step. The test took 2.1 hrs, so I probably won't keep it running in the e2e tests, but I might leave it in as skipped, to run on-demand. This is on GKE Autopilot v1.27.11-gke.1062000 (like the rest), so no bursting.
Force-pushed from 59b3b4b to eaeb1bb
Force-pushed from 15243c1 to b404034
- Add .status.source.objectCount to track the number of objects found after parsing from source (after hydration, if any).
- Add .status.sync.objectCount to track the number of objects in the inventory ResourceGroup's .spec.resources.
- Add resource doubling for the reconciler container when the object count (greater of sync & source) is at least 1,000 (autopilot only).
- Reduce the default resources for the reconciler on autopilot.

Notes:
- Since the applier doesn't send the inventory size over events, we have to query the ResourceGroup after each inventory update event. This could be made more efficient with changes to the applier in cli-utils, but isn't really a big deal.
- Autoscaling is currently very trivial and could be fine-tuned, but this is just a proof of concept to show an MVP and test that it would work in our tests.

Filter remediator watch by labels

Add watch label selectors
- Add config.k8s.io/owning-inventory as a label to all managed objects.
- Add label selectors to the watchers used by the remediator and applier.
- Modify cli-utils to allow optional label selector injection.

Add more complex autoscaling math

Revert metrics expectation changes

Fix e2e test with hard-coded label expectations

WIP resolve test issues
- Update TestConflictingDefinitions_NamespaceToRoot to expect a management conflict from the remediator, and for the remediator's conflict to replace the applier's conflict.
- De-dupe management conflict errors from remediator & applier.
- Fix a bug in the remediator that was dropping management conflict errors when using a watch filter. With the filter, changing the filter labels causes a DELETE event instead of an UPDATE event, like it did before. The only way to double-check is to query the cluster, and we don't want to do that in the watch handler for performance reasons. So instead, update the remediator queue worker to query the cluster to confirm object deletion. Then also update the worker to be able to add management conflict errors to the same conflictHandler used by the watcher.
- Add HasConflictErrors() to the conflictHandler to replace the locked managementConflict bool in the watch.Manager. This will cause the retry run loop to re-apply if there's still a conflict. Is this desired behavior? Or do we only want to re-apply the first time? If so, we may need to cache the errors in the run loop so we can detect when new conflicts are added. Or maybe just expose the whole conflictHandler.
- Clean up parse.Run a little. Add a resetRetryBackoff method to avoid leaking implementation details to the rest of the parser.
- Remove the obsolete retry setting, since it wasn't exposed and was previously replaced with a backoff strategy.
- Fix the reconciler container CPU calculation to use milli-CPU.
- Add ConflictingObjectID to ManagementConflictError to allow de-duping by ID.
- Fix KptManagementConflictError to return a ManagementConflictError, not just a ResourceError. The currentManager will be UNKNOWN, but it only seems to be used by reportRootSyncConflicts, which only reports conflicts from the remediator, not the applier.

Reduce memory/cpu step size to 128Mi:125m

Reduce memory step size to 32Mi
- This allows scaling down to 256Mi total reconciler Pod memory, not just 512Mi. However, on non-bursting autopilot clusters, the minimum is still 512Mi.

Avoid invalid periodic sync status updates
- Skip sync status updates when the source cache is reset, otherwise the commit that was previously synced will be removed.
Refactor Applier to expose an event handler
- Use the new event handler to send errors and inventory updates to the caller (Updater) so it can track them without locks in the applier.
- Update the Updater to cache the sync status with a shared lock.
- Add initialization to the Updater to refresh the sync status using the RootSync/RepoSync status. This should avoid losing status fields after the reconciler is restarted or rescheduled. Unfortunately, this doesn't work for errors, because they can't be parsed back into typed errors. But this shouldn't be a real problem because the applier was invalidating them before each apply anyway.
- Move periodic sync status updates into the Updater.
- Add a baseFinalizer class to handle collecting destroyer errors.

Double memory per object
- Even though the memory limit seems higher than usage, the reconciler is still being killed by the kernel because the node is out of memory on autopilot. So we need to be a little more defensive about requesting more memory than we need, in order to avoid OOMKills on crowded nodes.

Add TestStressProfileResourcesByObjectCount

Fix e2e tests that revert manual changes
- In order for drift correction to work now, the object needs the correct owning-inventory label used by the watch filter. Without the label, drift will not be reverted.

Add TestStressProfileResourcesByObjectCountWithMultiSync

Fix waiting for deployments

Use configsync.gke.io/parent-package-id as filter
- Filter label key: configsync.gke.io/parent-package-id
- PACKAGE_ID: <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>
- PACKAGE_ID_FULL: <NAME>.<NAMESPACE>.<RootSync|RepoSync>
- Pre-pad the hex of the hash to 8 characters for consistent length

Add configsync.gke.io/package-id label to RSyncs
- Use the reconciler-manager field manager when updating objects, to distinguish from the reconciler. Otherwise SSA from a parent reconciler managing an RSync will delete any metadata updates made by the reconciler-manager.

Fix rebase errors

Fix labels in TestKubectl tests

Use latest cli-utils watch filters
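For illustration, the package-ID derivation described above could look like this in Go (a sketch of the scheme in the commit message; the function name is made up):

```go
// Sketch of the parent-package-id label value derivation.
package main

import (
	"fmt"
	"hash/fnv"
)

// packageID builds the filter label value: PACKAGE_ID_FULL truncated to 53
// characters, then "-", then the fnv32a hash of the full ID as hex,
// pre-padded to 8 characters. Max length is 53+1+8 = 62, which stays within
// the 63-character label value limit.
func packageID(name, namespace, kind string) string {
	full := fmt.Sprintf("%s.%s.%s", name, namespace, kind) // PACKAGE_ID_FULL
	h := fnv.New32a()
	h.Write([]byte(full))
	prefix := full
	if len(prefix) > 53 {
		prefix = prefix[:53]
	}
	return fmt.Sprintf("%s-%08x", prefix, h.Sum32())
}

func main() {
	fmt.Println(packageID("root-sync", "config-management-system", "RootSync"))
}
```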
Force-pushed from b404034 to 3b80a65