[WIP] Prototype autoscaling by object count #1187

Open

karlkfi wants to merge 1 commit into main from karl-autoscale-count

Conversation

@karlkfi (Contributor) commented Mar 29, 2024

  • Add .status.source.objectCount to track number of objects found after parsing from source (after hydration, if any).
  • Add .status.sync.objectCount to track number of objects in the inventory ResourceGroup's .spec.resources.
  • Add resource doubling for the reconciler container when the object count (greater of sync & source) is at least 1,000 (autopilot only).
  • Reduce the default resources for the reconciler on autopilot

Notes:

  • Since the applier doesn't send the inventory size over events, we have to query the ResourceGroup after each inventory update event. This could be made more efficient with changes to the applier in cli-utils, but isn't really a big deal.
  • Autoscaling is currently very trivial and could be fine-tuned, but this is just a proof of concept to show an MVP and verify that it works in our tests. (A rough sketch of the doubling rule follows.)
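For illustration, here is a minimal Go sketch of the scale-up rule described above. The function and constant names are mine and it operates on plain corev1.ResourceRequirements, not the reconciler-manager's actual types, so treat it as an assumption-laden sketch rather than the PR's code.

```go
// Hypothetical sketch of the prototype's scale-up rule: double the reconciler
// container's resources when the object count (greater of sync & source)
// reaches 1,000. Names and types here are illustrative, not the PR's code.
package prototype

import corev1 "k8s.io/api/core/v1"

const objectCountThreshold = 1000

// adjustReconcilerResources returns doubled requests/limits once the object
// count crosses the threshold; otherwise it returns the defaults unchanged.
func adjustReconcilerResources(base corev1.ResourceRequirements, sourceCount, syncCount int) corev1.ResourceRequirements {
	count := sourceCount
	if syncCount > count {
		count = syncCount
	}
	if count < objectCountThreshold {
		return base
	}
	double := func(in corev1.ResourceList) corev1.ResourceList {
		out := corev1.ResourceList{}
		for name, qty := range in {
			q := qty.DeepCopy()
			q.Add(qty) // q = 2 * qty
			out[name] = q
		}
		return out
	}
	return corev1.ResourceRequirements{
		Requests: double(base.Requests),
		Limits:   double(base.Limits),
	}
}
```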


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi (Contributor Author) commented Mar 29, 2024

Example memory use during TestStressMemoryUsageOCI on GKE Autopilot:
Screenshot from 2024-03-29 12-26-13

@karlkfi karlkfi requested review from sdowell and removed request for haiyanmeng March 29, 2024 19:44
@nan-yu (Contributor) commented Mar 29, 2024

Memory usage is not solely determined by the object count. For instance, 1k objects of the same Kubernetes Kind may consume less memory than 100 objects with various kinds. Additionally, the number of unmanaged cluster objects may also contribute to memory consumption.

I feel like it may not be very accurate to have a linear relation between the resource defaults and the object count.

@karlkfi (Contributor Author) commented Mar 29, 2024

All of that is of course true. But tracking object count is a lot easier than tracking memory and CPU usage. This is partly why I didn't make the steps more granular/smaller and why I left some headroom above the current test usage.

Realistically, autoscaling based on actual usage is problematic for multiple reasons:

  • VPA uses metrics-server, which is a single point of failure: it acts as the backend for an APIService, so API discovery is disrupted (or breaks outright) whenever metrics-server is rescheduled or runs out of memory.
  • We could alternatively use a custom provider that talks to Prometheus and/or Cloud Monitoring, but that requires multiple provider implementations, manually configured RBAC/IAM, and possibly service account management.
  • Metrics are trailing/delayed, so scaling sometimes happens too late to prevent an OOMKill.
  • If you're already at the CPU/memory limit, you can't tell how much the workload wants to use, only that it wants more. So scaling up once may not be enough; you may have to scale up multiple times, causing significant disruption.
  • Every time you change the resources, the reconciler Pod needs to be evicted and rescheduled, which increases sync latency and reduces responsiveness.

Autoscaling based on some other metric, like the number of resources under management, sidesteps a lot of the issues with VPA: the object count is available before syncing (after parsing), and after syncing the object count in the inventory is easily accessible, even if the count from source has since been reduced. So we can add resources before a sync and keep them high until the sync has finished and the inventory has been updated.

While this approach is less granular and less ideal, it allows much faster scale-up and scale-down with significantly less disruption, no single point of failure that affects other controllers, and lower sync latency overall.

@karlkfi (Contributor Author) commented Apr 1, 2024

Here's a shot of the usage and limits with this prototype while running --test-features=reconciliation-2 e2e tests, which don't sync a lot of files.

You can see that the 256Mi reconciler memory is being rounded up to 416Mi, because of the other containers in the pod and autopilot rounding up. So I might lower the reconciler default even more so it only rounds up to 256Mi. However, this will require reducing the CPU even more as well, which might impact sync latency.

Screenshot from 2024-04-01 11-21-19

@karlkfi (Contributor Author) commented Apr 1, 2024

As another datapoint, here is TestStressManyDeployments on Autopilot, which uses ~300MiB memory to deploy 1,000 Deployments.

Screenshot from 2024-04-01 14-32-24

@karlkfi (Contributor Author) commented Apr 2, 2024

Here's a new test called TestStressManyRootSyncs that deploys the same 1,000 Deployments, but uses 10 RootSyncs to do it, 100 each. You can see that with the autoscaling, each RootSync reconciler gets a 436MiB limit and uses at most 180MiB. So with the custom autoscaling, each RootSync reconciler saves 256MiB+ of memory, or about 1/3 of the previous default. It would be more, but Autopilot is rounding up a lot.

It looks like we could go even lower, maybe another 256MiB after rounding, at least with only ~1k objects on the cluster. The story might be different with 10k+, but maybe it's ok to require resource overrides on large clusters...

Screenshot from 2024-04-01 17-28-39

@karlkfi (Contributor Author) commented Apr 2, 2024

Here's the same test with 2k deployments (10x200).

Memory use went up, but not by 100%, more like 70% (180Mi -> 260Mi).

Screenshot from 2024-04-02 12-04-21

@karlkfi (Contributor Author) commented Apr 2, 2024

For comparison, here's the same number of deployments but more reconcilers (20x100).

Interestingly, the peak memory use is basically the same as in the 10x200 test, even with half the objects per reconciler. It also peaks on the initial deploy, not just the teardown. The drop after the apply is also more pronounced, possibly due to memory pressure triggering Go's garbage collection more aggressively. And you can see that the reconcilers that peaked later peaked higher, probably because their watch cache was being populated with more objects on the cluster, even though all the RootSyncs were created around the same time.

Screenshot from 2024-04-02 13-17-50

@karlkfi (Contributor Author) commented Apr 2, 2024

Here's 20x200 (4000 deployments total).

With this configuration, some of the reconcilers are reaching the new, lower 436Mi memory limit, but the test still completed. From this we can clearly see that the total number of resources being watched is a significant contributor to memory use: compared to the 10x200 test, each reconciler uses almost double the memory.

If this trend continues, I expect that adding more Deployments to the cluster will cause the reconcilers to be OOMKilled, even though they're nowhere near the 1k object trigger that bumps up the memory.

Screenshot from 2024-04-02 14-05-02

Edit: Here's another 20x200 run that didn't hit the limits.

Screenshot from 2024-04-02 18-08-45

@karlkfi (Contributor Author) commented Apr 2, 2024

At 30x200 the test fails to complete. The reconcilers are repeatedly OOMKilled before they can finish syncing.

It seems like maybe autoscaling based purely on the number of objects being managed by the current reconciler isn't going to work on larger clusters (5k+ objects in watched resource types).

Screenshot from 2024-04-02 16-34-25

@karlkfi (Contributor Author) commented Apr 3, 2024

So I figured out how to modify the remediator to filter watches by label, and added a filter that should only watch resources with the correct cli-utils.sigs.k8s.io/inventory-id label for that reconciler. However, I also discovered that the remediator doesn't use the Informer library, so it doesn't have an object cache. As a result, filtering the remediator didn't affect the memory usage much.

Here's another 20x200 with the new filter:

Screenshot from 2024-04-02 20-16-42

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Apr 4, 2024
@karlkfi (Contributor Author) commented Apr 4, 2024

Turns out, the label selector I used for the previous test was filtering out all objects, but the test still passed because it didn't depend on remediation.

It looks like there aren't any existing labels that can be used for selection, because the only pre-existing label uses the same value for all reconcilers.

So instead, I added a new label, or rather copied an existing annotation to also be a label: config.k8s.io/owning-inventory. Then I added label selectors so the remediator and applier/destroyer only watch objects whose label value matches the current reconciler. This means we can stop watching objects synced by other reconcilers and mostly remove the memory bloat caused by large numbers of objects on the cluster.

Now memory mostly scales with the number of objects being synced, rather than also with the number of objects on the cluster.
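As a rough illustration of the label-selector filtering (not the actual remediator code: the dynamic-client wiring, function name, and GVR plumbing are assumptions; only the label key comes from the comment above), a filtered watch might look like this:

```go
// Illustrative sketch: restrict a dynamic-client watch to objects whose
// config.k8s.io/owning-inventory label matches this reconciler's inventory ID,
// so objects synced by other reconcilers never enter memory.
package watchfilter

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

const owningInventoryLabel = "config.k8s.io/owning-inventory"

// watchOwnedObjects watches only the objects managed by this reconciler's
// inventory. Passing an empty namespace watches across all namespaces.
func watchOwnedObjects(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, inventoryID string) (watch.Interface, error) {
	selector := labels.Set{owningInventoryLabel: inventoryID}.AsSelector().String()
	return client.Resource(gvr).Namespace(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: selector,
	})
}
```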

Here's 20x200:
Screenshot from 2024-04-03 15-24-17

And here's 30x200:
Screenshot from 2024-04-03 17-29-22

As you can see, they both use about the same amount of memory per reconciler: less than 150MiB.

@karlkfi (Contributor Author) commented Apr 4, 2024

10x500 seems to cap out a little higher at ~170Mi
Screenshot from 2024-04-03 18-09-51

@karlkfi (Contributor Author) commented Apr 4, 2024

10x700 got up to ~200Mi
Screenshot from 2024-04-03 18-43-07

@karlkfi (Contributor Author) commented Apr 4, 2024

I'm having quota problems with larger numbers of nodes, so here's 3x990, keeping it just under the 1,000-object trigger. It obviously doesn't need even half of the limit here, maxing out at ~215MiB.
Screenshot from 2024-04-03 20-01-25

@karlkfi (Contributor Author) commented Apr 4, 2024

2x2000 here. Maxes out at ~275Mi
Screenshot from 2024-04-03 20-29-02

@karlkfi (Contributor Author) commented Apr 4, 2024

1x5000 here. Maxes out at ~525Mi
Screenshot from 2024-04-03 20-59-20

@karlkfi (Contributor Author) commented Apr 10, 2024

Crunched some of those numbers (with the watch filtering) and got a pretty flat trend line:

Screenshot from 2024-04-09 18-05-39

These numbers are with a very simple Deployment spec replicated many times, so it may not match actual workloads, which will have both bigger (e.g. CRDs) and smaller (e.g. Service) objects. But it seems like a decent place to start.

@karlkfi (Contributor Author) commented Apr 18, 2024

New custom autoscaling logic is a bit more complicated.

It uses 1Mi for every 10 objects, on top of a 128Mi base, then rounds down to 256Mi increments, with a 256Mi minimum. It calculates the memory first, then calculates the CPU from the memory, keeping the 1:1 ratio required by autopilot.

That said, since this is just for the first container, the resources may still be rounded up by autopilot on all but the latest GKE v1.29+ clusters (created no earlier than v1.26).

For now, the requests & limits are the same, but we could change this in the future for new clusters, with a cluster version check. However, I'm not sure how to check that the cluster is using cgroups v2.

@karlkfi (Contributor Author) commented Apr 18, 2024

Looks like I missed a critical constraint. Apparently GKE Autopilot has a minimum memory per Pod of 512 MiB, for older clusters that don't support bursting. So I won't be able to get it down to 256MiB unless it's a new cluster on rapid channel.

https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#compute-class-min-max

@karlkfi (Contributor Author) commented Apr 19, 2024

Interesting data point here. With the watch filtering and custom autoscaling, a 672MiB limit still isn't enough to prevent TestStressManyDeployments from failing with 1x4000 Deployments.

One difference between TestStressManyDeployments and TestStressManyRootSyncs is that TestStressManyDeployments uses root-sync to manage both the Namespace and the Deployments, which means the watches are cluster-scoped, not namespace-scoped. But hopefully that doesn't matter with the watch filter.

reconciler-watch-filter-1x4000-oomkill

The OOMKill is pretty obvious from the usage drop, but the limit drop is surprising. I think it's a bug.

I think the status.sync.objectCount is being removed from the RootSync during syncing. This might be due to the periodic sync status updates during syncing, or it might be a status update prior to syncing. If so, it's possible that removing the status.sync.objectCount triggered the reconciler-manager to lower the memory, causing the OOMKill. Needs more investigation.

I think I also found a related bug that was causing the status.sync.commit to be removed after the source cache was reset, because the code that updates status.sync.commit gets the commit from the source cache. I think I fixed that one locally though.

@karlkfi (Contributor Author) commented Apr 19, 2024

Ahh. ok. design problem...

The status.sync.objectCount is being updated before each apply attempt, to match the objects in the new status.sync.commit. But that means the status.sync.objectCount is decremented before the objects have actually been deleted, which triggers a scale-down even though the previous objectCount was large enough to cause a scale-up.

Since updating the status.sync.objectCount when the status.sync.commit is updated makes intuitive sense, I don't really want to change that behavior. Instead, we need another field to record the inventory size until after the applier has succeeded. Maybe just a status.inventoryCount or status.inventoryObjectCount or status.managedObjectCount?

Correction:

The status.source is being updated before each apply attempt, not the status.sync. The status.sync is being updated asynchronously while applying.

The actual problem is that the periodic sync status update often updates the status before any inventory events have been received. Normally this is fine, because the inventory size is cached in the applier. But if the reconciler was OOMKilled, it would have restarted and cleared the cache. So the next time it tries to apply, it would wipe out the status.sync.objectCount, causing down scaling.

I think that means we either need to:
A) delay concurrent sync status updates until after the first inventory event from the applier (slightly delayed adding/updating of the sync status and syncing condition)
B) look up the inventory size before starting an apply attempt (an extra API call)

A is slightly more efficient, but more complicated to coordinate. B is simpler to implement (a rough sketch of B follows).
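Here's a minimal sketch of what option B could look like, reading the inventory ResourceGroup as unstructured to avoid depending on its Go types. The function name and the controller-runtime client wiring are assumptions; only the kpt.dev/v1alpha1 ResourceGroup kind and the .spec.resources field come from this thread.

```go
// Sketch of option B: before each apply attempt, look up the inventory
// ResourceGroup and count .spec.resources so the sync object count survives a
// reconciler restart (at the cost of an extra API call).
package inventorysize

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// inventoryObjectCount returns len(.spec.resources) of the named ResourceGroup.
func inventoryObjectCount(ctx context.Context, c client.Client, name, namespace string) (int, error) {
	rg := &unstructured.Unstructured{}
	rg.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "kpt.dev",
		Version: "v1alpha1",
		Kind:    "ResourceGroup",
	})
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, rg); err != nil {
		return 0, err
	}
	resources, _, err := unstructured.NestedSlice(rg.Object, "spec", "resources")
	if err != nil {
		return 0, err
	}
	return len(resources), nil
}
```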

@karlkfi (Contributor Author) commented Apr 25, 2024

It required a lot of changes, but I managed to get the sync status to persist between reconciler restarts (except errors).

To get it to work I refactored the Applier and Updater so the Updater is the one caching sync status (errors, object count, commit). I added another event handling layer so the Applier could punt these to the Updater without needing to keep its own record of errors or object count. Then I added an initialization stage to the Updater so when it runs the first time it pulls the current status from the RSync.

Unfortunately, errors can't be parsed back after being formatted, so we can't keep those, but that's mostly ok because the errors were already being invalidated before each Applier.Apply attempt anyway.
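To make the split concrete, here's an illustrative-only sketch of that event-handling layer; the interface, type, and field names are mine, not the PR's, and the real Updater tracks more state (commit, error categories) than shown here.

```go
// Illustrative sketch of the Applier -> Updater event-handling split: the
// Applier reports outcomes through a handler, and the Updater owns the cached
// sync status behind a lock.
package updater

import "sync"

// EventHandler receives apply outcomes from the Applier so it doesn't need to
// keep its own record of errors or object count.
type EventHandler interface {
	OnInventoryUpdate(objectCount int)
	OnApplyError(err error)
}

// SyncStatusCache is the Updater-side cache, guarded by a mutex because the
// Applier reports events concurrently with periodic status updates.
type SyncStatusCache struct {
	mu          sync.Mutex
	commit      string
	objectCount int
	errs        []error
}

var _ EventHandler = (*SyncStatusCache)(nil)

func (c *SyncStatusCache) OnInventoryUpdate(objectCount int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objectCount = objectCount
}

func (c *SyncStatusCache) OnApplyError(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, err)
}
```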

When I got that working, I found that my test was still being OOMKilled, even though the memory usage was always below the limit. So I had to increase the memory per object to make sure the node had enough memory.

The new memory calculation is: Max( RoundUpToNearestMultiple( (object count / 5) + MIN_MEM, STEP_SIZE ), MIN_MEM ), with MIN_MEM=128 and STEP_SIZE=32.

The CPU calculation follows the same pattern: Max( Ceil( MEM_LIMIT * CPU_RATIO ), CPU_MIN ), with CPU_MIN=125 and CPU_RATIO=1000/1024. (Both are sketched in code after the examples below.)

So for example:

  • if object count is 0, reconciler memory limit is 128MiB, and cpu limit is 125m
  • if object count is 3000, reconciler memory limit is 736MiB, and cpu limit is 720m
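For concreteness, here's a small Go sketch of that calculation (constant and helper names are mine; the prototype's code may differ). Note that for the 3,000-object example this formula yields 719m rather than the 720m quoted above, so the actual code presumably applies a final rounding step that this sketch omits.

```go
// Sketch of the object-count-based resource calculation described above:
//   mem = max(roundUp(count/5 + MIN_MEM, STEP_SIZE), MIN_MEM)   in MiB
//   cpu = max(ceil(mem * CPU_RATIO), CPU_MIN)                   in milli-CPU
package autoscale

import "math"

const (
	minMemMiB   = 128
	memStepMiB  = 32
	minCPUMilli = 125
	cpuRatio    = 1000.0 / 1024.0 // milli-CPU per MiB
)

// roundUpToMultiple rounds value up to the nearest multiple.
func roundUpToMultiple(value, multiple int) int {
	return ((value + multiple - 1) / multiple) * multiple
}

// reconcilerMemoryMiB returns the reconciler memory limit for an object count.
// Examples: 0 objects -> 128MiB, 3000 objects -> 736MiB.
func reconcilerMemoryMiB(objectCount int) int {
	mem := roundUpToMultiple(objectCount/5+minMemMiB, memStepMiB)
	if mem < minMemMiB {
		mem = minMemMiB
	}
	return mem
}

// reconcilerCPUMilli derives the CPU limit from the memory limit.
// Example: 128MiB -> 125m.
func reconcilerCPUMilli(memMiB int) int {
	cpu := int(math.Ceil(float64(memMiB) * cpuRatio))
	if cpu < minCPUMilli {
		cpu = minCPUMilli
	}
	return cpu
}
```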

@karlkfi (Contributor Author) commented Apr 25, 2024

So I modified one of the tests to deploy incrementally more Deployments, pruning them all after each step.

The test took 2.1 hrs, so I probably won't keep it running in the e2e tests, but I might leave it in as skipped, to run on-demand.

This is on GKE Autopilot v1.27.11-gke.1062000 (like the rest). So no bursting.

| Deployments | Mem Usage (MiB) | Mem Limit (MiB) | CPU Usage (cores) | CPU Limit (cores) |
| ----------- | --------------- | --------------- | ----------------- | ----------------- |
| 500         | 211             | 416             | 0.096             | 0.47              |
| 1000        | 289             | 416             | 0.21              | 0.47              |
| 1500        | 386             | 448             | 0.193             | 0.47              |
| 2000        | 495             | 672             | 0.218             | 0.72              |
| 2500        | 561             | 672             | 0.215             | 0.72              |
| 3000        | 603             | 736             | 0.341             | 0.72              |
| 3500        | 709             | 928             | 0.307             | 0.97              |
| 4000        | 755             | 960             | 0.386             | 0.97              |

Screenshot from 2024-04-24 21-30-00

Screenshot from 2024-04-24 21-05-09

@karlkfi karlkfi force-pushed the karl-autoscale-count branch 4 times, most recently from 59b3b4b to eaeb1bb on May 8, 2024 22:33
@karlkfi karlkfi force-pushed the karl-autoscale-count branch 2 times, most recently from 15243c1 to b404034 on May 25, 2024 00:34
- Add .status.source.objectCount to track number of objects found
  after parsing from source (after hydration, if any).
- Add .status.sync.objectCount to track number of objects in the
  inventory ResourceGroup's .spec.resources.
- Add resource doubling for the reconciler container when the
  object count (greater of sync & source) is at least 1,000
  (autopilot only).
- Reduce the default resources for the reconciler on autopilot

Notes:
- Since the applier doesn't send the inventory size over events,
  we have to query the ResourceGroup after each inventory update event.
  This could be made more efficient with changes to the applier in
  cli-utils, but isn't really a big deal.
- Autoscaling is currently very trivial and could be fine tuned,
  but this is just a proof of concept to show an MVP and test that
  it would work on our tests.

Filter remediator watch by labels

Add watch label selectors

- Add config.k8s.io/owning-inventory as a label to all managed objects
- Add label selectors to the watchers used by the remediator and applier
- Modify cli-utils to allow optional label selector injection

Add more complex autoscaling math

Revert metrics expectation changes

Fix e2e test with hard coded label expectations

WIP resolve test issues

- Update TestConflictingDefinitions_NamespaceToRoot to expect
  a management conflict from the remediator, and for the
  remediator's conflict to replace the applier's conflict.
- De-dupe management conflict errors from remediator & applier.
- Fix a bug in the remediator that was dropping management
  conflict errors when using a watch filter. With the filter,
  changing the filter labels causes a DELETE event, instead of
  an UPDATE event, like it did before. The only way to
  double-check is to query the cluster, and we don't want to do
  that in the watch handler for performance reasons. So instead,
  update the remediator queue worker to query the cluster to
  confirm object deletion. Then also update the worker to be able
  to add management conflict errors to the same conflictHandler
  used by the watcher.
- Add HasConflictErrors() to the conflictHandler to replace the
  locked managementConflict bool in the watch.Manager. This will
  cause the retry run loop to re-apply if there's still a conflict.
  Is this desired behavior? Or do we only want to re-apply the
  first time? If so, we may need to cache the errors in the run
  loop so we can detect when new conflicts are added. Or maybe
  just expose the whole conflictHandler.
- Clean up parse.Run a little. Add resetRetryBackoff method to
  avoid leaking implementation details to the rest of the parser.
- Remove obsolete retry setting, since it wasn't exposed and was
  previously replaced with a backoff strategy.
- Fix the reconciler container CPU calculation to use milli-cpu.
- Add ConflictingObjectID to ManagementConflictError to allow
  de-duping by ID.
- Fix KptManagementConflictError to return a ManagementConflictError,
  not just a ResourceError. The currentManager will be UNKNOWN,
  but it only seems to be used by reportRootSyncConflicts, which
  only reports conflicts from the remediator, not the applier.

Reduce memory/cpu step size to 128Mi:125m

Reduce memory step size to 32Mi

This allows scaling down to 256Mi total reconciler Pod memory, not
just 512Mi. However, on non-bursting autopilot clusters, the
minimum is still 512Mi.

Avoid invalid periodic sync status updates

Skip sync status updates when the source cache is reset, otherwise
the commit that was previously synced will be removed.

Refactor Applier to expose an event handler

- Use the new event handler to send errors and inventory updates
  to the caller (Updater) so it can track them without locks in
  the applier.
- Update the Updater to cache the sync status with a shared lock.
- Add initialization to the Updater to refresh the sync status
  using the RootSync/RepoSync status. This should avoid losing
  status fields after the reconciler is restarted or rescheduled.
  Unfortunately, this doesn't work for errors, because they can't
  be parsed back into typed errors. But this shouldn't be a real
  problem because the applier was invalidating them before each
  apply anyway.
- Move periodic sync status updates into the Updater.
- Add a baseFinalizer class to handle collecting destroyer errors.

Double memory per object

Even though the memory limit seems higher than usage, the reconciler
is still being killed by the kernel because the node is out of
memory on autopilot. So we need to be a little more defensive about
requesting more memory than we need in order to avoid OOMKills on
crowded nodes.

Add TestStressProfileResourcesByObjectCount

Fix e2e tests that revert manual changes

In order for drift correction to work now, the object needs the
correct owning-inventory label used by the watch filter.
Without the label, drift will not be reverted.

Add TestStressProfileResourcesByObjectCountWithMultiSync

Fix waiting for deployments

Use configsync.gke.io/parent-package-id as filter

- Filter label key: configsync.gke.io/parent-package-id
- PACKAGE_ID: <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>
- PACKAGE_ID_FULL: <NAME>.<NAMESPACE>.<RootSync|RepoSync>
- Pre-pad the hex of the hash to 8 characters for consistent length (see the sketch below)
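A hypothetical Go sketch of that ID construction (the function name and argument order are assumptions; the format itself comes from the commit message above):

```go
// Sketch of the package-ID construction: truncate the full ID to 53
// characters and append a dash plus the 8-character, zero-padded fnv32a hash
// of the full ID, keeping the result within the 63-character limit for
// Kubernetes label values (53 + 1 + 8 = 62).
package packageid

import (
	"fmt"
	"hash/fnv"
)

// packageID builds <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>,
// where PACKAGE_ID_FULL is <NAME>.<NAMESPACE>.<RootSync|RepoSync>.
func packageID(name, namespace, kind string) string {
	full := fmt.Sprintf("%s.%s.%s", name, namespace, kind)
	h := fnv.New32a()
	h.Write([]byte(full))
	prefix := full
	if len(prefix) > 53 {
		prefix = prefix[:53]
	}
	return fmt.Sprintf("%s-%08x", prefix, h.Sum32())
}
```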

Add configsync.gke.io/package-id label to RSyncs

- Use reconciler-manager field manager when updating objects, to
  distinguish from the reconciler. Otherwise SSA from a parent
  reconciler managing an RSync will delete any metadata updates made
  by the reconciler-manager.

Fix rebase errors

Fix labels in TestKubectl tests

Use latest cli-utils watch filters

google-oss-prow bot commented May 28, 2024

@karlkfi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --------- | ------ | ------- | -------- | ------------- |
| kpt-config-sync-presubmit | 3b80a65 | link | true | /test kpt-config-sync-presubmit |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
