[WIP] Prototype autoscaling by object count #1187

Open

karlkfi wants to merge 1 commit into main from karl-autoscale-count

Conversation

@karlkfi (Contributor) commented Mar 29, 2024

  • Add .status.source.objectCount to track number of objects found after parsing from source (after hydration, if any).
  • Add .status.sync.objectCount to track number of objects in the inventory ResourceGroup's .spec.resources.
  • Add resource doubling for the reconciler container when the object count (greater of sync & source) is at least 1,000 (autopilot only).
  • Reduce the default resources for the reconciler on autopilot

Notes:

  • Since the applier doesn't send the inventory size over events, we have to query the ResourceGroup after each inventory update event. This could be made more efficient with changes to the applier in cli-utils, but isn't really a big deal.
  • Autoscaling is currently very trivial and could be fine-tuned, but this is just a proof of concept to show an MVP and verify that it works in our tests. (A rough sketch of the doubling rule follows.)
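For illustration, here is a minimal Go sketch of the scale-up rule described above. The function and constant names are mine and it operates on plain corev1.ResourceRequirements, not the reconciler-manager's actual types, so treat it as an assumption-laden sketch rather than the PR's code.

```go
// Hypothetical sketch of the prototype's scale-up rule: double the reconciler
// container's resources when the object count (greater of sync & source)
// reaches 1,000. Names and types here are illustrative, not the PR's code.
package prototype

import corev1 "k8s.io/api/core/v1"

const objectCountThreshold = 1000

// adjustReconcilerResources returns doubled requests/limits once the object
// count crosses the threshold; otherwise it returns the defaults unchanged.
func adjustReconcilerResources(base corev1.ResourceRequirements, sourceCount, syncCount int) corev1.ResourceRequirements {
	count := sourceCount
	if syncCount > count {
		count = syncCount
	}
	if count < objectCountThreshold {
		return base
	}
	double := func(in corev1.ResourceList) corev1.ResourceList {
		out := corev1.ResourceList{}
		for name, qty := range in {
			q := qty.DeepCopy()
			q.Add(qty) // q = 2 * qty
			out[name] = q
		}
		return out
	}
	return corev1.ResourceRequirements{
		Requests: double(base.Requests),
		Limits:   double(base.Limits),
	}
}
```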


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi (Contributor Author) commented Mar 29, 2024

Example memory use during TestStressMemoryUsageOCI on GKE Autopilot:
Screenshot from 2024-03-29 12-26-13

@karlkfi karlkfi requested review from sdowell and removed request for haiyanmeng March 29, 2024 19:44
@nan-yu (Contributor) commented Mar 29, 2024

Memory usage is not solely determined by the object count. For instance, 1k objects of the same Kubernetes Kind may consume less memory than 100 objects with various kinds. Additionally, the number of unmanaged cluster objects may also contribute to memory consumption.

I feel like it may not be very accurate to have a linear relation between the resource defaults and the object count.

@karlkfi (Contributor Author) commented Mar 29, 2024

All of that is of course true. But tracking object count is a lot easier than tracking memory and CPU usage. This is partly why I didn't make the steps more granular/smaller and why I left some headroom above the current test usage.

Realistically, autoscaling based on actual usage is problematic for multiple reasons:

  • VPA uses metrics-server, which is a single point of failure: it acts as the backend for an APIService, so API discovery is disrupted (or breaks outright) whenever metrics-server is rescheduled or runs out of memory.
  • We could alternatively use a custom provider that talks to Prometheus and/or Cloud Monitoring, but that requires multiple provider implementations, manually configured RBAC/IAM, and possibly service account management.
  • Metrics are trailing/delayed, so scaling sometimes happens too late to prevent an OOMKill.
  • If you're already at the CPU/memory limit, you can't tell how much the workload wants to use, only that it wants more. So scaling up once may not be enough; you may have to scale up multiple times, causing significant disruption.
  • Every time you change the resources, the reconciler Pod needs to be evicted and rescheduled, which increases sync latency and reduces responsiveness.

Autoscaling based on some other metric, like the number of resources under management, sidesteps a lot of the issues with VPA: the object count is available before syncing (after parsing), and after syncing the object count in the inventory is easily accessible, even if the count from source has since been reduced. So we can add resources before a sync and keep them high until the sync has finished and the inventory has been updated.

While this approach is less granular and less ideal, it allows much faster scale-up and scale-down with significantly less disruption, no single point of failure that affects other controllers, and lower sync latency overall.

@karlkfi (Contributor Author) commented Apr 1, 2024

Here's a shot of the usage and limits with this prototype while running --test-features=reconciliation-2 e2e tests, which don't sync a lot of files.

You can see that the 256Mi reconciler memory is being rounded up to 416Mi, because of the other containers in the pod and autopilot rounding up. So I might lower the reconciler default even more so it only rounds up to 256Mi. However, this will require reducing the CPU even more as well, which might impact sync latency.

Screenshot from 2024-04-01 11-21-19

@karlkfi (Contributor Author) commented Apr 1, 2024

As another datapoint, here is TestStressManyDeployments on Autopilot, which uses ~300MiB memory to deploy 1,000 Deployments.

Screenshot from 2024-04-01 14-32-24

@karlkfi (Contributor Author) commented Apr 2, 2024

Here's a new test called TestStressManyRootSyncs that deploys the same 1,000 Deployments, but uses 10 RootSyncs to do it, 100 each. You can see that with the autoscaling, each RootSync reconciler gets a 436MiB limit and uses at most 180MiB. So with the custom autoscaling, each RootSync reconciler saves 256MiB+ of memory, or about 1/3 of the previous default. It would be more, but Autopilot is rounding up a lot.

It looks like we could go even lower, maybe another 256MiB after rounding, at least with only ~1k objects on the cluster. The story might be different with 10k+, but maybe it's ok to require resource overrides on large clusters...

Screenshot from 2024-04-01 17-28-39

@karlkfi (Contributor Author) commented Apr 2, 2024

Here's the same test with 2k deployments (10x200).

Memory use went up, but not by 100%, more like 70% (180Mi -> 260Mi).

Screenshot from 2024-04-02 12-04-21

@karlkfi (Contributor Author) commented Apr 2, 2024

For comparison, here's the same number of deployments but more reconcilers (20x100).

Interestingly, the peak memory use is basically the same as in the 10x200 test, even with half the objects per reconciler. It also peaks on the initial deploy, not just the teardown. The drop after the apply is also more pronounced, possibly due to memory pressure triggering Go's garbage collection more aggressively. And you can see that the reconcilers that peaked later peaked higher, probably because their watch cache was being populated with more objects on the cluster, even though all the RootSyncs were created around the same time.

Screenshot from 2024-04-02 13-17-50

@karlkfi (Contributor Author) commented Apr 2, 2024

Here's 20x200 (4000 deployments total).

With this configuration, some of the reconcilers are reaching the new, lower 436Mi memory limit, but the test still completed. From this we can clearly see that the total number of resources being watched is a significant contributor to memory use: compared to the 10x200 test, each reconciler uses almost double the memory.

If this trend continues, I expect that adding more Deployments to the cluster will cause the reconcilers to be OOMKilled, even though they're nowhere near the 1k object trigger that bumps up the memory.

Screenshot from 2024-04-02 14-05-02

Edit: Here's another 20x200 run that didn't hit the limits.

Screenshot from 2024-04-02 18-08-45

@karlkfi (Contributor Author) commented Apr 2, 2024

At 30x200 the test fails to complete. The reconcilers are repeatedly OOMKilled before they can finish syncing.

It seems like maybe autoscaling based purely on the number of objects being managed by the current reconciler isn't going to work on larger clusters (5k+ objects in watched resource types).

Screenshot from 2024-04-02 16-34-25

@karlkfi (Contributor Author) commented Apr 3, 2024

So I figured out how to modify the remediator to filter watches by label, and added a filter that should only watch resources with the correct cli-utils.sigs.k8s.io/inventory-id label for that reconciler. However, I also discovered that the remediator doesn't use the Informer library, so it doesn't have an object cache. As a result, filtering the remediator didn't affect the memory usage much.

Here's another 20x200 with the new filter:

Screenshot from 2024-04-02 20-16-42

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Apr 4, 2024
@karlkfi (Contributor Author) commented Apr 4, 2024

Turns out, the label selector I used for the previous test was filtering out all objects, but the test still passed because it didn't depend on remediation.

It looks like there aren't any existing labels that can be used for selection, because the only pre-existing label uses the same value for all reconcilers.

So instead, I added a new label, or rather copied an existing annotation to also be a label: config.k8s.io/owning-inventory. Then I added label selectors so the remediator and applier/destroyer only watch objects whose label value matches the current reconciler. This means we can stop watching objects synced by other reconcilers and mostly remove the memory bloat caused by large numbers of objects on the cluster.

Now memory mostly scales with the number of objects being synced, rather than also with the number of objects on the cluster.
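As a rough illustration of the label-selector filtering (not the actual remediator code: the dynamic-client wiring, function name, and GVR plumbing are assumptions; only the label key comes from the comment above), a filtered watch might look like this:

```go
// Illustrative sketch: restrict a dynamic-client watch to objects whose
// config.k8s.io/owning-inventory label matches this reconciler's inventory ID,
// so objects synced by other reconcilers never enter memory.
package watchfilter

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

const owningInventoryLabel = "config.k8s.io/owning-inventory"

// watchOwnedObjects watches only the objects managed by this reconciler's
// inventory. Passing an empty namespace watches across all namespaces.
func watchOwnedObjects(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, inventoryID string) (watch.Interface, error) {
	selector := labels.Set{owningInventoryLabel: inventoryID}.AsSelector().String()
	return client.Resource(gvr).Namespace(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: selector,
	})
}
```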

Here's 20x200:
Screenshot from 2024-04-03 15-24-17

And here's 30x200:
Screenshot from 2024-04-03 17-29-22

As you can see, they both use about the same amount of memory per reconciler: less than 150MiB.

@karlkfi (Contributor Author) commented Apr 4, 2024

10x500 seems to cap out a little higher at ~170Mi
Screenshot from 2024-04-03 18-09-51

@karlkfi (Contributor Author) commented Apr 4, 2024

10x700 got up to ~200Mi
Screenshot from 2024-04-03 18-43-07

@karlkfi (Contributor Author) commented Apr 4, 2024

I'm having quota problems with larger numbers of nodes, so here's 3x990, keeping it just under the 1,000-object trigger. It obviously doesn't need even half of the limit here, maxing out at ~215MiB.
Screenshot from 2024-04-03 20-01-25

@karlkfi (Contributor Author) commented Apr 4, 2024

2x2000 here. Maxes out at ~275Mi
Screenshot from 2024-04-03 20-29-02

@karlkfi (Contributor Author) commented Apr 4, 2024

1x5000 here. Maxes out at ~525Mi
Screenshot from 2024-04-03 20-59-20

@karlkfi (Contributor Author) commented Apr 10, 2024

Crunched some of those numbers (with the watch filtering) and got a pretty flat trend line:

Screenshot from 2024-04-09 18-05-39

These numbers are with a very simple Deployment spec replicated many times, so it may not match actual workloads, which will have both bigger (e.g. CRDs) and smaller (e.g. Service) objects. But it seems like a decent place to start.

@karlkfi (Contributor Author) commented Apr 18, 2024

New custom autoscaling logic is a bit more complicated.

It uses 1Mi for every 10 objects, on top of a 128Mi base, then rounds down to 256Mi increments, with a 256Mi minimum. It calculates the memory first, then calculates the CPU from the memory, keeping the 1:1 ratio required by autopilot.

That said, since this is just for the first container, the resources may still be rounded up by autopilot on all but the latest GKE v1.29+ clusters (created no earlier than v1.26).

For now, the requests & limits are the same, but we could change this in the future for new clusters, with a cluster version check. However, I'm not sure how to check that the cluster is using cgroups v2.

@karlkfi (Contributor Author) commented Apr 18, 2024

Looks like I missed a critical constraint. Apparently GKE Autopilot has a minimum memory per Pod of 512 MiB, for older clusters that don't support bursting. So I won't be able to get it down to 256MiB unless it's a new cluster on rapid channel.

https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#compute-class-min-max

@karlkfi (Contributor Author) commented Apr 19, 2024

Interesting data point here. With the watch filtering and custom autoscaling, a 672MiB limit still isn't enough to prevent TestStressManyDeployments from failing with 1x4000 Deployments.

One difference between TestStressManyDeployments and TestStressManyRootSyncs is that TestStressManyDeployments uses root-sync to manage both the Namespace and the Deployments, which means the watches are cluster-scoped, not namespace-scoped. But hopefully that doesn't matter with the watch filter.

reconciler-watch-filter-1x4000-oomkill

The OOMKill is pretty obvious from the usage drop, but the limit drop is surprising. I think it's a bug.

I think the status.sync.objectCount is being removed from the RootSync during syncing. This might be due to the periodic sync status updates during syncing, or it might be a status update prior to syncing. If so, it's possible that removing the status.sync.objectCount triggered the reconciler-manager to lower the memory, causing the OOMKill. Needs more investigation.

I think I also found a related bug that was causing the status.sync.commit to be removed after the source cache was reset, because the code that updates status.sync.commit gets the commit from the source cache. I think I fixed that one locally though.

@karlkfi (Contributor Author) commented Apr 19, 2024

Ahh. ok. design problem...

The status.sync.objectCount is being updated before each apply attempt, to match the objects in the new status.sync.commit. But that means the status.sync.objectCount is decremented before the objects have actually been deleted, which triggers a scale-down even though the previous objectCount was large enough to cause a scale-up.

Since updating the status.sync.objectCount when the status.sync.commit is updated makes intuitive sense, I don't really want to change that behavior. Instead, we need another field to record the inventory size until after the applier has succeeded. Maybe just a status.inventoryCount or status.inventoryObjectCount or status.managedObjectCount?

Correction:

The status.source is being updated before each apply attempt, not the status.sync. The status.sync is being updated asynchronously while applying.

The actual problem is that the periodic sync status update often updates the status before any inventory events have been received. Normally this is fine, because the inventory size is cached in the applier. But if the reconciler was OOMKilled, it would have restarted and cleared the cache. So the next time it tries to apply, it would wipe out the status.sync.objectCount, causing down scaling.

I think that means we either need to:
A) delay concurrent sync status updates until after the first inventory event from the applier (slightly delayed adding/updating of the sync status and syncing condition)
B) look up the inventory size before starting an apply attempt (an extra API call)

A is slightly more efficient, but more complicated to coordinate. B is simpler to implement (a rough sketch of B follows).
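Here's a minimal sketch of what option B could look like, reading the inventory ResourceGroup as unstructured to avoid depending on its Go types. The function name and the controller-runtime client wiring are assumptions; only the kpt.dev/v1alpha1 ResourceGroup kind and the .spec.resources field come from this thread.

```go
// Sketch of option B: before each apply attempt, look up the inventory
// ResourceGroup and count .spec.resources so the sync object count survives a
// reconciler restart (at the cost of an extra API call).
package inventorysize

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// inventoryObjectCount returns len(.spec.resources) of the named ResourceGroup.
func inventoryObjectCount(ctx context.Context, c client.Client, name, namespace string) (int, error) {
	rg := &unstructured.Unstructured{}
	rg.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "kpt.dev",
		Version: "v1alpha1",
		Kind:    "ResourceGroup",
	})
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, rg); err != nil {
		return 0, err
	}
	resources, _, err := unstructured.NestedSlice(rg.Object, "spec", "resources")
	if err != nil {
		return 0, err
	}
	return len(resources), nil
}
```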

@karlkfi (Contributor Author) commented Apr 25, 2024

It required a lot of changes, but I managed to get the sync status to persist between reconciler restarts (except errors).

To get it to work I refactored the Applier and Updater so the Updater is the one caching sync status (errors, object count, commit). I added another event handling layer so the Applier could punt these to the Updater without needing to keep its own record of errors or object count. Then I added an initialization stage to the Updater so when it runs the first time it pulls the current status from the RSync.

Unfortunately, errors can't be parsed back after being formatted, so we can't keep those, but that's mostly ok because the errors were already being invalidated before each Applier.Apply attempt anyway.
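To make the split concrete, here's an illustrative-only sketch of that event-handling layer; the interface, type, and field names are mine, not the PR's, and the real Updater tracks more state (commit, error categories) than shown here.

```go
// Illustrative sketch of the Applier -> Updater event-handling split: the
// Applier reports outcomes through a handler, and the Updater owns the cached
// sync status behind a lock.
package updater

import "sync"

// EventHandler receives apply outcomes from the Applier so it doesn't need to
// keep its own record of errors or object count.
type EventHandler interface {
	OnInventoryUpdate(objectCount int)
	OnApplyError(err error)
}

// SyncStatusCache is the Updater-side cache, guarded by a mutex because the
// Applier reports events concurrently with periodic status updates.
type SyncStatusCache struct {
	mu          sync.Mutex
	commit      string
	objectCount int
	errs        []error
}

var _ EventHandler = (*SyncStatusCache)(nil)

func (c *SyncStatusCache) OnInventoryUpdate(objectCount int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objectCount = objectCount
}

func (c *SyncStatusCache) OnApplyError(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, err)
}
```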

When I got that working, I found that my test was still being OOMKilled, even though the memory usage was always below the limit. So I had to increase the memory per object to make sure the node had enough memory.

The new memory calculation is: Max( RoundUpToNearestMultiple( (object count / 5) + MIN_MEM, STEP_SIZE ), MIN_MEM ), with MIN_MEM=128 and STEP_SIZE=32.

The CPU calculation follows the same pattern: Max( Ceil( MEM_LIMIT * CPU_RATIO ), CPU_MIN ), with CPU_MIN=125 and CPU_RATIO=1000/1024. (Both are sketched in code after the examples below.)

So for example:

  • if object count is 0, reconciler memory limit is 128MiB, and cpu limit is 125m
  • if object count is 3000, reconciler memory limit is 736MiB, and cpu limit is 720m
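For concreteness, here's a small Go sketch of that calculation (constant and helper names are mine; the prototype's code may differ). Note that for the 3,000-object example this formula yields 719m rather than the 720m quoted above, so the actual code presumably applies a final rounding step that this sketch omits.

```go
// Sketch of the object-count-based resource calculation described above:
//   mem = max(roundUp(count/5 + MIN_MEM, STEP_SIZE), MIN_MEM)   in MiB
//   cpu = max(ceil(mem * CPU_RATIO), CPU_MIN)                   in milli-CPU
package autoscale

import "math"

const (
	minMemMiB   = 128
	memStepMiB  = 32
	minCPUMilli = 125
	cpuRatio    = 1000.0 / 1024.0 // milli-CPU per MiB
)

// roundUpToMultiple rounds value up to the nearest multiple.
func roundUpToMultiple(value, multiple int) int {
	return ((value + multiple - 1) / multiple) * multiple
}

// reconcilerMemoryMiB returns the reconciler memory limit for an object count.
// Examples: 0 objects -> 128MiB, 3000 objects -> 736MiB.
func reconcilerMemoryMiB(objectCount int) int {
	mem := roundUpToMultiple(objectCount/5+minMemMiB, memStepMiB)
	if mem < minMemMiB {
		mem = minMemMiB
	}
	return mem
}

// reconcilerCPUMilli derives the CPU limit from the memory limit.
// Example: 128MiB -> 125m.
func reconcilerCPUMilli(memMiB int) int {
	cpu := int(math.Ceil(float64(memMiB) * cpuRatio))
	if cpu < minCPUMilli {
		cpu = minCPUMilli
	}
	return cpu
}
```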

@karlkfi (Contributor Author) commented Apr 25, 2024

So I modified one of the tests to deploy incrementally more Deployments, pruning them all after each step.

The test took 2.1 hrs, so I probably won't keep it running in the e2e tests, but I might leave it in as skipped, to run on-demand.

This is on GKE Autopilot v1.27.11-gke.1062000 (like the rest). So no bursting.

| Deployments | Mem Usage (MiB) | Mem Limit (MiB) | CPU Usage (cores) | CPU Limit (cores) |
| ----------- | --------------- | --------------- | ----------------- | ----------------- |
| 500         | 211             | 416             | 0.096             | 0.47              |
| 1000        | 289             | 416             | 0.21              | 0.47              |
| 1500        | 386             | 448             | 0.193             | 0.47              |
| 2000        | 495             | 672             | 0.218             | 0.72              |
| 2500        | 561             | 672             | 0.215             | 0.72              |
| 3000        | 603             | 736             | 0.341             | 0.72              |
| 3500        | 709             | 928             | 0.307             | 0.97              |
| 4000        | 755             | 960             | 0.386             | 0.97              |

Screenshot from 2024-04-24 21-30-00

Screenshot from 2024-04-24 21-05-09

@karlkfi karlkfi force-pushed the karl-autoscale-count branch 4 times, most recently from 59b3b4b to eaeb1bb on May 8, 2024 22:33
@karlkfi karlkfi force-pushed the karl-autoscale-count branch 2 times, most recently from 15243c1 to b404034 on May 25, 2024 00:34
- Add .status.source.objectCount to track number of objects found
  after parsing from source (after hydration, if any).
- Add .status.sync.objectCount to track number of objects in the
  inventory ResourceGroup's .spec.resources.
- Add resource doubling for the reconciler container when the
  object count (greater of sync & source) is at least 1,000
  (autopilot only).
- Reduce the default resources for the reconciler on autopilot

Notes:
- Since the applier doesn't send the inventory size over events,
  we have to query the ResourceGroup after each inventory update event.
  This could be made more efficient with changes to the applier in
  cli-utils, but isn't really a big deal.
- Autoscaling is currently very trivial and could be fine tuned,
  but this is just a proof of concept to show an MVP and test that
  it would work on our tests.

Filter remediator watch by labels

Add watch label selectors

- Add config.k8s.io/owning-inventory as a label to all managed objects
- Add label selectors to the watchers used by the remediator and applier
- Modify cli-utils to allow optional label selector injection

Add more complex autoscaling math

Revert metrics expectation changes

Fix e2e test with hard coded label expectations

WIP resolve test issues

- Update TestConflictingDefinitions_NamespaceToRoot to expect
  a management conflict from the remediator, and for the
  remediator's conflict to replace the applier's conflict.
- De-dupe management conflict errors from remediator & applier.
- Fix a bug in the remediator that was dropping management
  conflict errors when using a watch filter. With the filter,
  changing the filter labels causes a DELETE event, instead of
  an UPDATE event, like it did before. The only way to
  double-check is to query the cluster, and we don't want to do
  that in the watch handler for performance reasons. So instead,
  update the remediator queue worker to query the cluster to
  confirm object deletion. Then also update the worker to be able
  to add management conflict errors to the same conflictHandler
  used by the watcher.
- Add HasConflictErrors() to the conflictHandler to replace the
  locked managementConflict bool in the watch.Manager. This will
  cause the retry run loop to re-apply if there's still a conflict.
  Is this desired behavior? Or do we only want to re-apply the
  first time? If so, we may need to cache the errors in the run
  loop so we can detect when new conflicts are added. Or maybe
  just expose the whole conflictHandler.
- Clean up parse.Run a little. Add resetRetryBackoff method to
  avoid leaking implementation details to the rest of the parser.
- Remove obsolete retry setting, since it wasn't exposed and was
  previously replaced with a backoff strategy.
- Fix the reconciler container CPU calculation to use milli-cpu.
- Add ConflictingObjectID to ManagementConflictError to allow
  de-duping by ID.
- Fix KptManagementConflictError to return a ManagementConflictError,
  not just a ResourceError. The currentManager will be UNKNOWN,
  but it only seems to be used by reportRootSyncConflicts, which
  only reports conflicts from the remediator, not the applier.

Reduce memory/cpu step size to 128Mi:125m

Reduce memory step size to 32Mi

This allows scaling down to 256Mi total reconciler Pod memory, not
just 512Mi. However, on non-bursting autopilot clusters, the
minimum is still 512Mi.

Avoid invalid periodic sync status updates

Skip sync status updates when the source cache is reset, otherwise
the commit that was previously synced will be removed.

Refactor Applier to expose an event handler

- Use the new event handler to send errors and inventory updates
  to the caller (Updater) so it can track them without locks in
  the applier.
- Update the Updater to cache the sync status with a shared lock.
- Add initialization to the Updater to refresh the sync status
  using the RootSync/RepoSync status. This should avoid losing
  status fields after the reconciler is restarted or rescheduled.
  Unfortunately, this doesn't work for errors, because they can't
  be parsed back into typed errors. But this shouldn't be a real
  problem because the applier was invalidating them before each
  apply anyway.
- Move periodic sync status updates into the Updater.
- Add a baseFinalizer class to handle collecting destroyer errors.

Double memory per object

Even though the memory limit seems higher than usage, the reconciler
is still being killed by the kernel because the node is out of
memory on autopilot. So we need to be a little more defensive about
requesting more memory than we need in order to avoid OOMKills on
crowded nodes.

Add TestStressProfileResourcesByObjectCount

Fix e2e tests that revert manual changes

In order for drift correction to work now, the object needs the
correct owning-inventory label used by the watch filter.
Without the label, drift will not be reverted.

Add TestStressProfileResourcesByObjectCountWithMultiSync

Fix waiting for deployments

Use configsync.gke.io/parent-package-id as filter

- Filter label key: configsync.gke.io/parent-package-id
- PACKAGE_ID: <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>
- PACKAGE_ID_FULL: <NAME>.<NAMESPACE>.<RootSync|RepoSync>
- Pre-pad the hex of the hash to 8 characters for consistent length (see the sketch below)
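A hypothetical Go sketch of that ID construction (the function name and argument order are assumptions; the format itself comes from the commit message above):

```go
// Sketch of the package-ID construction: truncate the full ID to 53
// characters and append a dash plus the 8-character, zero-padded fnv32a hash
// of the full ID, keeping the result within the 63-character limit for
// Kubernetes label values (53 + 1 + 8 = 62).
package packageid

import (
	"fmt"
	"hash/fnv"
)

// packageID builds <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>,
// where PACKAGE_ID_FULL is <NAME>.<NAMESPACE>.<RootSync|RepoSync>.
func packageID(name, namespace, kind string) string {
	full := fmt.Sprintf("%s.%s.%s", name, namespace, kind)
	h := fnv.New32a()
	h.Write([]byte(full))
	prefix := full
	if len(prefix) > 53 {
		prefix = prefix[:53]
	}
	return fmt.Sprintf("%s-%08x", prefix, h.Sum32())
}
```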

Add configsync.gke.io/package-id label to RSyncs

- Use reconciler-manager field manager when updating objects, to
  distinguish from the reconciler. Otherwise SSA from a parent
  reconciler managing an RSync will delete any metadata updates made
  by the reconciler-manager.

Fix rebase errors

Fix labels in TestKubectl tests

Use latest cli-utils watch filters

google-oss-prow bot commented May 28, 2024

@karlkfi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --------- | ------ | ------- | -------- | ------------- |
| kpt-config-sync-presubmit | 3b80a65 | link | true | /test kpt-config-sync-presubmit |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
