[WIP] Prototype autoscaling by object count #1187

- Add .status.source.objectCount to track number of objects found after parsing from source (after hydration, if any). - Add .status.sync.objectCount to track number of objects in the inventory ResourceGroup's .spec.resources. - Add resource doubling for the reconciler container when the object count (greater of sync & source) is at least 1,000 (autopilot only). - Reduce the default resources for the reconciler on autopilot Notes: - Since the applier doesn't send the inventory size over events, we have to query the ResourceGroup after each inventory update event. This could be made more efficient with changes to the applier in cli-utils, but isn't really a big deal. - Autoscaling is currently very trivial and could be fine tuned, but this is just a proof of concept to show an MVP and test that it would work on our tests. Filter remediator watch by labels Add watch label selectors - Add config.k8s.io/owning-inventory as a label to all managed objects - Add label selectors to the watchers used by the remediator and applier - Modify cli-utils to allow optional label selector injection Add more complex autoscaling math Revert metrics expectation changes Fix e2e test with hard coded label expectations WIP resolve test issues - Update TestConflictingDefinitions_NamespaceToRoot to expect a management conflict from the remediator, and for the remediator's conflict to replace the applier's conflict. - De-dupe management conflict errors from remediator & applier. - Fix a bug in the remediator that was dropping management conflict errors when using a watch filter. With the filter, changing the filter labels causes a DELETE event, instead of an UPDATE event, like it did before. The only way to double-check is to query the cluster, and we don't want to do that in the watch handler for performance reasons. So instead, update the remediator queue worker to query the cluster to confirm object deletion. Then also update the worker to be able to add management conflict errors to the same conflictHandler used by the watcher. - Add HasConflictErrors() to the conflictHandler to replace the locked managementConflict bool in the watch.Manager. This will cause the retry run loop to re-apply if there's still a conflict. Is this desired behavior? Or do we only want to re-apply the first time? If so, we may need to cache the errors in the run loop so we can detect when new conflicts are added. Or maybe just expose the whole conflictHandler. - Clean up parse.Run a little. Add resetRetryBackoff method to avoid leaking implementation details to the rest of the parser. - Remove obsolete retry setting, since it wasn't exposed and was previously replaced with a backoff strategy. - Fix the reconciler container CPU calculation to use milli-cpu. - Add ConflictingObjectID to ManagementConflictError to allow de-duping by ID. - Fix KptManagementConflictError to return a ManagementConflictError, not just a ResourceError. The currentManager will be UNKNOWN, but it only seems to be used by reportRootSyncConflicts, which only reports conflicts from the remediator, not the applier. Reduce memory/cpu step size to 128Mi:125m Reduce memory step size to 32Mi This allows scaling down to 256Mi total reconciler Pod memory, not just 512Mi. However, on non-bursting autopilot clusters, the minimum is still 512Mi. Avoid invalid periodic sync status updates Skip sync status updates when the source cache is reset, otherwise the commit that was previously synced will be removed. Refactor Applier to expose an event handler - Use the new event handler to send errors and inventory updates to the caller (Updater) so it can track them without locks in the applier. - Update the Updater to cache the sync status with a shared lock. - Add initialization to the Updater to refresh the sync status using the RootSync/RepoSync status. This should avoid losing status fields after the reconciler is restarted or rescheduled. Unfortunately, this doesn't work for errors, because they can't be parsed back into typed errors. But this shouldn't be a real problem because the applier was invalidating them before each apply anyway. - Move periodic sync status updates into the Updater. - Add a baseFinalizer class to handle collecting destroyer errors. Double memory per object Even though the memory limit seems higher than usage, the reconciler is still being killed by the kernel because the node is out of memory on autopilot. So we need to be a little more defensive about requesting more memory than we need in order to avoid OOMKills on crowded nodes. Add TestStressProfileResourcesByObjectCount Fix e2e tests that revert manual changes In order for drift correction to work now, the object needs the correct owning-inventory label used by the watch filter. Without the label, drift will not be reverted. Add TestStressProfileResourcesByObjectCountWithMultiSync Fix waiting for deployments Use configsync.gke.io/parent-package-id as filter - Filter label key: configsync.gke.io/parent-package-id - PACKAGE_ID: <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))> - PACKAGE_ID_FULL: <NAME>.<NAMESPACE>.<RootSync|RepoSync> - Pre-pad the hex of the hash to 8 characters for consistent length Add configsync.gke.io/package-id label to RSyncs - Use reconciler-manager field manager when updating objects, to distinguish from the reconciler. Otherwise SSA from a parent reconciler managing an RSync will delete any metadata updates made by the reconciler-manager. Fix rebase errors Fix labels in TestKubectl tests Use latest cli-utils watch filters

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Prototype autoscaling by object count #1187

[WIP] Prototype autoscaling by object count #1187

Commits on May 28, 2024

[WIP] Prototype autoscaling by object count #1187

Are you sure you want to change the base?

[WIP] Prototype autoscaling by object count #1187

Commits on May 28, 2024