- Add .status.source.objectCount to track the number of objects
found after parsing the source (after hydration, if any).
- Add .status.sync.objectCount to track the number of objects in
the inventory ResourceGroup's .spec.resources.
- Add resource doubling for the reconciler container when the
object count (the greater of sync & source) is at least 1,000
(Autopilot only; a sketch follows the notes below).
- Reduce the default resources for the reconciler on Autopilot.
Notes:
- Since the applier doesn't send the inventory size over events,
we have to query the ResourceGroup after each inventory update
event. This could be made more efficient with changes to the
applier in cli-utils, but it isn't a big deal.
- Autoscaling is currently very simple and could be fine-tuned,
but this is a proof of concept to demonstrate an MVP and verify
that it works in our tests.
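
A minimal sketch of the object-count-based doubling described above,
assuming it runs in the reconciler-manager when building the reconciler
Deployment; the helper name, threshold constant, and wiring are
illustrative, not the actual implementation:

```go
// Hypothetical helper: double the reconciler container's resources when the
// object count (the greater of source & sync) crosses a threshold
// (Autopilot only).
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const objectCountThreshold = 1000 // illustrative threshold from the notes above

func adjustReconcilerResources(reqs corev1.ResourceList, sourceCount, syncCount int) corev1.ResourceList {
	count := sourceCount
	if syncCount > count {
		count = syncCount
	}
	if count < objectCountThreshold {
		return reqs
	}
	out := corev1.ResourceList{}
	for name, qty := range reqs {
		doubled := qty.DeepCopy()
		doubled.Add(qty) // qty * 2
		out[name] = doubled
	}
	return out
}

func main() {
	reqs := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("250m"),
		corev1.ResourceMemory: resource.MustParse("256Mi"),
	}
	doubled := adjustReconcilerResources(reqs, 1500, 900)
	cpu := doubled[corev1.ResourceCPU]
	mem := doubled[corev1.ResourceMemory]
	fmt.Println(cpu.String(), mem.String()) // expect: 500m 512Mi
}
```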
Filter remediator watch by labels
Add watch label selectors
- Add config.k8s.io/owning-inventory as a label to all managed objects
- Add label selectors to the watchers used by the remediator and applier
- Modify cli-utils to allow optional label selector injection
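
A rough sketch of what a label-filtered watch looks like with a dynamic
client; only the config.k8s.io/owning-inventory label key comes from the
change above, the rest is illustrative wiring:

```go
// Sketch: restrict a watch to objects carrying the owning-inventory label,
// which keeps the remediator/applier watch caches small on shared clusters.
package watcher

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func watchManagedObjects(ctx context.Context, cfg *rest.Config, gvr schema.GroupVersionResource, inventoryID string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// Only objects labeled with the owning inventory ID are watched.
	selector := labels.Set{"config.k8s.io/owning-inventory": inventoryID}.AsSelector()
	w, err := client.Resource(gvr).Watch(ctx, metav1.ListOptions{
		LabelSelector: selector.String(),
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Printf("event: %s\n", event.Type)
	}
	return nil
}
```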
Add more complex autoscaling math
Revert metrics expectation changes
Fix e2e test with hard-coded label expectations
WIP: resolve test issues
- Update TestConflictingDefinitions_NamespaceToRoot to expect
a management conflict from the remediator, and for the
remediator's conflict to replace the applier's conflict.
- De-dupe management conflict errors from remediator & applier.
- Fix a bug in the remediator that was dropping management
conflict errors when using a watch filter. With the filter,
changing the filter labels causes a DELETE event instead of an
UPDATE event, as it did before the filter was added. The only
way to double-check is to query the cluster, and we don't want
to do that in the watch handler for performance reasons. So
instead, update the remediator queue worker to query the cluster
to confirm object deletion, and also update the worker to be
able to add management conflict errors to the same
conflictHandler used by the watcher.
- Add HasConflictErrors() to the conflictHandler to replace the
locked managementConflict bool in the watch.Manager (a sketch of
this shape follows this list). This will cause the retry run
loop to re-apply if there's still a conflict. Is this the
desired behavior, or do we only want to re-apply the first time?
If the latter, we may need to cache the errors in the run loop
so we can detect when new conflicts are added, or maybe just
expose the whole conflictHandler.
- Clean up parse.Run a little. Add resetRetryBackoff method to
avoid leaking implementation details to the rest of the parser.
- Remove obsolete retry setting, since it wasn't exposed and was
previously replaced with a backoff strategy.
- Fix the reconciler container CPU calculation to use milli-cpu.
- Add ConflictingObjectID to ManagementConflictError to allow
de-duping by ID.
- Fix KptManagementConflictError to return a ManagementConflictError,
not just a ResourceError. The currentManager will be UNKNOWN,
but it only seems to be used by reportRootSyncConflicts, which
only reports conflicts from the remediator, not the applier.
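
A minimal sketch of the conflictHandler shape described in the
HasConflictErrors() bullet above; the ID and error types are simplified
stand-ins for the real Config Sync types:

```go
// Sketch of a conflict handler that de-dupes management conflicts by object
// ID and exposes HasConflictErrors().
package conflict

import "sync"

// ObjectID is a stand-in for the real ConflictingObjectID type.
type ObjectID struct {
	Group, Kind, Namespace, Name string
}

type ManagementConflictError interface {
	error
	ConflictingObjectID() ObjectID
}

type conflictHandler struct {
	mu        sync.RWMutex
	conflicts map[ObjectID]ManagementConflictError
}

func newConflictHandler() *conflictHandler {
	return &conflictHandler{conflicts: map[ObjectID]ManagementConflictError{}}
}

// AddConflictError records a conflict, replacing any previous error for the
// same object, so remediator and applier conflicts don't duplicate.
func (h *conflictHandler) AddConflictError(err ManagementConflictError) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.conflicts[err.ConflictingObjectID()] = err
}

func (h *conflictHandler) RemoveConflictError(id ObjectID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.conflicts, id)
}

// HasConflictErrors replaces the locked managementConflict bool; the retry
// loop can re-apply while this returns true.
func (h *conflictHandler) HasConflictErrors() bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return len(h.conflicts) > 0
}
```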
Reduce memory/cpu step size to 128Mi:125m
Reduce memory step size to 32Mi
This allows scaling down to 256Mi total reconciler Pod memory,
instead of just 512Mi. However, on non-bursting Autopilot
clusters, the minimum is still 512Mi.
Avoid invalid periodic sync status updates
Skip sync status updates when the source cache is reset;
otherwise, the previously synced commit would be removed from
the status.
Refactor Applier to expose an event handler
- Use the new event handler to send errors and inventory updates
to the caller (Updater) so it can track them without locks in
the applier.
- Update the Updater to cache the sync status with a shared lock.
- Add initialization to the Updater to refresh the sync status
using the RootSync/RepoSync status. This should avoid losing
status fields after the reconciler is restarted or rescheduled.
Unfortunately, this doesn't work for errors, because they can't
be parsed back into typed errors. But this shouldn't be a real
problem because the applier was invalidating them before each
apply anyway.
- Move periodic sync status updates into the Updater.
- Add a baseFinalizer type to handle collecting destroyer errors.
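
A rough sketch of the event-handler shape this refactor describes, with
simplified stand-in types; the real applier events and errors come from
cli-utils and Config Sync, and the names below are illustrative:

```go
// Sketch of the applier -> Updater event flow.
package apply

import (
	"context"
	"sync"
)

// Event is a simplified applier event: an error, or an inventory update
// carrying the current object count.
type Event struct {
	Err            error
	InventoryCount int
}

// EventHandler is supplied by the caller (the Updater) so it can record sync
// errors and inventory updates without sharing locks with the applier.
type EventHandler func(Event)

type Applier struct{}

// Apply streams events to the handler as the apply runs, instead of
// accumulating errors behind a mutex inside the applier.
func (a *Applier) Apply(ctx context.Context, handle EventHandler) error {
	// ... run the underlying apply; for each event observed:
	handle(Event{InventoryCount: 42}) // e.g. an inventory update
	return nil
}

// Updater caches the sync status under its own lock, fed by the handler.
type Updater struct {
	mu     sync.Mutex
	status SyncStatus
}

type SyncStatus struct {
	Errs           []error
	InventoryCount int
}

func (u *Updater) handleEvent(e Event) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if e.Err != nil {
		u.status.Errs = append(u.status.Errs, e.Err)
	}
	if e.InventoryCount > 0 {
		u.status.InventoryCount = e.InventoryCount
	}
}
```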
Double memory per object
Even though the memory limit appears higher than usage, the
reconciler is still being killed by the kernel because the node
runs out of memory on Autopilot. So we need to be a little more
defensive and request more memory than we strictly need, to
avoid OOMKills on crowded nodes.
Add TestStressProfileResourcesByObjectCount
Fix e2e tests that revert manual changes
In order for drift correction to work now, the object needs the
correct owning-inventory label used by the watch filter.
Without the label, drift will not be reverted.
Add TestStressProfileResourcesByObjectCountWithMultiSync
Fix waiting for deployments
Use configsync.gke.io/parent-package-id as filter
- Filter label key: configsync.gke.io/parent-package-id
- PACKAGE_ID: <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>
- PACKAGE_ID_FULL: <NAME>.<NAMESPACE>.<RootSync|RepoSync>
- Pre-pad the hex of the hash to 8 characters for consistent length
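
A small sketch of that derivation; the helper name is illustrative, the
format comes from the bullets above:

```go
// Derive the parent-package-id label value from an RSync's name, namespace,
// and kind.
package main

import (
	"fmt"
	"hash/fnv"
)

// packageID builds <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>,
// with the hex pre-padded to 8 characters. 53 + 1 + 8 = 62, which fits the
// 63-character label value limit.
func packageID(name, namespace, kind string) string {
	full := fmt.Sprintf("%s.%s.%s", name, namespace, kind) // PACKAGE_ID_FULL
	h := fnv.New32a()
	h.Write([]byte(full))
	suffix := fmt.Sprintf("%08x", h.Sum32()) // zero-padded to 8 hex chars
	prefix := full
	if len(prefix) > 53 {
		prefix = prefix[:53]
	}
	return prefix + "-" + suffix
}

func main() {
	fmt.Println(packageID("my-sync", "config-management-system", "RootSync"))
}
```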
Add configsync.gke.io/package-id label to RSyncs
- Use the reconciler-manager field manager when updating objects,
to distinguish it from the reconciler. Otherwise, SSA from a
parent reconciler managing an RSync would delete any metadata
updates made by the reconciler-manager.
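
A hedged sketch of the field-manager usage, assuming a controller-runtime
client; only the "reconciler-manager" field manager name comes from the
change above, the helper and wiring are illustrative:

```go
// Sketch: update RSync labels with a distinct field manager so server-side
// apply from a parent reconciler doesn't wipe metadata owned by the
// reconciler-manager.
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

const reconcilerManagerFieldManager = "reconciler-manager"

// updateRSyncLabels merges labels onto an RSync and updates it as the
// reconciler-manager field manager, so label ownership is tracked separately
// from the reconciler.
func updateRSyncLabels(ctx context.Context, c client.Client, obj client.Object, newLabels map[string]string) error {
	existing := obj.GetLabels()
	if existing == nil {
		existing = map[string]string{}
	}
	for k, v := range newLabels {
		existing[k] = v
	}
	obj.SetLabels(existing)
	return c.Update(ctx, obj, client.FieldOwner(reconcilerManagerFieldManager))
}
```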
Fix rebase errors
Fix labels in TestKubectl tests
Use latest cli-utils watch filters