Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Prototype autoscaling by object count #1187

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Commits on May 28, 2024

  1. [WIP] Prototype autoscaling by object count

    - Add .status.source.objectCount to track number of objects found
      after parsing from source (after hydration, if any).
    - Add .status.sync.objectCount to track number of objects in the
      inventory ResourceGroup's .spec.resources.
    - Add resource doubling for the reconciler container when the
      object count (greater of sync & source) is at least 1,000
      (autopilot only).
    - Reduce the default resources for the reconciler on autopilot
    
    Notes:
    - Since the applier doesn't send the inventory size over events,
      we have to query the ResourceGroup after each inventory update event.
      This could be made more efficient with changes to the applier in
      cli-utils, but isn't really a big deal.
    - Autoscaling is currently very trivial and could be fine tuned,
      but this is just a proof of concept to show an MVP and test that
      it would work on our tests.
    
    Filter remediator watch by labels
    
    Add watch label selectors
    
    - Add config.k8s.io/owning-inventory as a label to all managed objects
    - Add label selectors to the watchers used by the remediator and applier
    - Modify cli-utils to allow optional label selector injection
    
    Add more complex autoscaling math
    
    Revert metrics expectation changes
    
    Fix e2e test with hard coded label expectations
    
    WIP resolve test issues
    
    - Update TestConflictingDefinitions_NamespaceToRoot to expect
      a management conflict from the remediator, and for the
      remediator's conflict to replace the applier's conflict.
    - De-dupe management conflict errors from remediator & applier.
    - Fix a bug in the remediator that was dropping management
      conflict errors when using a watch filter. With the filter,
      changing the filter labels causes a DELETE event, instead of
      an UPDATE event, like it did before. The only way to
      double-check is to query the cluster, and we don't want to do
      that in the watch handler for performance reasons. So instead,
      update the remediator queue worker to query the cluster to
      confirm object deletion. Then also update the worker to be able
      to add management conflict errors to the same conflictHandler
      used by the watcher.
    - Add HasConflictErrors() to the conflictHandler to replace the
      locked managementConflict bool in the watch.Manager. This will
      cause the retry run loop to re-apply if there's still a conflict.
      Is this desired behavior? Or do we only want to re-apply the
      first time? If so, we may need to cache the errors in the run
      loop so we can detect when new conflicts are added. Or maybe
      just expose the whole conflictHandler.
    - Clean up parse.Run a little. Add resetRetryBackoff method to
      avoid leaking implementation details to the rest of the parser.
    - Remove obsolete retry setting, since it wasn't exposed and was
      previously replaced with a backoff strategy.
    - Fix the reconciler container CPU calculation to use milli-cpu.
    - Add ConflictingObjectID to ManagementConflictError to allow
      de-duping by ID.
    - Fix KptManagementConflictError to return a ManagementConflictError,
      not just a ResourceError. The currentManager will be UNKNOWN,
      but it only seems to be used by reportRootSyncConflicts, which
      only reports conflicts from the remediator, not the applier.
    
    Reduce memory/cpu step size to 128Mi:125m
    
    Reduce memory step size to 32Mi
    
    This allows scaling down to 256Mi total reconciler Pod memory, not
    just 512Mi. However, on non-bursting autopilot clusters, the
    minimum is still 512Mi.
    
    Avoid invalid periodic sync status updates
    
    Skip sync status updates when the source cache is reset, otherwise
    the commit that was previously synced will be removed.
    
    Refactor Applier to expose an event handler
    
    - Use the new event handler to send errors and inventory updates
      to the caller (Updater) so it can track them without locks in
      the applier.
    - Update the Updater to cache the sync status with a shared lock.
    - Add initialization to the Updater to refresh the sync status
      using the RootSync/RepoSync status. This should avoid losing
      status fields after the reconciler is restarted or rescheduled.
      Unfortunately, this doesn't work for errors, because they can't
      be parsed back into typed errors. But this shouldn't be a real
      problem because the applier was invalidating them before each
      apply anyway.
    - Move periodic sync status updates into the Updater.
    - Add a baseFinalizer class to handle collecting destroyer errors.
    
    Double memory per object
    
    Even though the memory limit seems higher than usage, the reconciler
    is still being killed by the kernel because the node is out of
    memory on autopilot. So we need to be a little more defensive about
    requesting more memory than we need in order to avoid OOMKills on
    crowded nodes.
    
    Add TestStressProfileResourcesByObjectCount
    
    Fix e2e tests that revert manual changes
    
    In order for drift correction to work now, the object needs the
    correct owning-inventory label used by the watch filter.
    Without the label, drift will not be reverted.
    
    Add TestStressProfileResourcesByObjectCountWithMultiSync
    
    Fix waiting for deployments
    
    Use configsync.gke.io/parent-package-id as filter
    
    - Filter label key: configsync.gke.io/parent-package-id
    - PACKAGE_ID: <PACKAGE_ID_FULL[0:53]>-<hex(fnv32a(PACKAGE_ID_FULL))>
    - PACKAGE_ID_FULL: <NAME>.<NAMESPACE>.<RootSync|RepoSync>
    - Pre-pad the hex of the hash to 8 characters for consistent length
    
    Add configsync.gke.io/package-id label to RSyncs
    
    - Use reconciler-manager field manager when updating objects, to
      distinguish from the reconciler. Otherwise SSA from a parent
      reconciler managing an RSync will delete any metadata updates made
      by the reconciler-manager.
    
    Fix rebase errors
    
    Fix labels in TestKubectl tests
    
    Use latest cli-utils watch filters
    karlkfi committed May 28, 2024
    Configuration menu
    Copy the full SHA
    3b80a65 View commit details
    Browse the repository at this point in the history