Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Configure VPA for reconciler, if enabled #691

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Jun 16, 2023

  • Configure VPA for reconciler deployment using annotation on
    RootSync/RepoSync:
    configsync.gke.io/reconciler-autoscaling-strategy: Auto
    • Auto - evict and recreate pods to apply recommended resource
      values, as needed.
    • Recommend - monitor and record recommended resource values for
      each reconciler, but don't automatically apply them.
    • Disabled - Do not apply any VPA config, and delete it if it
      exists with the same name as the reconciler.
  • VPA disabled by default (opt-in for preview and testing)
  • When VPA is enabled, set smaller resource requests/limits for
    smaller footprint on initial install. Adding limits helps
    hasten VPA adjustments by causing OOMKills, instead of waiting
    for the VPA to evict the pod.
  • Move regular (non-VPA) defaults out of a ConfigMap and into the
    reconciler-manager code, next to the new VPA resource defaults.
    This should make them easier to keep in sync.
  • test: Install VPA on kind when --vpa is specified
  • test: Enable the VPA addon in GKE when creating clusters when
    --vpa is specified
  • test: Rewrite some e2e tests to handle resource defaults
  • test: Log reconciler pod resources on test failure to help debug
    VPA.

Design: go/config-sync-reconciler-autoscaling

Bug: b/289388701

Depends On:

Notes:

  • The helm-sync container doesn't seem to handle OOMKills very well. The helm CLI is being executed and exits with a killed message, but it doesn't seem to recover and scale up fast enough to avoid e2e test timeout.

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi karlkfi requested review from sdowell and nan-yu and removed request for victorpras June 27, 2023 06:25
@karlkfi karlkfi force-pushed the karl-vpa branch 5 times, most recently from aff42e9 to f724628 Compare June 29, 2023 22:31
@karlkfi karlkfi force-pushed the karl-vpa branch 8 times, most recently from 75cc4c5 to 07e3b89 Compare July 12, 2023 22:40
@karlkfi karlkfi force-pushed the karl-vpa branch 5 times, most recently from 788c243 to e005fb5 Compare July 19, 2023 22:17
@karlkfi karlkfi force-pushed the karl-vpa branch 2 times, most recently from 279e471 to 30179cc Compare July 20, 2023 03:35
@karlkfi karlkfi force-pushed the karl-vpa branch 7 times, most recently from 8390ecb to 963434f Compare July 29, 2023 18:27
}
}
return nil
}

func (r *reconcilerBase) validateAnnotations(_ context.Context, rs client.Object) error {
autoscalingStrategy := reconcilerAutoscalingStrategy(rs)
if autoscalingStrategy != metadata.ReconcilerAutoscalingStrategyAuto &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably would be cleaner with a switch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's better, but i changed it. lmk.

case metadata.ReconcilerAutoscalingStrategyDisabled:
// delete if VPA is installed
if !vpaEnabled {
r.logger(ctx).Info("Managed object delete skipped - not enabled",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: doesn't seem like a skip, given there's nothing to delete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the log here.

Name: reconcilerRef.Name,
}
var updateMode autoscalingv1.UpdateMode
if autoscale {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: I feel like using the boolean here doesn't help much with readability. I feel it would be clearer just comparing directly to the declared consts (e.g. with a switch)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Boolean is different than just strategy == Auto here. The Boolean means Auto + APIService exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, this is a different place than I thought you were commenting on. Fixed it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the metrics-server component?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VPA requires the metrics-server. The metrics-server watches resource usage and exposes them as custom metrics.

@karlkfi karlkfi force-pushed the karl-vpa branch 5 times, most recently from 38f0141 to 66ac195 Compare July 31, 2023 18:28
@karlkfi karlkfi force-pushed the karl-vpa branch 4 times, most recently from 278ac82 to b0bddf2 Compare September 11, 2023 21:41
- Configure VPA for reconciler deployment using annotation on
  RootSync/RepoSync:
  `configsync.gke.io/reconciler-autoscaling-strategy: Auto`
  - Auto - evict and recreate pods to apply recommended resource
    values, as needed.
  - Recommend - monitor and record recommended resource values for
    each reconciler, but don't automatically apply them.
  - Disabled - Do not apply any VPA config, and delete it if it
    exists with the same name as the reconciler.
- VPA disabled by default (opt-in for preview and testing)
- When VPA is enabled, set smaller resource requests/limits for
  smaller footprint on initial install. Adding limits helps
  hasten VPA adjustments by causing OOMKills, instead of waiting
  for the VPA to evict the pod.
- Move regular (non-VPA) defaults out of a ConfigMap and into the
  reconciler-manager code, next to the new VPA resource defaults.
  This should make them easier to keep in sync.
- test: Install VPA on kind when --vpa is specified
- test: Enable the VPA addon in GKE when creating clusters when
  --vpa is specified
- test: Rewrite some e2e tests to handle resource defaults
- test: Log reconciler pod resources on test failure to help debug
  VPA.
Hopefully this leads to fewer oomkills
Wait for maagement conflict metric after manual change, with the
current sync labels.
@karlkfi
Copy link
Contributor Author

karlkfi commented Sep 13, 2023

/hold

This PR is on hold until the metrics-server is fixed to be highly available. When the metrics-server in unhealthy, config sync's api discovery breaks, which makes tests fail.

@mikebz
Copy link
Contributor

mikebz commented Apr 17, 2024

curious if this is still WIP or if it's even relevant. If it is consider closing the PR and keeping the private branch or converting to draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants