[WIP] Configure VPA for reconciler, if enabled #691

karlkfi · 2023-06-16T01:45:48Z

- Configure VPA for reconciler deployment using annotation on RootSync/RepoSync: `configsync.gke.io/reconciler-autoscaling-strategy: Auto`
  - Auto - evict and recreate pods to apply recommended resource values, as needed.
  - Recommend - monitor and record recommended resource values for each reconciler, but don't automatically apply them.
  - Disabled - Do not apply any VPA config, and delete it if it exists with the same name as the reconciler.
- VPA disabled by default (opt-in for preview and testing)
- When VPA is enabled, set smaller resource requests/limits for smaller footprint on initial install. Adding limits helps hasten VPA adjustments by causing OOMKills, instead of waiting for the VPA to evict the pod.
- Move regular (non-VPA) defaults out of a ConfigMap and into the reconciler-manager code, next to the new VPA resource defaults. This should make them easier to keep in sync.
- test: Install VPA on kind when --vpa is specified
- test: Enable the VPA addon in GKE when creating clusters when --vpa is specified
- test: Rewrite some e2e tests to handle resource defaults
- test: Log reconciler pod resources on test failure to help debug VPA.
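
To illustrate the opt-in behavior described above, here is a minimal sketch of reading the strategy from an object's annotations. The annotation key and the three values come from this description; the helper name `strategyFromAnnotations` and the fallback to `Disabled` when the annotation is missing are assumptions for illustration, not the code in this PR.

```go
package main

import "fmt"

// Annotation key and values as listed in the PR description.
const (
	autoscalingStrategyKey = "configsync.gke.io/reconciler-autoscaling-strategy"

	strategyAuto      = "Auto"
	strategyRecommend = "Recommend"
	strategyDisabled  = "Disabled"
)

// strategyFromAnnotations is a hypothetical helper. It returns the requested
// strategy, or Disabled when the annotation is absent or empty (assumed here
// because the description says VPA is disabled by default).
func strategyFromAnnotations(annotations map[string]string) string {
	if v, ok := annotations[autoscalingStrategyKey]; ok && v != "" {
		return v
	}
	return strategyDisabled
}

func main() {
	// No annotation set: falls back to the default.
	fmt.Println(strategyFromAnnotations(nil))

	// Annotation set on a RootSync/RepoSync: the value is returned as-is.
	fmt.Println(strategyFromAnnotations(map[string]string{
		autoscalingStrategyKey: strategyAuto,
	}))
}
```

Reading from a nil map is safe in Go, so callers would not need to initialize the annotations first.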

Design: go/config-sync-reconciler-autoscaling

Bug: b/289388701

Depends On:

Notes:

The helm-sync container doesn't seem to handle OOMKills very well. The helm CLI runs and exits with a "killed" message, but the container doesn't seem to recover and scale up fast enough to avoid the e2e test timeout.

google-oss-prow · 2023-06-27T02:53:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sdowell · 2023-07-30T02:13:06Z

pkg/reconcilermanager/controllers/reconciler_base.go

 		}
 	}
 	return nil
 }

+func (r *reconcilerBase) validateAnnotations(_ context.Context, rs client.Object) error {
+	autoscalingStrategy := reconcilerAutoscalingStrategy(rs)
+	if autoscalingStrategy != metadata.ReconcilerAutoscalingStrategyAuto &&


Probably would be cleaner with a switch

I'm not sure it's better, but I changed it. lmk.
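
For reference, a switch-based validation along the lines suggested in this thread might look roughly like the sketch below. The constants are local stand-ins for the `metadata.ReconcilerAutoscalingStrategy*` values and the function takes a plain string rather than a `client.Object`; both are simplifications assumed here, not the code that landed in the branch.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the strategy constants referenced in the diff.
const (
	strategyAuto      = "Auto"
	strategyRecommend = "Recommend"
	strategyDisabled  = "Disabled"
)

// errInvalidStrategy is a hypothetical sentinel error for this sketch.
var errInvalidStrategy = errors.New("invalid reconciler autoscaling strategy")

// validateStrategy accepts exactly the three documented values and rejects
// anything else, including values that differ only in case.
func validateStrategy(strategy string) error {
	switch strategy {
	case strategyAuto, strategyRecommend, strategyDisabled:
		return nil
	default:
		return fmt.Errorf("%w: %q (must be %q, %q, or %q)",
			errInvalidStrategy, strategy,
			strategyAuto, strategyRecommend, strategyDisabled)
	}
}

func main() {
	for _, s := range []string{"Auto", "auto"} {
		fmt.Printf("%q -> %v\n", s, validateStrategy(s))
	}
}
```

Compared with a chained `!=` condition, adding a fourth strategy later only touches the `case` line and the error message.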

sdowell · 2023-07-30T02:16:19Z

pkg/reconcilermanager/controllers/reconciler_base.go

+	case metadata.ReconcilerAutoscalingStrategyDisabled:
+		// delete if VPA is installed
+		if !vpaEnabled {
+			r.logger(ctx).Info("Managed object delete skipped - not enabled",


nit: doesn't seem like a skip, given there's nothing to delete

removed the log here.

sdowell · 2023-07-30T02:18:52Z

pkg/reconcilermanager/controllers/reconciler_base.go

+			Name: reconcilerRef.Name,
+		}
+		var updateMode autoscalingv1.UpdateMode
+		if autoscale {


minor nit: I feel like using the boolean here doesn't help much with readability. I feel it would be clearer just comparing directly to the declared consts (e.g. with a switch)

The Boolean is different than just strategy == Auto here. The Boolean means Auto + APIService exists.

oh, this is a different place than I thought you were commenting on. Fixed it here.
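
A rough sketch of the decision discussed in the last two threads, comparing against the declared strategy values instead of a precomputed boolean. Everything here is an assumption for illustration: the types are local stand-ins rather than the `autoscalingv1` API types, `planVPA` is a hypothetical name, and mapping Recommend to the VPA `Off` update mode is inferred from the description ("record recommended values, but don't automatically apply them") rather than taken from the diff.

```go
package main

import "fmt"

// updateMode mirrors the string values of the VPA update policy mode.
type updateMode string

const (
	updateModeAuto updateMode = "Auto"
	updateModeOff  updateMode = "Off" // recommendations only, no evictions
)

// vpaAction is what the reconciler-manager would do with the VPA object.
type vpaAction int

const (
	actionNone   vpaAction = iota // nothing to apply or delete
	actionApply                   // create or update the VPA object
	actionDelete                  // remove a previously created VPA object
)

// planVPA decides the action and update mode for a strategy. vpaInstalled
// stands for "the VPA APIService exists"; without it there is nothing to
// apply and nothing to delete.
func planVPA(strategy string, vpaInstalled bool) (vpaAction, updateMode) {
	if !vpaInstalled {
		return actionNone, ""
	}
	switch strategy {
	case "Auto":
		return actionApply, updateModeAuto
	case "Recommend":
		return actionApply, updateModeOff
	case "Disabled":
		return actionDelete, ""
	default:
		// Unknown values are assumed to be rejected earlier by validation.
		return actionNone, ""
	}
}

func main() {
	for _, strategy := range []string{"Auto", "Recommend", "Disabled"} {
		action, mode := planVPA(strategy, true)
		fmt.Printf("%s: action=%d mode=%q\n", strategy, action, mode)
	}
}
```

Checking `vpaInstalled` first keeps the Disabled case from attempting a delete when the VPA API isn't present, which matches the reason the skipped-delete log line was dropped.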

sdowell · 2023-07-30T03:38:31Z

e2e/testdata/metrics-server/components.yaml

What is the purpose of the metrics-server component?

VPA requires the metrics-server. The metrics-server collects resource usage from the kubelets and exposes it through the resource Metrics API (`metrics.k8s.io`).



karlkfi · 2023-09-13T19:29:29Z

/hold

This PR is on hold until the metrics-server is fixed to be highly available. When the metrics-server is unhealthy, Config Sync's API discovery breaks, which makes tests fail.

mikebz · 2024-04-17T20:05:41Z

Curious whether this is still WIP or even relevant. If it is, consider closing the PR and keeping the private branch, or converting it to a draft.

google-oss-prow bot added the do-not-merge/work-in-progress label Jun 16, 2023

google-oss-prow bot requested review from haiyanmeng and victorpras June 16, 2023 01:45

google-oss-prow bot added the size/XXL label Jun 16, 2023

karlkfi force-pushed the karl-vpa branch from d3c8734 to 9ba73d9 Compare June 27, 2023 02:53

karlkfi force-pushed the karl-vpa branch from 9ba73d9 to 310c92f Compare June 27, 2023 06:20

karlkfi requested review from sdowell and nan-yu and removed request for victorpras June 27, 2023 06:25

karlkfi force-pushed the karl-vpa branch 5 times, most recently from aff42e9 to f724628 Compare June 29, 2023 22:31

karlkfi force-pushed the karl-vpa branch 8 times, most recently from 75cc4c5 to 07e3b89 Compare July 12, 2023 22:40

karlkfi force-pushed the karl-vpa branch 5 times, most recently from 788c243 to e005fb5 Compare July 19, 2023 22:17

karlkfi force-pushed the karl-vpa branch 2 times, most recently from 279e471 to 30179cc Compare July 20, 2023 03:35

karlkfi force-pushed the karl-vpa branch 7 times, most recently from 8390ecb to 963434f Compare July 29, 2023 18:27

sdowell reviewed Jul 30, 2023


karlkfi force-pushed the karl-vpa branch 5 times, most recently from 38f0141 to 66ac195 Compare July 31, 2023 18:28

karlkfi force-pushed the karl-vpa branch 4 times, most recently from 278ac82 to b0bddf2 Compare September 11, 2023 21:41

karlkfi added 5 commits September 12, 2023 11:31

Reduce number of kind clusters to test in parallel

93218c5

Hopefully this leads to fewer oomkills

test: Disable otel-agent metrics buffering

54050d3

Always log recorded metrics

9810d28

fix: TestCRDDeleteBeforeRemoveCustomResourceV1

945403c

Wait for management conflict metric after manual change, with the current sync labels.

karlkfi force-pushed the karl-vpa branch from b0bddf2 to 945403c Compare September 12, 2023 18:33

google-oss-prow bot added the do-not-merge/hold label Sep 13, 2023

janetkuo removed the do-not-merge/hold label Apr 19, 2024


karlkfi replied to sdowell's Jul 30, 2023 review comments Jul 31, 2023
