
🌱 Improve dry run for server side apply using an annotation #6709

Conversation

fabriziopandini
Member

What this PR does / why we need it:
This PR proposes an improvement to how we implement dry run for server side apply by introducing an annotation.

The proposed implementation has the following advantages:

  • dry run is executed in memory, without additional calls to the API server
  • the diff logic doesn't rely on managedFields, which at this stage we consider an internal detail of SSA rather than an API we can rely on
  • the diff logic is self-contained and thus resilient to operations that don't preserve managedFields, like backup and restore or move, and to users/other controllers modifying managed fields in unexpected ways
  • the diff logic is less impacted by schema changes
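For readers skimming the thread, here is a minimal, self-contained sketch of the annotation-based flow described above; the annotation key and helper names are illustrative, not the identifiers used in this PR:

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Illustrative annotation key; the key used by the PR may differ.
const lastAppliedIntentAnnotation = "topology.cluster.x-k8s.io/last-applied-intent"

// encodeIntent stores the controller's intent as gzip+base64 encoded JSON,
// suitable for use as an annotation value.
func encodeIntent(intent map[string]interface{}) (string, error) {
	raw, err := json.Marshal(intent)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decodeIntent reverses encodeIntent, returning the previously applied intent.
func decodeIntent(value string) (map[string]interface{}, error) {
	zipped, err := base64.StdEncoding.DecodeString(value)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(zipped))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	out := map[string]interface{}{}
	if err := json.NewDecoder(zr).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	intent := map[string]interface{}{"spec": map[string]interface{}{"replicas": 3}}
	encoded, _ := encodeIntent(intent)
	previous, _ := decodeIntent(encoded)
	// The dry-run diff would now compare current intent, previous intent and
	// current object entirely in memory, without calling the API server.
	fmt.Println(lastAppliedIntentAnnotation, previous)
}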

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from fabriziopandini after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 23, 2022
Member

@sbueringer sbueringer left a comment


First round of review (everything except diff package)
Now continuing with the diff package

@@ -34,6 +34,20 @@ func (p Path) Append(k string) Path {
return append(p, k)
}

// Item returns a path for an item element, represented as <path> + []. Item elements are used as
// a placeholder for schema definitions for list items or for map items.
Member

Just so I get this right:

The output of this func for the Path

"spec", "template", "spec"

is a Path

"spec", "template", "spec[]"

I somehow can't map the method name to the functionality. Isn't it rather something like ToArray or AsArray??

Path{"a","b","c"}.ToArray() => Path{"a","b","c[]"}

Should we add a unit test?
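A hedged sketch of such a test, assuming the method keeps its current name Item and the Path type lives in the contract package (adjust the import path to wherever Path actually is):

package contract_test

import (
	"testing"

	. "github.com/onsi/gomega"

	"sigs.k8s.io/cluster-api/internal/contract"
)

func TestPathItem(t *testing.T) {
	g := NewWithT(t)

	// Per the discussion above, Item() is expected to turn the last path
	// element into its item placeholder, e.g. "spec" -> "spec[]".
	p := contract.Path{"spec", "template", "spec"}
	g.Expect(p.Item()).To(Equal(contract.Path{"spec", "template", "spec[]"}))
}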

@@ -114,7 +115,8 @@ func (r *Reconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager, opt
r.patchEngine = patches.NewEngine(r.RuntimeClient)
r.recorder = mgr.GetEventRecorderFor("topology/cluster")
if r.patchHelperFactory == nil {
r.patchHelperFactory = serverSideApplyPatchHelperFactory(r.Client)
crdSchemaCache := diff.NewCRDSchemaCache(r.Client)
r.patchHelperFactory = serverSideApplyPatchHelperFactory(r.Client, crdSchemaCache)
Member

nit: Should we just do diff.NewCRDSchemaCache(r.Client) in serverSideApplyPatchHelperFactory? The func already has the client and seems like a good way to avoid having to create the CRDSchemaCache on all callsites
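For reference, roughly what that would look like (a sketch that keeps the factory signature introduced by this PR; imports omitted since they match the surrounding file):

// serverSideApplyPatchHelperFactory returns a PatchHelperFactoryFunc that uses
// server side apply; the CRD schema cache is created here so that callers only
// have to pass in the client.
func serverSideApplyPatchHelperFactory(c client.Client) structuredmerge.PatchHelperFactoryFunc {
	crdSchemaCache := diff.NewCRDSchemaCache(c)
	return func(ctx context.Context, original, modified client.Object, opts ...structuredmerge.HelperOption) (structuredmerge.PatchHelper, error) {
		return structuredmerge.NewServerSidePatchHelper(ctx, original, modified, c, crdSchemaCache, opts...)
	}
}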

}
}

// dryRunPatchHelperFactory makes use of a two-ways patch and is used in situations where we cannot rely on managed fields.
func dryRunPatchHelperFactory(c client.Client) structuredmerge.PatchHelperFactoryFunc {
return func(original, modified client.Object, opts ...structuredmerge.HelperOption) (structuredmerge.PatchHelper, error) {
return func(ctx context.Context, original, modified client.Object, opts ...structuredmerge.HelperOption) (structuredmerge.PatchHelper, error) {
Member

Suggested change
return func(ctx context.Context, original, modified client.Object, opts ...structuredmerge.HelperOption) (structuredmerge.PatchHelper, error) {
return func(_ context.Context, original, modified client.Object, opts ...structuredmerge.HelperOption) (structuredmerge.PatchHelper, error) {

nit

@@ -2520,6 +2537,13 @@ func TestReconcileMachineDeploymentMachineHealthCheck(t *testing.T) {
uidsByName[mdts.Object.Name] = mdts.Object.GetUID()

if mdts.MachineHealthCheck != nil {
// Adds the last applied intent annotation, assuming the current object originated from the topology controller without external changes.
currentIntent := &unstructured.Unstructured{}
g.Expect(env.Scheme().Convert(mdts.MachineHealthCheck, currentIntent, nil)).To(Succeed())
Member

I assume this back and forth is because AddCurrentIntentAnnotation only supports Unstructured, correct?
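If that's the case, a slightly lighter alternative for the test might be the unstructured converter instead of a scheme round-trip (just a sketch; it assumes AddCurrentIntentAnnotation keeps accepting *unstructured.Unstructured):

// Convert the typed MachineHealthCheck into an Unstructured before handing it
// to AddCurrentIntentAnnotation.
obj, err := runtime.DefaultUnstructuredConverter.ToUnstructured(mdts.MachineHealthCheck)
g.Expect(err).ToNot(HaveOccurred())
currentIntent := &unstructured.Unstructured{Object: obj}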

@@ -41,7 +42,7 @@ type serverSidePatchHelper struct {
}

// NewServerSidePatchHelper returns a new PatchHelper using server side apply.
func NewServerSidePatchHelper(original, modified client.Object, c client.Client, opts ...HelperOption) (PatchHelper, error) {
func NewServerSidePatchHelper(ctx context.Context, original, modified client.Object, c client.Client, crdSchemaCache diff.CRDSchemaCache, opts ...HelperOption) (PatchHelper, error) {
Member

nice!

Can we please propagate ctx also into cleanupLegacyManagedFields?

// Gets the schema for the modified object gvk.
// NOTE: this schema drives DryRunDiff operations; modified (current intent) and original (current object)
// are of the same gvk, given that we are always calling UpdateReferenceAPIContract when reading both of them.
// previousIntent instead could be of an older version, but this impacts dryRun only partially (see diff.isDroppingAnyIntent for more details)
Member

I was initially thinking that the controller is always restarted after relevant CRDs change.

But I think this is only true for core CRDs (https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-upgrade-v0-3-to-main/1539927677492793344/artifacts/clusters/clusterctl-upgrade-9ovu0z/clusterctl-upgrade.log).

After the core controller and its CRDs are updated, we update all other providers. During that process CRDs are patched; if providers add new fields to existing apiVersions, which frequently happens, we will probably have a stale cache here.

What about additionally using the CRD resourceVersion as part of the cache key to avoid this issue?

Potential follow-up issue: do we care about evicting stale cache entries over time? I think it's not a factor, as we can assume that CAPI is updated frequently enough that we won't accumulate enough stale entries to run into issues, right?
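To make the proposal concrete, a sketch of a composite cache key; the key type and helper are made up for this comment, and callers would still need a cheap way to learn the CRD's current resourceVersion (e.g. a metadata-only get or an informer):

// crdSchemaCacheKey keys cached schemas by GVK plus the resourceVersion of the
// CRD the schema was derived from, so patching a CRD invalidates the entry.
type crdSchemaCacheKey struct {
	gvk             schema.GroupVersionKind
	resourceVersion string
}

func cacheKeyFor(gvk schema.GroupVersionKind, crd *apiextensionsv1.CustomResourceDefinition) crdSchemaCacheKey {
	return crdSchemaCacheKey{gvk: gvk, resourceVersion: crd.GetResourceVersion()}
}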

}

crd := &apiextensionsv1.CustomResourceDefinition{}
crd.SetName(fmt.Sprintf("%s.%s", flect.Pluralize(strings.ToLower(gvk.Kind)), gvk.Group))
Member

I'm kind of fine with making that assumption. But should we note somewhere that this might not work with all providers?

We wanted to check if all providers are naming their CRDs correctly in: #5686

@@ -135,6 +152,12 @@ func (h *serverSidePatchHelper) Patch(ctx context.Context) error {
log := ctrl.LoggerFrom(ctx)
log.V(5).Info("Patching object", "Intent", h.modified)

// Stores che current intent as last applied intent.
Member

Suggested change
// Stores che current intent as last applied intent.
// Stores the current intent as last applied intent.

nit

@@ -444,3 +462,23 @@ func TestServerSideApply_CleanupLegacyManagedFields(t *testing.T) {
g.Expect(gotSSAManager).To(BeTrue())
})
}

// getTopologyManagedFields returns metadata.managedFields entry tracking
Member

Should we just stop caring about what happens to managed fields?

Member

@sbueringer sbueringer left a comment


mid 2nd-round of review

var previousIntentJSON bytes.Buffer
zr, err := gzip.NewReader(bytes.NewReader(previousIntentJSONGZIP))
if err != nil {
return nil, errors.Wrap(err, "failed to create gzip reader forprevious intent")
Member

Suggested change
return nil, errors.Wrap(err, "failed to create gzip reader forprevious intent")
return nil, errors.Wrap(err, "failed to create gzip reader for previous intent")

nit

}

if _, err := io.Copy(&previousIntentJSON, zr); err != nil { //nolint:gosec
return nil, errors.Wrap(err, "failed to copy from gzip reader")
Member

Suggested change
return nil, errors.Wrap(err, "failed to copy from gzip reader")
return nil, errors.Wrap(err, "failed to copy previous intent from gzip reader")

nit

}

if err := zr.Close(); err != nil {
return nil, errors.Wrap(err, "failed to close gzip reader for managed fields")
Member

Suggested change
return nil, errors.Wrap(err, "failed to close gzip reader for managed fields")
return nil, errors.Wrap(err, "failed to close gzip reader for previous intent")


previousIntentMap := make(map[string]interface{})
if err := json.Unmarshal(previousIntentJSON.Bytes(), &previousIntentMap); err != nil {
return nil, errors.Wrap(err, "failed to unmarshal managed fields")
Member

Suggested change
return nil, errors.Wrap(err, "failed to unmarshal managed fields")
return nil, errors.Wrap(err, "failed to unmarshal previous intent")

}

return &unstructured.Unstructured{Object: previousIntentMap}, nil
Member

Do we have to use the Kubernetes decoder? Otherwise we should run into the int64 vs float64 issue (xref: #6668)
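One option (a sketch, assuming we keep unmarshalling into a plain map) is apimachinery's json package, which keeps whole numbers as int64 when decoding into map[string]interface{}:

// utiljson is "k8s.io/apimachinery/pkg/util/json"; unlike encoding/json it
// decodes integral numbers into int64 instead of float64.
previousIntentMap := make(map[string]interface{})
if err := utiljson.Unmarshal(previousIntentJSON.Bytes(), &previousIntentMap); err != nil {
	return nil, errors.Wrap(err, "failed to unmarshal previous intent")
}

return &unstructured.Unstructured{Object: previousIntentMap}, nil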

}

// GetItemWithKeys returns the element with the given keys/values from the list.
func (ld *ListDef) GetItemWithKeys(list []interface{}, keys map[string]interface{}) map[string]interface{} {
Member

Isn't keys actually a map?

}

// LoadOrStore returns the CRDSchema value for the gvk, if present.
// Otherwise, if not present, it retrieves the corresponding CRD definition, infer the CRDSchema, stores it, returns it.
Member

Suggested change
// Otherwise, if not present, it retrieves the corresponding CRD definition, infer the CRDSchema, stores it, returns it.
// Otherwise, if not present, it retrieves the corresponding CRD definition, infers the CRDSchema, stores it and returns it.

Comment on lines +82 to +84
if s, ok := sc.m[gvk]; ok {
return s, nil
}
Member

Repeating this check here essentially means the earlier load call is redundant.

We could get rid of load and store, move the mutex up to LoadOrStore, and keep the schema calculation in a util func if we want.
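A sketch of that simplification; sc.lock, sc.m and computeCRDSchema are assumed names for the cache's mutex, map and the extracted util func:

// LoadOrStore returns the cached CRDSchema for gvk, computing and storing it on
// first use. A single lock around the whole operation also prevents two
// goroutines from computing the same schema concurrently.
func (sc *crdSchemaCache) LoadOrStore(ctx context.Context, gvk schema.GroupVersionKind) (CRDSchema, error) {
	sc.lock.Lock()
	defer sc.lock.Unlock()

	if s, ok := sc.m[gvk]; ok {
		return s, nil
	}

	s, err := computeCRDSchema(ctx, sc.client, gvk) // util func: get the CRD and walk its openAPIV3Schema
	if err != nil {
		return nil, err
	}
	sc.m[gvk] = s
	return s, nil
}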

crd := &apiextensionsv1.CustomResourceDefinition{}
crd.SetName(fmt.Sprintf("%s.%s", flect.Pluralize(strings.ToLower(gvk.Kind)), gvk.Group))
if err := sc.client.Get(ctx, client.ObjectKeyFromObject(crd), crd); err != nil {
return nil, errors.Wrapf(err, "failed to get CRD for %s", gvk.String())
Member

nit: should we just use the CRD name in the error? Seems easier to understand

schema := CRDSchema{}
for _, v := range crd.Spec.Versions {
if v.Name == gvk.Version && v.Schema != nil && v.Schema.OpenAPIV3Schema != nil {
addToSchema(contract.Path{}, schema, v.Schema.OpenAPIV3Schema)
Member

Q: Shouldn't it be an error if we can't find a schema for GVK?
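i.e. something along these lines (a sketch; exact wording aside):

schema := CRDSchema{}
found := false
for _, v := range crd.Spec.Versions {
	if v.Name == gvk.Version && v.Schema != nil && v.Schema.OpenAPIV3Schema != nil {
		addToSchema(contract.Path{}, schema, v.Schema.OpenAPIV3Schema)
		found = true
	}
}
if !found {
	// Surface a clear error instead of silently returning an empty schema.
	return nil, errors.Errorf("failed to find an openAPIV3Schema for version %s in CRD %s", gvk.Version, crd.Name)
}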

@@ -444,3 +462,23 @@ func TestServerSideApply_CleanupLegacyManagedFields(t *testing.T) {
g.Expect(gotSSAManager).To(BeTrue())
})
Member

@chrischdi chrischdi Jun 23, 2022


Some new / broken test cases (same for #6710, and maybe main); these test cases should result in the object not getting created.

	t.Run("Error on object which has another uid due to immutability", func(t *testing.T) {
		g := NewWithT(t)

		// Get the current object (assumes tests to be run in sequence).
		original := obj.DeepCopy()
		g.Expect(env.GetAPIReader().Get(ctx, client.ObjectKeyFromObject(original), original)).To(Succeed())

		// Create a patch helper for a modified object with some changes compared to what was previously applied by the topology manager.
		modified := obj.DeepCopy()
		g.Expect(unstructured.SetNestedField(modified.Object, "changed", "spec", "controlPlaneEndpoint", "host")).To(Succeed())
		g.Expect(unstructured.SetNestedField(modified.Object, "changed-by-topology-controller", "spec", "bar")).To(Succeed())

		// Set another uid on original
		original.SetUID("a-wrong-one")
		modified.SetUID("")

		// Create a patch helper which should fail because original's real UID changed.
		_, err := NewServerSidePatchHelper(ctx, original, modified, env.GetClient(), schemaCache)
		g.Expect(err).To(HaveOccurred())
	})
	t.Run("Error on object which does not exist (anymore) but was expected to get updated", func(t *testing.T) {
		original := builder.TestInfrastructureCluster(ns.Name, "obj3").WithSpecFields(map[string]interface{}{
			"spec.controlPlaneEndpoint.host": "1.2.3.4",
			"spec.controlPlaneEndpoint.port": int64(1234),
			"spec.foo":                       "", // this field is then explicitly ignored by the patch helper
		}).Build()

		modified := original.DeepCopy()

		// Set a non-existing uid on the original object, which does not exist in the cluster
		original.SetUID("does-not-exist")

		// Create a patch helper which should fail because original does not exist.
		_, err := NewServerSidePatchHelper(ctx, original, modified, env.GetClient(), schemaCache)
		g.Expect(err).To(HaveOccurred())
	})

@fabriziopandini
Member Author

/hold
while we discuss whether to proceed with this approach or the one proposed in #6710

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2022
@k8s-ci-robot
Contributor

@fabriziopandini: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

@fabriziopandini I think we can close this PR, given we chose SSA dry-run?

@fabriziopandini
Member Author

yes!
/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closed this PR.

In response to this:

yes!
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabriziopandini fabriziopandini deleted the topology-ssa-improvements branch July 15, 2022 09:50
@killianmuldoon killianmuldoon added the area/clusterclass Issues or PRs related to clusterclass label May 4, 2023
Labels
  • area/clusterclass - Issues or PRs related to clusterclass
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
  • needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
5 participants