
Add Alpha Support for Server Side Apply (2nd attempt) #246

Merged: 41 commits merged into crossplane-contrib:main from ssa-second-attempt on Sep 17, 2024

Conversation

@turkenh (Collaborator) commented May 16, 2024

Description of your changes

This PR adds alpha support for Server Side Apply, renovating the previous attempt.

Fixes #37
Fixes #57
Fixes #114
Fixes #115
Fixes #145
Fixes #269

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Automated

UPTEST_EXAMPLE_LIST="examples/object/object-ssa-owner.yaml" make e2e

Manual

  1. Enable server side apply.
  2. Apply examples/object/object-ssa-owner.yaml
  3. Apply examples/object/object-ssa-labeler.yaml
  4. Ensure the created service doesn't have the last-applied-configuration annotation.
  5. Ensure the created service has labels from both Objects.
  6. Ensure the created service has managedFields set properly.
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2024-05-16T18:08:16Z"
  labels:
    another-key: another-value
    some-key: some-value
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:some-key: {}
      f:spec:
        f:ports:
          k:{"port":80,"protocol":"TCP"}:
            .: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector: {}
    manager: provider-kubernetes/sample-service-owner
    operation: Apply
    time: "2024-05-16T18:08:15Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:another-key: {}
    manager: provider-kubernetes/sample-service-labeler
    operation: Apply
    time: "2024-05-16T18:08:26Z"
  name: sample-service
  namespace: default
  resourceVersion: "1142"
  uid: a7abcf8a-ae1d-427d-8df9-a9e53a4b2ce9
spec:
  clusterIP: 10.96.47.67
  clusterIPs:
  - 10.96.47.67
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 80
    protocol: TCP
    targetPort: 9376
  selector:
    app.kubernetes.io/name: MyApp
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

// the apiserver, so that we can compare it with the extracted state to
// decide whether the object is up-to-date or not.
desiredObj := desired.DeepCopy()
if err := c.client.Patch(ctx, desiredObj, client.Apply, client.ForceOwnership, client.FieldOwner(ssaFieldOwner(cr.Name)), client.DryRunAll); err != nil {
turkenh (Collaborator Author):
This is a workaround for kubernetes/kubernetes#115563.

@turkenh requested review from sttts and negz on May 16, 2024 18:21
@jbw976 mentioned this pull request on May 16, 2024
@@ -123,8 +123,10 @@ func enqueueObjectsForReferences(ca cache.Cache, log logging.Logger) func(ctx co
 	}
 	// queue those Objects for reconciliation
 	for _, o := range objects.Items {
-		log.Info("Enqueueing Object because referenced resource changed", "name", o.GetName(), "referencedGVK", rGVK.String(), "referencedName", ev.Object.GetName(), "providerConfig", pc)
-		q.Add(reconcile.Request{NamespacedName: types.NamespacedName{Name: o.GetName()}})
+		if o.Spec.Watch {
Reviewer:

why is this necessary?

turkenh (Collaborator Author):

We only want to enqueue the Object if spec.watch is set to true. Not every Object wants to watch the referenced resources. If you're questioning why we have such a field in the first place, I have a comment for that here.

Reviewer:

Not questioning the watch field, but why do we make the q.Add conditional on it? I understand that we don't start watches on their referenced resources, but here we are not starting a watch, just queuing them.

turkenh (Collaborator Author):

Ah, I see.

Watches are started based on GVK, not the namespaced name, so another Object with spec.watch: true might have already started a watch on the given GVK. For example, think of two Objects: one referencing secret foo with watch: true, the other referencing secret bar with watch: false. Even though the latter does not start a watch on Secrets, one was already started by the former. So we need to filter here, while adding to the queue; otherwise the latter Object would receive watch events it never asked for.
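
A minimal sketch of that filter, restating the diff above (controller-runtime types; the surrounding handler is elided):

    // Only Objects that opted in via spec.watch get enqueued. The GVK-level
    // watch may have been started on behalf of a different Object, so the
    // event alone doesn't mean this Object asked for it.
    for _, o := range objects.Items {
        if !o.Spec.Watch {
            continue // references the changed resource, but never asked to watch it
        }
        q.Add(reconcile.Request{NamespacedName: types.NamespacedName{Name: o.GetName()}})
    }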

return managed.ExternalObservation{}, errors.Wrap(err, "cannot extract SSA")
}
}

Reviewer:

This change lets me question the approach. Why do we have to compare the observation with our desired state? Why don't we watch the object, and re-apply our desired state on every event? If we have an apiserver roundtrip anyway for the dry-run, then we can also just directly server-side-apply.

@turkenh (Collaborator Author) commented May 20, 2024:

re-apply our desired state on every event

That wouldn't work since, in some cases, re-applying the same manifest simply bumps the resourceVersion, causing another update event. So, if we implemented it that way, we would end up in an update loop, i.e., apply -> update event -> apply -> update event ...
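
For illustration, a minimal sketch of the symptom inside an error-returning helper, using a controller-runtime client (hypothetical setup: c is a client.Client, desired is an *unstructured.Unstructured with its GVK set; this is not code from this PR):

    first := desired.DeepCopy()
    if err := c.Patch(ctx, first, client.Apply, client.FieldOwner("demo")); err != nil {
        return err
    }
    second := desired.DeepCopy()
    if err := c.Patch(ctx, second, client.Apply, client.FieldOwner("demo")); err != nil {
        return err
    }
    // With kubernetes/kubernetes#115563, the second, identical apply can still
    // bump resourceVersion; each bump emits another update event, so an
    // "apply on every event" strategy never converges.
    if first.GetResourceVersion() != second.GetResourceVersion() {
        log.Info("identical apply bumped resourceVersion")
    }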

There are various issues that, I have a hunch, all boil down to the same root cause:

If we have an apiserver roundtrip anyway for the dry-run

So, if the above bug didn't exist, we most probably wouldn't need that dry-run either. It is already a workaround for the same issue. Once the fix lands (and is prevalent enough), we can remove this workaround and hence get rid of the dry-run call.
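
Roughly, the observe path this thread discusses looks like the sketch below (simplified; extractor is a client-go UnstructuredExtractor, and isUpToDate stands in for the PR's actual comparison):

    // Extract the fields our field manager owns from the observed object.
    extracted, err := extractor.Extract(observed, ssaFieldOwner(cr.Name))
    if err != nil {
        return managed.ExternalObservation{}, errors.Wrap(err, "cannot extract SSA")
    }
    // Workaround for kubernetes/kubernetes#115563: dry-run apply the desired
    // state so the comparison sees the apiserver's final, defaulted form.
    desiredObj := desired.DeepCopy()
    if err := c.client.Patch(ctx, desiredObj, client.Apply, client.ForceOwnership,
        client.FieldOwner(ssaFieldOwner(cr.Name)), client.DryRunAll); err != nil {
        return managed.ExternalObservation{}, errors.Wrap(err, "cannot dry-run apply")
    }
    // If what we own matches what we want, the object is up to date and no
    // non-dry-run apply is needed this reconcile.
    upToDate := isUpToDate(extracted, desiredObj)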

Member:

So, if we implement it in that way, we would end up in an update loop, i.e., apply -> update event -> apply -> update event

Given that composition functions use SSA to apply composed resources, should we expect the same problem in c/c if realtime compositions are used together with functions?

turkenh (Collaborator Author):

Probably.

I reproduced it with kubectl using the deployment in this issue, and I know that an empty map triggers it. So it doesn't happen for all resources, but it does for some. I am almost sure there will be some MRs showing similar behavior under certain conditions. It reproduces with a CRD / CR, for example.

Member:

That's quite unfortunate. 🙁 I've added some notes to the SSA and realtime composition lifecycle tracking issues just in case we see this in the wild.

@bobh66 (Contributor) commented Jun 19, 2024

@turkenh @negz @sttts Where does this PR stand? We are seeing problems similar to #242 and #114 so if there is any way to get this into a release for alpha testing that would be great. Thanks

@turkenh (Collaborator Author) commented Jul 2, 2024

@turkenh @negz @sttts Where does this PR stand? We are seeing problems similar to #242 and #114 so if there is any way to get this into a release for alpha testing that would be great. Thanks

@bobh66 I just pushed a commit wrapping the work up on a best-effort basis. I believe the PR is ready to go.

@sttts opened a follow-up issue to avoid the dry-run calls we discussed above. We will make sure to resolve it before promoting the feature to beta (and/or shipping v1.0.0).

@@ -66,6 +66,7 @@ func main() {

enableManagementPolicies = app.Flag("enable-management-policies", "Enable support for Management Policies.").Default("true").Envar("ENABLE_MANAGEMENT_POLICIES").Bool()
enableWatches = app.Flag("enable-watches", "Enable support for watching resources.").Default("false").Envar("ENABLE_WATCHES").Bool()
enableServerSideApply = app.Flag("enable-server-side-apply", "Enable server side apply to sync object manifests to k8s API.").Default("false").Envar("ENABLE_SERVER_SIDE_APPLY").Bool()
Reviewer:

Is this meant as a feature gate? It doesn't look like one from the name. I don't like forcing people to understand such low-level concepts; i.e., this kind of flag should not exist medium-term.

turkenh (Collaborator Author):

What is wrong with the name, and what could be an alternative?

I agree that people shouldn't need to understand the details of how we sync the desired state, but I don't feel comfortable replacing the existing mechanism outright, so I wanted to put this behind a feature flag, similar to https://github.com/crossplane/crossplane/blob/2a427fc89da4a2e86f75eaecfa8328f53e19f33b/cmd/crossplane/core/core.go#L115
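
For reference, a sketch of how such a flag typically feeds a crossplane-runtime feature gate (illustrative wiring; features.EnableAlphaServerSideApply is an assumed flag name, not necessarily what this PR ends up with):

    feats := &feature.Flags{}
    if *enableServerSideApply {
        // Assumed flag constant; gates the SSA code path instead of
        // replacing the existing sync mechanism outright.
        feats.Enable(features.EnableAlphaServerSideApply)
        log.Info("Alpha feature enabled", "flag", features.EnableAlphaServerSideApply)
    }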

Reviewer:

If this is a feature gate that goes away after alpha/beta, then this is fine. Wasn't sure whether it is meant to stay forever as a knob.

}

if c.ssaEnabled {
dc, err := discovery.NewDiscoveryClientForConfig(rc)
Reviewer:

This looks wrong. It is called on every reconcile, isn't it? Instantiating a new discovery client is heavy.

@turkenh (Collaborator Author) commented Jul 2, 2024:

The problem is that this is not static: we need to initialize the discovery client for the kubeconfig configured in the ProviderConfig referenced by the Object. That can change at any time, so we need to initialize it during reconcile, unless we do some sort of caching, which doesn't feel trivial.

Reviewer:

We have been digging a bit deeper. The extractor below downloads the OpenAPI v2 schema via dc. This is a big (multi-megabyte) proto file, and it grows a lot more if you have many CRDs from providers installed. The reconciler would do that on every reconcile, i.e. a tens-of-megabytes download plus unmarshalling. This is likely a blocker for the approach 🔴.

The solution might be caching the discovery client by provider config.

Collaborator:

turkenh#3, which implements a custom CachingUnstructuredExtractor that uses OpenAPI v3 discovery and caches per GV, is merged into this feature branch.
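
Roughly, the shape of that extractor (illustrative names; parserForGV and extractUnstructured are hypothetical helpers standing in for the feature branch's actual code):

    type cachingUnstructuredExtractor struct {
        dc    discovery.DiscoveryInterface // rebuilt per reconcile from the ProviderConfig
        cache *GVKParserCache              // shared across reconciles for the same ProviderConfig
    }

    // Extract satisfies client-go's applyconfigurations/meta/v1 UnstructuredExtractor,
    // but resolves the schema through the per-GV cache instead of downloading
    // the whole OpenAPI document on every call.
    func (e *cachingUnstructuredExtractor) Extract(obj *unstructured.Unstructured, fieldManager string) (*unstructured.Unstructured, error) {
        parser, err := e.parserForGV(obj.GroupVersionKind().GroupVersion()) // cache hit, or fetch this GV's schema only
        if err != nil {
            return nil, err
        }
        return extractUnstructured(parser, obj, fieldManager)
    }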

Reviewer:

The original blocker problem is solved with the caching.

@bobh66 (Contributor) commented Aug 29, 2024

@turkenh is there a way to move this forward?

@turkenh (Collaborator Author) commented Sep 3, 2024

@turkenh is there a way to move this forward?

@bobh66 we are on it.
@erhancagirici is working on resolving the two points blocking this PR.

@erhancagirici force-pushed the ssa-second-attempt branch 2 times, most recently from 7fdc130 to f3c2080 on September 12, 2024 16:28
)

// cachingUnstructuredExtractor is a caching implementation of v1.UnstructuredExtractor
// using OpenAPI V3 discovery information.
Reviewer:

be clear what it caches.

func (cm *GVKParserCacheManager) LoadOrNewCacheForProviderConfig(pc *v1alpha1.ProviderConfig) (*GVKParserCache, error) {
cm.mu.Lock()
defer cm.mu.Unlock()
sc, ok := cm.cacheStore[pc.GetUID()]
Reviewer:

Why is the UID enough? Couldn't the content change?

Collaborator:

The cache manager stores the collection of caches per provider config. The actual cache is the one that is returned; this happens at Connect time, and the cache is handed to the reconciliation scope. Content changes are then handled there.

@sttts commented Sep 13, 2024:

At connect time you call LoadOrNewCacheForProviderConfig, but it will return the same object on every reconcile. If the ProviderConfig changes, the UID will still be the same, and with that the cache is potentially invalid. What am I missing?

@erhancagirici (Collaborator) commented Sep 13, 2024:

At connect time you call LoadOrNewCacheForProviderConfig, but it will return the same object on every reconcile.

Yes, we get the same GVKParserCache here for the PC. LoadOrNewCacheForProviderConfig just gives you the actual cache to maintain. Think of GVKParserCacheManager as a mere storage organizer of caches: it just points you to the right drawer reserved for the PC. We then pass the GVKParserCache we got to the extractor at Connect time.

The cache itself is maintained after that point by the extractor here

At each extract request, a v3 discovery call is initiated that returns a mapping of all served GVs to ETagged API paths. Then, according to the output:

  • stale entries are invalidated (non-matching ETags, or APIs no longer being served);
  • the schema for the relevant GV of the object being extracted is retrieved (if it is not in the cache after the invalidations) and cached with its ETag.

So, when the ProviderConfig changes, the returned discovery information changes too, as the discovery client itself is rebuilt on each reconcile. The invalidations then take place, so we operate on up-to-date info.
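
In sketch form, the invalidation step described above (names are illustrative, not the branch's actual identifiers):

    // discovered: GV path -> current ETag, from a fresh OpenAPI v3 discovery call.
    // cache.entries: GV path -> cached {etag, parsed schema}, kept across reconciles.
    for gvPath, e := range cache.entries {
        currentETag, served := discovered[gvPath]
        if !served || currentETag != e.etag {
            delete(cache.entries, gvPath) // API no longer served, or its schema changed
        }
    }
    // After invalidation, only the GV of the object being extracted is (re)fetched
    // and stored with its new ETag; every other cached GV is reused as-is.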

sttts:

Ack, seems sound. This is crucial information that needs to be in some comments.


/*
This file is forked from upstream kubernetes/client-go
https://github.com/kubernetes/client-go/blob/0b9a7d2f21befcfd98bf2e62ae68ea49d682500d/openapi/groupversion.go
Reviewer:

why?

echo "SSA validation failed! Labels 'some-key' and 'another-key' from both Objects should exist with values 'some-value' and 'another-value' respectively!"
#exit 1
fi
echo "Successfully validated the SSA feature!"
Reviewer:

is this the only e2e test we have? 😱

turkenh (Collaborator Author):

Currently we rely on the uptest framework for e2e tests, just like the other providers, and what we can do with uptest is limited, but it is still better than nothing. Also, due to a current limitation that is about to be resolved, we couldn't really integrate this test into CI.

I would suggest:

  • Merging this PR by manually triggering those tests.
  • Fast follow to enable them in CI with the corresponding uptest fix, so that we can make sure they run for every PR.
  • Creating a ticket to discuss whether we need a more sophisticated testing framework here.

@sttts commented Sep 17, 2024

My comments are addressed. So, as far as I can judge, and given the lack of proper e2e tests, this sgtm.

Let's hope there are no fireworks 💥🍾🤞

@turkenh turkenh merged commit c5b4b82 into crossplane-contrib:main Sep 17, 2024
7 checks passed