Topology controller should not take ownership of cluster.spec.topology #6736

Closed
chrischdi opened this issue Jun 26, 2022 · 4 comments · Fixed by #6773
Labels: area/clusterclass, kind/bug

Comments

chrischdi (Member) commented Jun 26, 2022

What steps did you take and what happened:

Checked output of pull-cluster-api-e2e-full-main:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-e2e-full-main/1540324488158121984

In particular, look at the Cluster object and its resulting managed fields; the interesting part:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1beta1","kind":"Cluster","metadata":{"annotations":{},"labels":{"cni":"k8s-upgrade-and-conformance-l9e4bu-crs-0"},"name":"k8s-upgrade-and-conformance-l9e4bu","namespace":"k8s-upgrade-and-conformance-9sta2j"},"spec":{"clusterNetwork":{"pods":{"cidrBlocks":["192.168.0.0/16"]},"serviceDomain":"cluster.local","services":{"cidrBlocks":["10.128.0.0/12"]}},"topology":{"class":"quick-start","controlPlane":{"metadata":{},"replicas":3},"variables":[{"name":"etcdImageTag","value":""},{"name":"coreDNSImageTag","value":""}],"version":"v1.23.6","workers":{"machineDeployments":[{"class":"default-worker","failureDomain":"fd4","name":"md-0","replicas":1}]}}}}
  creationTimestamp: "2022-06-24T13:52:55Z"
  finalizers:
  - cluster.cluster.x-k8s.io
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: k8s-upgrade-and-conformance-l9e4bu
    cni: k8s-upgrade-and-conformance-l9e4bu-crs-0
    topology.cluster.x-k8s.io/owned: ""
  managedFields:
  - apiVersion: cluster.x-k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          f:cluster.x-k8s.io/cluster-name: {}
          f:cni: {}
          f:topology.cluster.x-k8s.io/owned: {}
      f:spec:
        f:clusterNetwork:
          f:pods:
            f:cidrBlocks: {}
          f:serviceDomain: {}
          f:services:
            f:cidrBlocks: {}
        f:controlPlaneRef:
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
        f:infrastructureRef:
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
        f:topology: # THE INTERESTING PART
          f:class: {}
          f:controlPlane:
            f:metadata: {}
            f:replicas: {}
          f:variables: {}
          f:version: {}
          f:workers:
            f:machineDeployments: {}
    manager: capi-topology
    operation: Apply
    time: "2022-06-24T14:12:06Z"
  - apiVersion: cluster.x-k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:cni: {}
      f:spec:
        .: {}
        f:clusterNetwork:
          .: {}
          f:pods:
            .: {}
            f:cidrBlocks: {}
          f:serviceDomain: {}
          f:services:
            .: {}
            f:cidrBlocks: {}
        f:topology:
          .: {}
          f:class: {}
          f:controlPlane:
            .: {}
            f:metadata: {}
            f:replicas: {}
          f:workers:
            .: {}
            f:machineDeployments: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2022-06-24T13:52:55Z"


The interesting part here is the co-ownership of the path spec.topology and the fields below it.
These are fields which are normally only read as input for computing the desired state and are never written by the topology controller.

The same may apply to the following paths, but this needs to be verified (a quick way to check the managed fields is sketched after this list):

  • spec.paused
  • spec.clusterNetwork
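
A minimal sketch of how such a check could be done programmatically, by inspecting metadata.managedFields of the Cluster object (managersOwning is a hypothetical helper for illustration, not cluster-api code):

import (
	"encoding/json"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// managersOwning returns the field managers that claim ownership of the given
// top-level spec field (e.g. "topology", "paused", "clusterNetwork").
// Sketch only: it looks one level deep into the FieldsV1 JSON.
func managersOwning(cluster *clusterv1.Cluster, field string) []string {
	var owners []string
	for _, mf := range cluster.ManagedFields {
		if mf.FieldsV1 == nil {
			continue
		}
		var fields map[string]interface{}
		if err := json.Unmarshal(mf.FieldsV1.Raw, &fields); err != nil {
			continue
		}
		spec, ok := fields["f:spec"].(map[string]interface{})
		if !ok {
			continue
		}
		if _, claimed := spec["f:"+field]; claimed {
			owners = append(owners, mf.Manager)
		}
	}
	return owners
}

For the dump above, managersOwning(cluster, "topology") would return both capi-topology and kubectl-client-side-apply, which is exactly the co-ownership described here.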

The issue is: if the topology controller reconciles and, at the same time, another controller or a user changes one of the co-owned fields, the topology controller may overwrite the change.

E.g. (a sketch of the write that causes the overwrite follows this example):

  • the topology controller starts reconciling the Cluster object using the old values (Cluster.spec.topology.workers.machineDeployments[0].replicas == 3)
  • a user edits the Cluster, sets Cluster.spec.topology.workers.machineDeployments[0].replicas = 100 and patches it to the API server
  • the topology controller finishes its reconciliation, which includes spec changes, and thereby resets the Cluster object to Cluster.spec.topology.workers.machineDeployments[0].replicas == 3
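
Conceptually, the final write is a forced server-side apply of the full desired Cluster. A minimal, hypothetical sketch using the controller-runtime client (applyDesiredCluster is illustrative, not the actual topology controller code):

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyDesiredCluster applies the desired Cluster with the "capi-topology"
// field manager. If `desired` was computed from a stale Cluster
// (replicas == 3) while a user has since set replicas = 100, forcing
// ownership of spec.topology resets the user's change back to 3.
func applyDesiredCluster(ctx context.Context, c client.Client, desired *clusterv1.Cluster) error {
	return c.Patch(ctx, desired,
		client.Apply,                       // server-side apply
		client.FieldOwner("capi-topology"), // same field manager as in the managedFields above
		client.ForceOwnership,              // conflicts are resolved in favour of this manager
	)
}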

What did you expect to happen:

The topology controller should not overwrite Cluster fields which it is not supposed to write to.

Anything else you would like to add:

  1. Possible solution (note: does not work out of the box, it would require a bigger refactoring in patchEngine.Apply):
    Adjust computeCluster in the topology controller's desired_state.go:

    func computeCluster(_ context.Context, s *scope.Scope, infrastructureCluster, controlPlane *unstructured.Unstructured) *clusterv1.Cluster {

    so that it only sets the paths which are supposed to be set by the topology controller:

    • metadata.labels
    • cluster.Spec.InfrastructureRef
    • cluster.Spec.ControlPlaneRef

    as well as other required metadata:

    • metadata.name
    • metadata.namespace

    (A rough sketch of this option follows the list.)
  2. Possible solution:
    Add the unrelated fields to the ignorePaths passed to the patch helper (see the patchHelperFactory call quoted in the comments below).
  3. Possible solutions: other variants.
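
A rough sketch of option 1, restricted to the fields listed above (placed in desired_state.go so it can reuse the surrounding imports; the label keys are written as literals for clarity, and the real function would likely need to carry over more data than shown):

// computeCluster (sketch): only set the fields the topology controller is
// supposed to own, so that spec.topology, spec.clusterNetwork, spec.paused etc.
// never end up in the desired object and are never claimed via apply.
func computeCluster(_ context.Context, s *scope.Scope, infrastructureCluster, controlPlane *unstructured.Unstructured) *clusterv1.Cluster {
	cluster := &clusterv1.Cluster{}

	// Metadata required to address the object.
	cluster.Name = s.Current.Cluster.Name
	cluster.Namespace = s.Current.Cluster.Namespace

	// Labels owned by the topology controller (keys as literals, matching the dump above).
	cluster.Labels = map[string]string{
		"cluster.x-k8s.io/cluster-name":   s.Current.Cluster.Name,
		"topology.cluster.x-k8s.io/owned": "",
	}

	// References written by the topology controller.
	cluster.Spec.InfrastructureRef = &corev1.ObjectReference{
		APIVersion: infrastructureCluster.GetAPIVersion(),
		Kind:       infrastructureCluster.GetKind(),
		Namespace:  infrastructureCluster.GetNamespace(),
		Name:       infrastructureCluster.GetName(),
	}
	cluster.Spec.ControlPlaneRef = &corev1.ObjectReference{
		APIVersion: controlPlane.GetAPIVersion(),
		Kind:       controlPlane.GetKind(),
		Namespace:  controlPlane.GetNamespace(),
		Name:       controlPlane.GetName(),
	}

	return cluster
}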

xref: this issue was discovered while testing dry-run server-side apply in #6710, but it should potentially already affect main.

Environment:

  • Cluster-api version:
  • minikube/kind version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

k8s-ci-robot added the kind/bug label on Jun 26, 2022
chrischdi (Member Author) commented:

cc @fabriziopandini @sbueringer

chrischdi (Member Author) commented Jun 26, 2022

/area topology

I think a fix should get cherry-picked to v1.2.

chrischdi (Member Author) commented Jun 26, 2022

#6738 implements option 1:

  • We always clearly define our desired Cluster object, which determines which fields the topology controller will own.
  • We cannot forget to add a field to the filtering at

    patchHelper, err := r.patchHelperFactory(s.Current.Cluster, s.Desired.Cluster, structuredmerge.IgnorePaths{

    because we already take care to only add the relevant fields to the desired object in computeCluster.

Cons of option 2 / the current implementation:

  • We have two places where we define the intent of our desired Cluster object, i.e. which fields will be owned by the topology controller for the Cluster object.
  • We would define pretty late which fields we have an opinion on (in reconcileCluster), compared to option 1 where we do it at the earliest possible stage.

Edit: option 1 does not work out of the box, because desired.Cluster.Spec.Topology is still used inside the patch engine.

chrischdi (Member Author) commented:

#6739 implements option 2, which simply extends the behaviour already in place for spec.controlPlaneEndpoint to the other fields not managed by the topology controller.
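
Roughly, option 2 boils down to listing the additional non-topology-managed paths in the IgnorePaths passed to the patch helper, next to the existing spec.controlPlaneEndpoint entry. A sketch (the path literals and surrounding error handling are illustrative; the exact element type and call site are whatever the snippet quoted above uses):

// Sketch of option 2: extend the ignore paths so the topology controller
// never claims ownership of fields it only reads as input.
patchHelper, err := r.patchHelperFactory(s.Current.Cluster, s.Desired.Cluster, structuredmerge.IgnorePaths{
	{"spec", "controlPlaneEndpoint"}, // already ignored today
	{"spec", "topology"},             // only read as input when computing the desired state
	{"spec", "clusterNetwork"},       // not managed by the topology controller
	{"spec", "paused"},               // not managed by the topology controller
})
if err != nil {
	return err
}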
