
🌱 [WIP] [DNR] Reproduce clusterctl upgrade e2e test flake #8120

Conversation

sbueringer
Member

What this PR does / why we need it:
Currently contains:

  • Improve KCP logging
  • Improve e2e test assertion
  • Increase timeout for which Machines have to remain stable

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 16, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from sbueringer. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sbueringer
Member Author

cc @furkatgofurov7

I looked into the test artifacts of a failed test. It looks like KCP rolls out a new Machine for some reason after the upgrade to main. I improved the logging in KCP and in the e2e test. That should give us more data as soon as we hit the flake again.

@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

1 similar comment
@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@furkatgofurov7
Member

cc @furkatgofurov7

I looked into the test artifacts of a failed test. It looks like KCP rolls out a new Machine for some reason after the upgrade to main. I improved the logging in KCP and in the e2e test. That should give us more data as soon as we hit the flake again

Hey @sbueringer, thanks a lot. I was testing just the timeout increase in #8119 and it seems to pass several runs in a row. But I agree with the above: it is most likely a rollout issue rather than a timeout issue.

@sbueringer
Member Author

My theory is that it's a race condition that leaves the objects in a state in which KCP triggers a rollout after the upgrade.

@sbueringer
Member Author

An unrelated test case failed.
/test pull-cluster-api-e2e-full-main

@furkatgofurov7
Member

Okay, so we detected a rollout, as per the logs:

 INFO: Rollout detected
  INFO: Detected new machines
  INFO: New machine clusterctl-upgrade/clusterctl-upgrade-61xx5y-control-plane-4kvpv:
  Object:
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Machine
    metadata:
      annotations:
        controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{"certSANs":["localhost","127.0.0.1","0.0.0.0","host.docker.internal"]},"controllerManager":{"extraArgs":{"enable-hostpath-provisioner":"true"}},"scheduler":{},"dns":{}}'
      creationTimestamp: "2023-02-16T11:12:10Z"
      finalizers:
      - machine.cluster.x-k8s.io
      generation: 2
      labels:
        cluster.x-k8s.io/cluster-name: clusterctl-upgrade-61xx5y
        cluster.x-k8s.io/control-plane: ""
        cluster.x-k8s.io/control-plane-name: clusterctl-upgrade-61xx5y-control-plane
      managedFields:
      - apiVersion: cluster.x-k8s.io/v1beta1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: {}
            f:finalizers:
              .: {}
              v:"machine.cluster.x-k8s.io": {}
            f:labels:
              .: {}
              f:cluster.x-k8s.io/cluster-name: {}
              f:cluster.x-k8s.io/control-plane: {}
              f:cluster.x-k8s.io/control-plane-name: {}
            f:ownerReferences:
              .: {}
              k:{"uid":"91a4ee37-bdcb-4455-9350-2058b1728ed3"}:
                .: {}
                f:apiVersion: {}
                f:blockOwnerDeletion: {}
                f:controller: {}
                f:kind: {}
                f:name: {}
                f:uid: {}
          f:spec:
            .: {}
            f:bootstrap:
              .: {}
              f:configRef: {}
              f:dataSecretName: {}
            f:clusterName: {}
            f:infrastructureRef: {}
            f:version: {}
          f:status:
            .: {}
            f:bootstrapReady: {}
            f:conditions: {}
            f:lastUpdated: {}
            f:observedGeneration: {}
            f:phase: {}
        manager: manager
        operation: Update
        time: "2023-02-16T11:12:10Z"
      name: clusterctl-upgrade-61xx5y-control-plane-4kvpv
      namespace: clusterctl-upgrade
      ownerReferences:
      - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
        blockOwnerDeletion: true
        controller: true
        kind: KubeadmControlPlane
        name: clusterctl-upgrade-61xx5y-control-plane
        uid: 91a4ee37-bdcb-4455-9350-2058b1728ed3
      resourceVersion: "3392"
      uid: b4c3d76d-fd74-4306-a470-99c3cee7938b
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: clusterctl-upgrade-61xx5y-control-plane-bkrkw
          namespace: clusterctl-upgrade
          uid: 34323a9d-c8f4-428a-b76c-c0440f0f5603
        dataSecretName: clusterctl-upgrade-61xx5y-control-plane-bkrkw
      clusterName: clusterctl-upgrade-61xx5y
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachine
        name: clusterctl-upgrade-61xx5y-control-plane-jxw2w
        namespace: clusterctl-upgrade
        uid: 393302d9-af38-4005-9612-3d721c1f0671
      nodeDeletionTimeout: 10s
      version: v1.26.0
    status:
      bootstrapReady: true
      conditions:
      - lastTransitionTime: "2023-02-16T11:12:16Z"
        message: 1 of 2 completed
        reason: Bootstrapping
        severity: Info
        status: "False"
        type: Ready
      - lastTransitionTime: "2023-02-16T11:12:10Z"
        status: "True"
        type: BootstrapReady
      - lastTransitionTime: "2023-02-16T11:12:16Z"
        message: 1 of 2 completed
        reason: Bootstrapping
        severity: Info
        status: "False"
        type: InfrastructureReady
      - lastTransitionTime: "2023-02-16T11:12:10Z"
        reason: WaitingForNodeRef
        severity: Info
        status: "False"
        type: NodeHealthy
      lastUpdated: "2023-02-16T11:12:10Z"
      observedGeneration: 2
      phase: Provisioning

@sbueringer
Member Author

The diff which leads to a rollout is detected here:

return reflect.DeepEqual(&machineConfig.Spec, kcpConfig)

It's just impossible to tell from the YAML why reflect.DeepEqual detects a difference.
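
(Side note, not part of this PR: a structural diff library such as google/go-cmp can print exactly which field makes reflect.DeepEqual return false, including differences that are invisible in the marshalled YAML, e.g. a nil map versus an empty map. The types below are stand-ins for illustration, not the real KubeadmConfigSpec.)

```go
package main

import (
	"fmt"
	"reflect"

	"github.com/google/go-cmp/cmp"
)

// ConfigSpec is a stand-in for illustration; the real comparison runs on
// Cluster API's KubeadmConfigSpec.
type ConfigSpec struct {
	ClusterName  string          `json:"clusterName"`
	FeatureGates map[string]bool `json:"featureGates,omitempty"`
}

func main() {
	machineConfig := ConfigSpec{ClusterName: "test", FeatureGates: map[string]bool{}}
	kcpConfig := ConfigSpec{ClusterName: "test"} // FeatureGates is nil here.

	// Both specs marshal to identical YAML/JSON (omitempty hides the empty map),
	// but reflect.DeepEqual still reports a difference.
	fmt.Println(reflect.DeepEqual(machineConfig, kcpConfig)) // false

	// cmp.Diff prints the exact field that differs; empty output means equal.
	fmt.Println(cmp.Diff(kcpConfig, machineConfig))
}
```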

@sbueringer
Member Author

Added a lot more logs. Let's see if that's enough to find something.

@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member Author

sbueringer commented Feb 16, 2023

@furkatgofurov7 I found the issue, it's a fun one...

Context:

  • We introduced a new ImagePullPolicy field in KubeadmConfigSpec.InitConfiguration.NodeRegistration and KubeadmConfigSpec.JoinConfiguration.NodeRegistration
  • The ImagePullPolicy field has a default value, which is applied via the OpenAPI schema
  • This defaulting happens as soon as there is any kind of update to KCP's existing KubeadmConfig objects
  • When KCP checks for rollouts, it compares the KubeadmConfigSpec in KCP with the existing KubeadmConfig

Example error case:

  • We install Cluster API v0.3
  • We create the cluster including KCP
  • Cluster API is upgraded to main
  • ImagePullPolicy is defaulted in the KubeadmConfig of the Machine. It is not defaulted in KCP
  • The KCP controller detects a diff during reconcile (InitConfiguration.NodeRegistration.ImagePullPolicy: "IfNotPresent" != "") and triggers a rollout (see the sketch below)
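
A minimal sketch of that comparison, using stand-in types rather than the actual Cluster API structs: the apiserver has defaulted ImagePullPolicy on the Machine's KubeadmConfig, while the spec generated from KCP still carries the zero value, so the rollout check sees a diff.

```go
package main

import (
	"fmt"
	"reflect"
)

// Stand-ins for the relevant slice of KubeadmConfigSpec.
type NodeRegistrationOptions struct {
	ImagePullPolicy string
}

type InitConfiguration struct {
	NodeRegistration NodeRegistrationOptions
}

type KubeadmConfigSpec struct {
	InitConfiguration InitConfiguration
}

func main() {
	// Spec of the existing KubeadmConfig: the new field was defaulted via the
	// OpenAPI schema when the object was updated after the upgrade to main.
	machineConfig := KubeadmConfigSpec{
		InitConfiguration: InitConfiguration{
			NodeRegistration: NodeRegistrationOptions{ImagePullPolicy: "IfNotPresent"},
		},
	}

	// Spec generated in memory from the (not re-defaulted) KCP object.
	kcpConfig := KubeadmConfigSpec{}

	// KCP treats any difference as "the config changed" and rolls out a new Machine.
	fmt.Println(reflect.DeepEqual(&machineConfig, &kcpConfig)) // false => rollout
}
```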

The solution is essentially to apply the default value in a way that doesn't produce a difference when we calculate whether we need a rollout.

Solution:

  • Move the defaulting from OpenAPI to the default func: 998f5f3
  • The default func is called on both KCP and KubeadmConfig before diffing (sketched below)
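
Conceptually, the fix looks roughly like the sketch below, reusing the stand-in types (and imports) from the previous sketch; the helper names are illustrative, not the actual code in 998f5f3. The same defaulting runs over both specs before the comparison, so a field that was only defaulted on one side can no longer produce a spurious diff.

```go
// defaultKubeadmConfigSpec applies the ImagePullPolicy default in code instead
// of relying on the OpenAPI schema (illustrative name and logic).
func defaultKubeadmConfigSpec(spec *KubeadmConfigSpec) {
	if spec.InitConfiguration.NodeRegistration.ImagePullPolicy == "" {
		spec.InitConfiguration.NodeRegistration.ImagePullPolicy = "IfNotPresent"
	}
}

// matchesKubeadmConfig decides whether the Machine's KubeadmConfig still
// matches what KCP would generate, defaulting both sides first.
func matchesKubeadmConfig(machineSpec, kcpSpec KubeadmConfigSpec) bool {
	defaultKubeadmConfigSpec(&machineSpec)
	defaultKubeadmConfigSpec(&kcpSpec)
	return reflect.DeepEqual(&machineSpec, &kcpSpec)
}
```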

@sbueringer
Member Author

sbueringer commented Feb 16, 2023

@killianmuldoon ^^ that's an interesting reason why we can't do all defaulting via the OpenAPI schema 😂

@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-main

1 similar comment
@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-full-main

Found another issue: this time ImagePullPolicy defaulting triggers a rollout via the Cluster topology reconciler (ImagePullPolicy is not defaulted on the existing KubeadmConfigTemplate, but it is on the new/desired one => diff => rollout).

@sbueringer
Member Author

Other flake
/test pull-cluster-api-e2e-full-main

1 similar comment
@sbueringer
Member Author

Other flake
/test pull-cluster-api-e2e-full-main

@sbueringer sbueringer force-pushed the pr-debug-e2e-flake-clusterctl-upgrade branch 2 times, most recently from d6d0c74 to 196d93c on February 17, 2023 at 16:37
@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer sbueringer force-pushed the pr-debug-e2e-flake-clusterctl-upgrade branch from 8ce0ae6 to 27f5fe1 on February 20, 2023 at 05:48
@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

1 similar comment
@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer sbueringer force-pushed the pr-debug-e2e-flake-clusterctl-upgrade branch from 319fddd to fec5bd3 on February 20, 2023 at 14:19
@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@k8s-ci-robot
Contributor

@sbueringer: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                          Commit   Details  Required  Rerun command
pull-cluster-api-apidiff-main      fec5bd3  link     false     /test pull-cluster-api-apidiff-main
pull-cluster-api-test-main         fec5bd3  link     true      /test pull-cluster-api-test-main
pull-cluster-api-test-mink8s-main  fec5bd3  link     true      /test pull-cluster-api-test-mink8s-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

6 similar comments
@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/close

CI should be stable again

@k8s-ci-robot
Contributor

@sbueringer: Closed this PR.

In response to this:

/close

CI should be stable again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
