
🌱 [WIP] [DNR] Reproduce clusterctl upgrade e2e test flake #8120

Conversation

sbueringer
Member

What this PR does / why we need it:
Currently contains:

  • Improve KCP logging
  • Improve e2e test assertion
  • Increase timeout for which Machines have to remain stable

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 16, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from sbueringer. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sbueringer
Member Author

cc @furkatgofurov7

I looked into the test artifacts of a failed test. It looks like KCP rolls out a new Machine for some reason after the upgrade to main. I improved the logging in KCP and in the e2e test. That should give us more data as soon as we hit the flake again.

@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

1 similar comment
@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@furkatgofurov7
Member

cc @furkatgofurov7

I looked into the test artifacts of a failed test. It looks like KCP rolls out a new Machine for some reason after the upgrade to main. I improved the logging in KCP and in the e2e test. That should give us more data as soon as we hit the flake again

Hey @sbueringer, thanks a lot. I was testing just the timeout increase in #8119 and it seems to pass several runs in a row. But I agree with the above: it is most likely a rollout issue rather than a timeout issue.

@sbueringer
Member Author

My theory is that it's a race condition that leaves the objects in a state in which KCP triggers a rollout after the upgrade.

@sbueringer
Member Author

An unrelated test case failed.
/test pull-cluster-api-e2e-full-main

@furkatgofurov7
Member

Okay, so we detected a rollout, as per the logs:

 INFO: Rollout detected
  INFO: Detected new machines
  INFO: New machine clusterctl-upgrade/clusterctl-upgrade-61xx5y-control-plane-4kvpv:
  Object:
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Machine
    metadata:
      annotations:
        controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{"certSANs":["localhost","127.0.0.1","0.0.0.0","host.docker.internal"]},"controllerManager":{"extraArgs":{"enable-hostpath-provisioner":"true"}},"scheduler":{},"dns":{}}'
      creationTimestamp: "2023-02-16T11:12:10Z"
      finalizers:
      - machine.cluster.x-k8s.io
      generation: 2
      labels:
        cluster.x-k8s.io/cluster-name: clusterctl-upgrade-61xx5y
        cluster.x-k8s.io/control-plane: ""
        cluster.x-k8s.io/control-plane-name: clusterctl-upgrade-61xx5y-control-plane
      managedFields:
      - apiVersion: cluster.x-k8s.io/v1beta1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: {}
            f:finalizers:
              .: {}
              v:"machine.cluster.x-k8s.io": {}
            f:labels:
              .: {}
              f:cluster.x-k8s.io/cluster-name: {}
              f:cluster.x-k8s.io/control-plane: {}
              f:cluster.x-k8s.io/control-plane-name: {}
            f:ownerReferences:
              .: {}
              k:{"uid":"91a4ee37-bdcb-4455-9350-2058b1728ed3"}:
                .: {}
                f:apiVersion: {}
                f:blockOwnerDeletion: {}
                f:controller: {}
                f:kind: {}
                f:name: {}
                f:uid: {}
          f:spec:
            .: {}
            f:bootstrap:
              .: {}
              f:configRef: {}
              f:dataSecretName: {}
            f:clusterName: {}
            f:infrastructureRef: {}
            f:version: {}
          f:status:
            .: {}
            f:bootstrapReady: {}
            f:conditions: {}
            f:lastUpdated: {}
            f:observedGeneration: {}
            f:phase: {}
        manager: manager
        operation: Update
        time: "2023-02-16T11:12:10Z"
      name: clusterctl-upgrade-61xx5y-control-plane-4kvpv
      namespace: clusterctl-upgrade
      ownerReferences:
      - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
        blockOwnerDeletion: true
        controller: true
        kind: KubeadmControlPlane
        name: clusterctl-upgrade-61xx5y-control-plane
        uid: 91a4ee37-bdcb-4455-9350-2058b1728ed3
      resourceVersion: "3392"
      uid: b4c3d76d-fd74-4306-a470-99c3cee7938b
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: clusterctl-upgrade-61xx5y-control-plane-bkrkw
          namespace: clusterctl-upgrade
          uid: 34323a9d-c8f4-428a-b76c-c0440f0f5603
        dataSecretName: clusterctl-upgrade-61xx5y-control-plane-bkrkw
      clusterName: clusterctl-upgrade-61xx5y
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachine
        name: clusterctl-upgrade-61xx5y-control-plane-jxw2w
        namespace: clusterctl-upgrade
        uid: 393302d9-af38-4005-9612-3d721c1f0671
      nodeDeletionTimeout: 10s
      version: v1.26.0
    status:
      bootstrapReady: true
      conditions:
      - lastTransitionTime: "2023-02-16T11:12:16Z"
        message: 1 of 2 completed
        reason: Bootstrapping
        severity: Info
        status: "False"
        type: Ready
      - lastTransitionTime: "2023-02-16T11:12:10Z"
        status: "True"
        type: BootstrapReady
      - lastTransitionTime: "2023-02-16T11:12:16Z"
        message: 1 of 2 completed
        reason: Bootstrapping
        severity: Info
        status: "False"
        type: InfrastructureReady
      - lastTransitionTime: "2023-02-16T11:12:10Z"
        reason: WaitingForNodeRef
        severity: Info
        status: "False"
        type: NodeHealthy
      lastUpdated: "2023-02-16T11:12:10Z"
      observedGeneration: 2
      phase: Provisioning

@sbueringer
Member Author

The diff which leads to a rollout is detected here:

return reflect.DeepEqual(&machineConfig.Spec, kcpConfig)

It's just impossible to tell from the YAML why reflect.DeepEqual detects a difference.
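
(Side note, not part of this PR: a structural diff library such as google/go-cmp can print exactly which field makes reflect.DeepEqual return false, including differences that are invisible in the marshalled YAML, e.g. a nil map versus an empty map. The types below are stand-ins for illustration, not the real KubeadmConfigSpec.)

```go
package main

import (
	"fmt"
	"reflect"

	"github.com/google/go-cmp/cmp"
)

// ConfigSpec is a stand-in for illustration; the real comparison runs on
// Cluster API's KubeadmConfigSpec.
type ConfigSpec struct {
	ClusterName  string          `json:"clusterName"`
	FeatureGates map[string]bool `json:"featureGates,omitempty"`
}

func main() {
	machineConfig := ConfigSpec{ClusterName: "test", FeatureGates: map[string]bool{}}
	kcpConfig := ConfigSpec{ClusterName: "test"} // FeatureGates is nil here.

	// Both specs marshal to identical YAML/JSON (omitempty hides the empty map),
	// but reflect.DeepEqual still reports a difference.
	fmt.Println(reflect.DeepEqual(machineConfig, kcpConfig)) // false

	// cmp.Diff prints the exact field that differs; empty output means equal.
	fmt.Println(cmp.Diff(kcpConfig, machineConfig))
}
```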

@sbueringer
Member Author

Added a lot more logs. Let's see if that's enough to find something.

@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member Author

sbueringer commented Feb 16, 2023

@furkatgofurov7 I found the issue, it's a fun one...

Context:

  • We introduced a new ImagePullPolicy field in KubeadmConfigSpec.InitConfiguration.NodeRegistration and KubeadmConfigSpec.JoinConfiguration.NodeRegistration
  • The ImagePullPolicy field has a default value, which is applied via the OpenAPI schema
  • This defaulting happens as soon as there is any kind of update to KCP's existing KubeadmConfig objects
  • When KCP checks for rollouts, it compares the KubeadmConfigSpec in KCP with the existing KubeadmConfig

Example error case:

  • We install Cluster API v0.3
  • We create the cluster including KCP
  • Cluster API is upgraded to main
  • ImagePullPolicy is defaulted in the KubeadmConfig of the Machine. It is not defaulted in KCP
  • The KCP controller detects a diff during reconcile (InitConfiguration.NodeRegistration.ImagePullPolicy: "IfNotPresent" != "") and triggers a rollout (see the sketch below)
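
A minimal sketch of that comparison, using stand-in types rather than the actual Cluster API structs: the apiserver has defaulted ImagePullPolicy on the Machine's KubeadmConfig, while the spec generated from KCP still carries the zero value, so the rollout check sees a diff.

```go
package main

import (
	"fmt"
	"reflect"
)

// Stand-ins for the relevant slice of KubeadmConfigSpec.
type NodeRegistrationOptions struct {
	ImagePullPolicy string
}

type InitConfiguration struct {
	NodeRegistration NodeRegistrationOptions
}

type KubeadmConfigSpec struct {
	InitConfiguration InitConfiguration
}

func main() {
	// Spec of the existing KubeadmConfig: the new field was defaulted via the
	// OpenAPI schema when the object was updated after the upgrade to main.
	machineConfig := KubeadmConfigSpec{
		InitConfiguration: InitConfiguration{
			NodeRegistration: NodeRegistrationOptions{ImagePullPolicy: "IfNotPresent"},
		},
	}

	// Spec generated in memory from the (not re-defaulted) KCP object.
	kcpConfig := KubeadmConfigSpec{}

	// KCP treats any difference as "the config changed" and rolls out a new Machine.
	fmt.Println(reflect.DeepEqual(&machineConfig, &kcpConfig)) // false => rollout
}
```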

The solution is essentially to apply the default value in a way that doesn't produce a difference when we calculate whether we need a rollout.

Solution:

  • Move the defaulting from OpenAPI to the default func: 998f5f3
  • The default func is called on both KCP and KubeadmConfig before diffing (sketched below)
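
Conceptually, the fix looks roughly like the sketch below, reusing the stand-in types (and imports) from the previous sketch; the helper names are illustrative, not the actual code in 998f5f3. The same defaulting runs over both specs before the comparison, so a field that was only defaulted on one side can no longer produce a spurious diff.

```go
// defaultKubeadmConfigSpec applies the ImagePullPolicy default in code instead
// of relying on the OpenAPI schema (illustrative name and logic).
func defaultKubeadmConfigSpec(spec *KubeadmConfigSpec) {
	if spec.InitConfiguration.NodeRegistration.ImagePullPolicy == "" {
		spec.InitConfiguration.NodeRegistration.ImagePullPolicy = "IfNotPresent"
	}
}

// matchesKubeadmConfig decides whether the Machine's KubeadmConfig still
// matches what KCP would generate, defaulting both sides first.
func matchesKubeadmConfig(machineSpec, kcpSpec KubeadmConfigSpec) bool {
	defaultKubeadmConfigSpec(&machineSpec)
	defaultKubeadmConfigSpec(&kcpSpec)
	return reflect.DeepEqual(&machineSpec, &kcpSpec)
}
```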

@sbueringer
Member Author

sbueringer commented Feb 16, 2023

@killianmuldoon ^^ that's an interesting reason why we can't do all defaulting via the OpenAPI schema 😂

@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-main

1 similar comment
@sbueringer
Member Author

/test pull-cluster-api-e2e-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-full-main

Found another issue: this time ImagePullPolicy defaulting triggers a rollout via the Cluster topology reconciler (ImagePullPolicy is not defaulted on the existing KubeadmConfigTemplate, but it is on the new/desired one => diff => rollout).

@sbueringer
Member Author

Other flake
/test pull-cluster-api-e2e-full-main

1 similar comment
@sbueringer
Member Author

Other flake
/test pull-cluster-api-e2e-full-main

@sbueringer sbueringer force-pushed the pr-debug-e2e-flake-clusterctl-upgrade branch 2 times, most recently from d6d0c74 to 196d93c on February 17, 2023 at 16:37
@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer sbueringer force-pushed the pr-debug-e2e-flake-clusterctl-upgrade branch from 8ce0ae6 to 27f5fe1 on February 20, 2023 at 05:48
@sbueringer
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

1 similar comment
@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer sbueringer force-pushed the pr-debug-e2e-flake-clusterctl-upgrade branch from 319fddd to fec5bd3 on February 20, 2023 at 14:19
@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@k8s-ci-robot
Contributor

@sbueringer: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                          Commit   Details  Required  Rerun command
pull-cluster-api-apidiff-main      fec5bd3  link     false     /test pull-cluster-api-apidiff-main
pull-cluster-api-test-main         fec5bd3  link     true      /test pull-cluster-api-test-main
pull-cluster-api-test-mink8s-main  fec5bd3  link     true      /test pull-cluster-api-test-mink8s-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

6 similar comments
@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/test pull-cluster-api-e2e-informing-main

@sbueringer
Member Author

/close

CI should be stable again

@k8s-ci-robot
Contributor

@sbueringer: Closed this PR.

In response to this:

/close

CI should be stable again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
