
upgrade-pod keeps restarting even though node is successfully upgraded #72

Open

Kun483 opened this issue Oct 23, 2024 · 12 comments · May be fixed by #73
@Kun483

Kun483 commented Oct 23, 2024

What steps did you take and what happened:
I deployed a cluster with 3 control plane (CP) nodes and 2 workers on v1.27, then triggered the InPlaceUpgrade. The upgrade-pod was created in the default namespace on the 1st CP node. That CP node was successfully upgraded to v1.28, and the corresponding Machine was updated as well. After that, however, upgrade-pod kept cycling between Completed and CrashLoopBackOff with more than 20 restarts. Describing upgrade-pod gives the event below:

Type     Reason   Age                  From     Message
----     ------   ----                 ----     -------
Warning  BackOff  84s (x347 over 76m) kubelet  Back-off restarting failed container upgrade in pod upgrade-pod_default

The pod logs show:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   200  100   155  100    45   1393    404 --:--:-- --:--:-- --:--:--  1801
Stream closed EOF for default/upgrade-pod (upgrade)

What did you expect to happen:
After the first CP node got upgraded to the desired version, upgrade-pod should be deleted from that CP node and re-deployed to the next node.

Anything else you would like to add:
I think the code that deletes upgrade-pod never gets executed during InPlaceUpgrade (see controllers/reconcile.go at ac3d9e3e8da1b9eb9db0424e355018e4b8faa1b6 in canonical/cluster-api-control-plane-provider-microk8s). The controller logs end in a panic; a hedged sketch of the kind of nil guard that would avoid it follows the trace below:

1.7297246802853289e+09	INFO	Waiting for upgrade node to be updated to the given version...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane"}
1.7297246902987778e+09	INFO	Now updating machine kun-microk8s-cp-9vfgb version to v1.28.0...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane"}
1.7297247003208792e+09	INFO	attempting to set control plane status	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane"}
1.7297247003444061e+09	INFO	ready replicas	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane"}
1.7297247003566322e+09	INFO	successfully updated control plane status	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane"}
1.729724700356677e+09	INFO	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x146c42e]
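A nil pointer dereference like this usually comes from dereferencing an optional field without a guard; in the Cluster API types, for example, a Machine's spec.version and status.nodeRef are both pointers that can be nil while a machine is still being updated. Below is a minimal, purely illustrative sketch of such a guard around the pod-cleanup step. cleanupUpgradePods and deleteUpgradePod are hypothetical names, not the provider's actual functions:

package sketch

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// deleteUpgradePod stands in for whatever actually removes upgrade-pod from a node.
func deleteUpgradePod(ctx context.Context, nodeName string) error { return nil }

// cleanupUpgradePods shows the kind of nil checks that prevent the panic above:
// Machine.Spec.Version and Machine.Status.NodeRef are pointers and can be nil
// right after an in-place upgrade, so they must be checked before dereferencing.
func cleanupUpgradePods(ctx context.Context, machines []clusterv1.Machine, desired string) error {
	for _, m := range machines {
		if m.Spec.Version == nil || m.Status.NodeRef == nil {
			continue // not populated yet; let the next reconcile handle it
		}
		if *m.Spec.Version != desired {
			continue // this machine has not reached the target version
		}
		if err := deleteUpgradePod(ctx, m.Status.NodeRef.Name); err != nil {
			return err
		}
	}
	return nil
}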

Env:
MicroK8s Control Plane and Bootstrap provider: v0.6.10
Upgrade from v1.27 to v1.28
Infra: MaaS bare metal for both CP and worker nodes
OS: Ubuntu
UpgradeStrategy: InPlaceUpgrade

@HomayoonAlimohammadi

Hi @Kun483! Thanks for reporting this. Could you share a broader set of logs from the control plane provider? We'll start looking into this and report any further findings here.

@HomayoonAlimohammadi

Here are the logs from the upgrade pod. I selected a few highlights to make sure no sensitive information is included:

time="2024-10-24T08:12:24Z" level=info msg="Fetching manifest of service palette and version '4.6.0-dev' for action apply"
time="2024-10-24T08:12:24Z" level=error msg="Failed to fetch manifest for service 'palette' and version 4.6.0-dev for action 'apply'.  &{Code:RequestError Details:<nil> Message:Get Request failure for 'https://scar-dev.dev.spectrocloud.com/roar/palette/4.6/4.6.0-dev/apply/manifest.yaml' with msg GET with response '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message>
time="2024-10-24T08:12:24Z" level=error msg="Failed to get manifests  &{Code:RequestError Details:<nil> Message:Get Request failure for 'https://scar-dev.dev.spectrocloud.com/roar/palette/4.6/4.6.0-dev/apply/manifest.yaml' with msg GET with response '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message>
time="2024-10-24T08:12:24Z" level=error msg="Failed to get backup manifest  &{Code:RequestError Details:<nil> Message:Get Request failure for 'https://scar-dev.dev.spectrocloud.com/roar/palette/4.6/4.6.0-dev/apply/manifest.yaml' with msg GET with response '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message>
time="2024-10-24T08:12:24Z" level=error msg="Failed to get and persist manifests  &{Code:RequestError Details:<nil> Message:Get Request failure for 'https://scar-dev.dev.spectrocloud.com/roar/palette/4.6/4.6.0-dev/apply/manifest.yaml' with msg GET with response '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message>
panic:  &{Code:RequestError Details:<nil> Message:Get Request failure for 'https://scar-dev.dev.spectrocloud.com/roar/palette/4.6/4.6.0-dev/apply/manifest.yaml' with msg GET with response '<?xml version="1.0" encoding="UTF-8"?>
        <Error><Code>AccessDenied</Code><Message>Access Denied</Message>

goroutine 1 [running]:
main.handleErr({0x2cb2500, 0xc0000bf160})
        /workspace/services/upgrader/cmd/main.go:52 +0x53
main.main()
        /workspace/services/upgrader/cmd/main.go:29 +0x1c5

Looks like it fails to fetch the manifest from the given endpoint?

@sadysnaat

@HomayoonAlimohammadi I think you should be checking the logs from the pod named upgrade-pod, not the upgrade pod. The former is created by Canonical for the InPlaceUpgrade; the upgrade pod is internal to SpectroCloud.

Please let me know if any help is needed to get the correct pod logs.

@HomayoonAlimohammadi

@sadysnaat Thanks for correcting me. You're right. I'm going to try to reproduce this issue.

@HomayoonAlimohammadi

I tried an in-place upgrade on OB40 with MicroK8s v1.27 and 3 control plane nodes.
It finished successfully (MicroK8s 1.27 -> 1.28).
To perform the upgrade, I added upgradeStrategy: "InPlaceUpgrade" to the MicroK8sControlPlane object and changed version: v1.27.0 to version: v1.28.0.
Here are the logs:

1.729762426624453e+09	INFO	Creating upgrade pod on lxdvm06...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297624266963172e+09	INFO	Waiting for upgrade node to be updated to the given version...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297627373486044e+09	INFO	Now updating machine hue-capi-cluster-control-plane-x6zjg version to v1.28.0...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297627473781958e+09	INFO	Removing upgrade pod upgrade-pod from lxdvm06...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297627574792736e+09	INFO	Upgrade of node lxdvm06 completed.
	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297627574895413e+09	INFO	Creating upgrade pod on lxdvm00...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297627575398433e+09	INFO	Waiting for upgrade node to be updated to the given version...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297628376370156e+09	INFO	Now updating machine hue-capi-cluster-control-plane-whb75 version to v1.28.0...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.729762847720543e+09	INFO	Removing upgrade pod upgrade-pod from lxdvm00...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297628577571416e+09	INFO	Upgrade of node lxdvm00 completed.
	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297628577647007e+09	INFO	Creating upgrade pod on lxdvm02...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297628577974675e+09	INFO	Waiting for upgrade node to be updated to the given version...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.729762919521166e+09	INFO	Now updating machine hue-capi-cluster-control-plane-zx9sq version to v1.28.0...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297629295340488e+09	INFO	Removing upgrade pod upgrade-pod from lxdvm02...	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}
1.7297629396371915e+09	INFO	Upgrade of node lxdvm02 completed.
	{"controller": "microk8scontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "MicroK8sControlPlane", "MicroK8sControlPlane": {"name":"hue-capi-cluster-control-plane","namespace":"default"}, "namespace": "default", "name": "hue-capi-cluster-control-plane", "reconcileID": "4275f607-a069-4fd7-a30b-5d5a92cb8da5", "cluster": "hue-capi-cluster", "desired": 3, "existing": 3}

Some further details:

  • The management cluster is running on node06 (inspect with k8s kubectl get pods -A)
  • The workload machines are lxdvm00, lxdvm02 and lxdvm06
  • The control plane provider is still running in the management cluster (node06). Please feel free to retrieve and analyze its logs if needed with k8s kubectl logs capi-microk8s-control-plane-controller-manager-6cb6fdd755-md4dm -n capi-microk8s-control-plane-system

@JPedro2

JPedro2 commented Oct 24, 2024

@HomayoonAlimohammadi Our environment is fully air-gapped, with a total of 5 bare-metal nodes and no LXD involved, and uses disableDefaultCNI=true with Cilium v1.15.3 installed as a Helm chart. Can you please test under the same conditions?

Below are the MicroK8s pre-loaded images and the pre/postRunCommands that we use in that environment:

tracking:     1.28/stable
channels:
  1.28/stable:    v1.28.14 2024-09-24 (7231) 186MB classic

images:
  - image: docker.io/cdkbot/hostpath-provisioner:1.4.2
  - image: docker.io/coredns/coredns:1.10.1
  - image: docker.io/busybox:1.28.4
  - image: registry.k8s.io/metrics-server/metrics-server:v0.6.3
  - image: registry.k8s.io/pause:3.7
  - image: docker.io/curlimages/curl:7.87.0
microk8sConfig:
  addons:
    - dns
  upgradeStrategy: InPlaceUpgrade
  snapstoreProxyScheme: "https"
  snapstoreProxyDomain: "<redacted>"
  snapstoreProxyId: "<redacted>"
  preRunCommands:
    - |
      # import certificate if using snap store proxy with https and using private CA/cert
      cat > /usr/local/share/ca-certificates/pcr_ca.crt << EOF
      -----BEGIN CERTIFICATE-----
      <redacted>
      -----END CERTIFICATE-----
      EOF
      sudo update-ca-certificates

  postRunCommands:
    - |
      mkdir -p /var/snap/microk8s/current/args/certs.d/custom_ca
      cat > /var/snap/microk8s/current/args/certs.d/custom_ca/ca.crt << EOF
      -----BEGIN CERTIFICATE-----
      <redacted>
      -----END CERTIFICATE-----
      EOF

      # create registry mirror for docker.io
      cat <<EOF > /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml
      server = "<redacted>"
      [host."<redacted>"]
      capabilities = ["pull", "resolve"]
      override_path = true
      ca = "/var/snap/microk8s/current/args/certs.d/custom_ca/ca.crt"
      EOF

      # Repeat this for each additional registry. No need to create each file specifying its content, this will be cleaner
      # Create registry mirror for gcr.io
      mkdir -p /var/snap/microk8s/current/args/certs.d/gcr.io/
      cp /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml /var/snap/microk8s/current/args/certs.d/gcr.io/hosts.toml

      # create registry mirror for ghcr.io
      mkdir -p /var/snap/microk8s/current/args/certs.d/ghcr.io/
      cp /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml /var/snap/microk8s/current/args/certs.d/ghcr.io/hosts.toml

      # Create registry mirror for k8s.gcr.io
      mkdir -p /var/snap/microk8s/current/args/certs.d/k8s.gcr.io/
      cp /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml /var/snap/microk8s/current/args/certs.d/k8s.gcr.io/hosts.toml

      # Create registry mirror for registry.k8s.io
      mkdir -p /var/snap/microk8s/current/args/certs.d/registry.k8s.io/
      cp /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml /var/snap/microk8s/current/args/certs.d/registry.k8s.io/hosts.toml

      # Create registry mirror for quay.io
      mkdir -p /var/snap/microk8s/current/args/certs.d/quay.io/
      cp /var/snap/microk8s/current/args/certs.d/docker.io/hosts.toml /var/snap/microk8s/current/args/certs.d/quay.io/hosts.toml

@Kun483
Author

Kun483 commented Oct 24, 2024

@HomayoonAlimohammadi
These are the logs from an AWS InPlaceUpgrade on a 1 CP cluster, which also leaves upgrade-pod restarting.
aws_inplace.txt

@HomayoonAlimohammadi

@Kun483 Thanks for attaching the logs. Not handling an already created upgrade-pod looks like an issue on our side, and we'll start working on a fix for it. The crash loop of the upgrade-pod itself, however, does not seem to be directly related to our CAPI controllers. WDYT?
The upgrade-pod only runs a simple curl: https://github.com/canonical/cluster-api-control-plane-provider-microk8s/blob/main/controllers/reconcile.go#L714
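For context on that curl: it appears to hit the local snapd API to refresh the microk8s snap. When the snap is already on the requested channel/revision, snapd answers HTTP 400 with kind "snap-no-update-available". A rough, hypothetical approximation of such a request (the exact command, endpoint, and channel used by the provider may differ):

# Hypothetical approximation: ask snapd (via its unix socket) to refresh the
# microk8s snap to the target channel. If the snap is already on that revision,
# snapd replies with HTTP 400 and kind "snap-no-update-available", i.e. the
# node is effectively already upgraded.
curl -s --unix-socket /run/snapd.socket \
  -H "Content-Type: application/json" \
  -X POST http://localhost/v2/snaps/microk8s \
  -d '{"action":"refresh","channel":"1.28/stable"}'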

@HomayoonAlimohammadi

@Kun483 Would you please provide me with more info/logs of the upgrade-pod? Maybe using --previous?
I'll let you know when the "not handling already created upgrade-pod" issue is fixed.

@Kun483
Author

Kun483 commented Oct 25, 2024

➜  Downloads kubectl logs upgrade-pod --previous
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   200  100   155  100    45    192     55 --:--:-- --:--:-- --:--:--   247
{"type":"error","status-code":400,"status":"Bad Request","result":{"message":"snap has no updates available","kind":"snap-no-update-available","value":""}}

@HomayoonAlimohammadi

Thanks a lot @Kun483! I guess this means that the upgrade was successful and the upgrade-pod can be safely discarded. Will let you know when we have a fix for this.
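For reference, that body is snapd's standard error envelope. A purely illustrative sketch of how a caller could treat this particular error kind as "already up to date" rather than as a failure (this is not the provider's actual handling, just a sketch):

package sketch

import "encoding/json"

// snapdError mirrors the shape of the response body shown above.
type snapdError struct {
	Type       string `json:"type"`
	StatusCode int    `json:"status-code"`
	Result     struct {
		Message string `json:"message"`
		Kind    string `json:"kind"`
	} `json:"result"`
}

// alreadyUpToDate returns true when snapd reports there is nothing to refresh,
// which a caller could treat as "upgrade already done" instead of a failure.
func alreadyUpToDate(body []byte) bool {
	var e snapdError
	if err := json.Unmarshal(body, &e); err != nil {
		return false
	}
	return e.Type == "error" && e.Result.Kind == "snap-no-update-available"
}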

@Kun483
Author

Kun483 commented Oct 28, 2024

@HomayoonAlimohammadi Thanks in advance! I want to share another error regarding the HA cluster InPlaceUpgrade in MaaS.

Observations: when the InPlaceUpgrade is triggered, upgrade-pod is scheduled on the 1st CP node and the API server goes down temporarily (connect: connection refused). The node version and the Machine version are then upgraded. After that, however, removing upgrade-pod results in a nil pointer dereference error. In the next reconcile, microk8s-control-plane-controller-manager only checks whether the first-created machine has the same major and minor version as the MicroK8sControlPlane; it does not check the other machines' versions, and it does not delete upgrade-pod on the first-created node.
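On the last point, a minimal, purely illustrative sketch of checking every machine's major.minor against the target version, rather than only the first-created machine (the helper name and package are assumptions, not the provider's code):

package sketch

import (
	"golang.org/x/mod/semver"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// allMachinesUpgraded reports whether every control plane machine, not just the
// first-created one, already matches the target major.minor (e.g. v1.28).
func allMachinesUpgraded(machines []clusterv1.Machine, target string) bool {
	for _, m := range machines {
		if m.Spec.Version == nil {
			return false
		}
		if semver.MajorMinor(*m.Spec.Version) != semver.MajorMinor(target) {
			return false
		}
	}
	return true
}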

maas_HA_Inplace_new.txt

The logs of upgrade-pod still show that the upgrade was successful:

➜  Downloads k logs upgrade-pod --previous
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   200  100   155  100    45   1463    424 --:--:-- --:--:-- --:--:--  1904
{"type":"error","status-code":400,"status":"Bad Request","result":{"message":"snap has no updates available","kind":"snap-no-update-available","value":""}}%

@HomayoonAlimohammadi linked a pull request (#73) on Oct 31, 2024 that will close this issue