Clusterctl should enforce provider order during init and upgrade #5327

fabriziopandini · 2021-09-27T12:56:18Z

Detailed Description

While testing the clusterctl v1alpha3 --> v1alpha4 upgrade, we detected flakiness caused by clusterctl upgrade upgrading a provider before upgrading CAPI.

Error: failed to list objects for the "infrastructure.cluster.x-k8s.io/v1alpha4, Kind=AWSClusterControllerIdentity" GroupVersionKind: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=AWSClusterControllerIdentity failed: Post "https://capa-webhook-service.capa-system.svc:443/convert?timeout=30s": dial tcp 10.96.122.159:443: connect: connection refused

The webhook failure was caused by the provider controller trying to use a version of the core types that was not yet installed, due to the upgrade order.

E0927 10:21:26.680285       1 deleg.go:144] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Cluster\" in version \"cluster.x-k8s.io/v1alpha4\""  "kind"={"Group":"cluster.x-k8s.io","Kind":"Cluster"}

This error went undetected in the CAPI test grid only by chance: with no explicit order enforced, the order returned by listing the providers was applied.
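As an illustration of what enforcing the order could look like, here is a minimal Go sketch that moves the core provider to the front of whatever list order comes back, while leaving the relative order of the other providers untouched. The `Provider` and `ProviderType` types and the `SortByUpgradeOrder` function are hypothetical names made up for this example; this is not the code from the actual fix.

```go
package main

import (
	"fmt"
	"sort"
)

// ProviderType is a hypothetical, simplified stand-in for a provider type.
type ProviderType string

const (
	CoreProvider           ProviderType = "CoreProvider"
	BootstrapProvider      ProviderType = "BootstrapProvider"
	ControlPlaneProvider   ProviderType = "ControlPlaneProvider"
	InfrastructureProvider ProviderType = "InfrastructureProvider"
)

// Provider is a hypothetical, simplified provider record.
type Provider struct {
	Name string
	Type ProviderType
}

// priority returns 0 for the core provider and 1 for everything else,
// so the only ordering constraint expressed is "core goes first".
func priority(t ProviderType) int {
	if t == CoreProvider {
		return 0
	}
	return 1
}

// SortByUpgradeOrder returns a copy of providers with the core provider
// first. The sort is stable, so the other providers keep the relative
// order in which they were listed.
func SortByUpgradeOrder(providers []Provider) []Provider {
	sorted := make([]Provider, len(providers))
	copy(sorted, providers)
	sort.SliceStable(sorted, func(i, j int) bool {
		return priority(sorted[i].Type) < priority(sorted[j].Type)
	})
	return sorted
}

func main() {
	// An arbitrary list order, as could be returned without enforcement.
	listed := []Provider{
		{Name: "aws", Type: InfrastructureProvider},
		{Name: "kubeadm", Type: BootstrapProvider},
		{Name: "cluster-api", Type: CoreProvider},
		{Name: "kubeadm", Type: ControlPlaneProvider},
	}
	for _, p := range SortByUpgradeOrder(listed) {
		fmt.Printf("%s (%s)\n", p.Name, p.Type)
	}
}
```

A stable sort is used so that the sketch only adds the one constraint described in this issue (CAPI before any other provider) and does not invent an order among the remaining providers.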

While investigating this issue, we also identified some problems in the CAPA E2E tests, which are now being addressed.
However, it would be great to have more coverage for providers as well, which is something we will get for CAPA, CAPV, and CAPZ as part of the v1beta1 activities.

Anything else you would like to add:

A way to recover from this problem is to manually apply the core-provider manifests for the target CAPI version and then restart the upgrade.

The fix has merged in main and has already been backported to the release-0.4 branch; this will require a release ASAP.

/kind bug


fabriziopandini · 2021-09-27T12:56:57Z

fix #5321
backport #5327

/close

k8s-ci-robot · 2021-09-27T12:57:07Z

@fabriziopandini: Closing this issue.

In response to this:

fix #5321
backport #5327

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fabriziopandini · 2021-09-27T12:57:28Z

/milestone v1.0

fabriziopandini · 2021-09-27T12:57:42Z

/kind release-blocking

sbueringer · 2021-09-27T14:41:56Z

Some additional context:

We actually also had this issue with CAPD, but it just never surfaced. In the case of CAPD, the CAPD controller was upgraded first and then all the others, and it recovered automatically after the other providers had been deployed.

The difference with CAPA is that CAPA has a cluster-wide resource called AWSClusterControllerIdentity, and an instance of it already existed at the time of the upgrade call.

So what actually happened in the CAPA upgrade case:

1. The CAPA provider was upgraded.
2. CAPA didn't come up: the controller couldn't start because the new versions of the CAPI resources weren't deployed yet.
3. The upgrade of the next provider (I think it was probably kubeadm bootstrap) failed, because:
   - during deletion of the kubeadm bootstrap provider we call p.Proxy.ListResources
   - p.Proxy.ListResources lists all resources:
     - for namespaced resources, only in the provider namespace
     - for cluster-wide resources, cluster-wide
   - so it called list on the AWSClusterControllerIdentity resource, and because there actually was an instance of this resource, the conversion webhook was called
   - the conversion webhook failed because the CAPA provider was down
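
The scoping rule described above can be sketched in Go as follows. This is only an illustration of the rule as stated in this comment, not the real p.Proxy.ListResources implementation; the `ResourceKind` type, the `ListNamespace` function, and the namespace string used in `main` are assumptions made up for the example.

```go
package main

import "fmt"

// ResourceKind is a hypothetical description of a resource type.
type ResourceKind struct {
	Kind       string
	Namespaced bool
}

// ListNamespace returns the namespace a list call would be restricted to:
// the provider namespace for namespaced kinds, or "" (meaning all of the
// cluster) for cluster-wide kinds.
func ListNamespace(r ResourceKind, providerNamespace string) string {
	if r.Namespaced {
		return providerNamespace
	}
	return ""
}

func main() {
	// Illustrative namespace of the provider being deleted (an assumption).
	ns := "capi-kubeadm-bootstrap-system"

	kinds := []ResourceKind{
		{Kind: "KubeadmConfig", Namespaced: true},
		{Kind: "AWSClusterControllerIdentity", Namespaced: false},
	}
	for _, k := range kinds {
		scope := ListNamespace(k, ns)
		if scope == "" {
			scope = "<cluster-wide>"
		}
		fmt.Printf("list %s in %s\n", k.Kind, scope)
	}
}
```

Because a cluster-wide kind is not filtered by the provider namespace, deleting one provider can end up listing a resource owned by another provider, which is how the CAPA conversion webhook got involved here.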

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 27, 2021

k8s-ci-robot closed this as completed Sep 27, 2021

k8s-ci-robot added this to the v1.0 milestone Sep 27, 2021

k8s-ci-robot added the kind/release-blocking Issues or PRs that need to be closed before the next CAPI release label Sep 27, 2021

randomvariable mentioned this issue Sep 27, 2021

Make e2e v1alpha3 to v1alpha4 upgrade actually work kubernetes-sigs/cluster-api-provider-aws#2805

Closed

vincepri mentioned this issue Sep 30, 2021

🌱 Source should retry to get informers until timeout expires kubernetes-sigs/controller-runtime#1678

Merged
