Clusterctl should enforce provider order during init and upgrade #5327

Closed
fabriziopandini opened this issue Sep 27, 2021 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/release-blocking Issues or PRs that need to be closed before the next CAPI release

@fabriziopandini
Member

Detailed Description

While testing the clusterctl v1alpha3 --> v1alpha4 upgrade, we detected flakiness caused by clusterctl upgrade upgrading a provider before upgrading CAPI.

Error: failed to list objects for the "infrastructure.cluster.x-k8s.io/v1alpha4, Kind=AWSClusterControllerIdentity" GroupVersionKind: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=AWSClusterControllerIdentity failed: Post "https://capa-webhook-service.capa-system.svc:443/convert?timeout=30s": dial tcp 10.96.122.159:443: connect: connection refused

The webhook failure was caused by the provider controller trying to use a version of the core types that was not yet installed because of the upgrade order.

E0927 10:21:26.680285       1 deleg.go:144] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Cluster\" in version \"cluster.x-k8s.io/v1alpha4\""  "kind"={"Group":"cluster.x-k8s.io","Kind":"Cluster"}

This error went undetected in the CAPI test grid only by chance: with no explicit order enforced, providers were upgraded in whatever order the provider list call returned them.
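
To illustrate what enforcing an order could look like, here is a minimal sketch that sorts a provider list so that the core provider always comes first. The `Provider` and `ProviderType` types, the `SortForUpgrade` helper, and the relative order of the non-core types are assumptions made for this example; this is not the code from the actual fix.

```go
package main

import (
	"fmt"
	"sort"
)

// ProviderType is a hypothetical stand-in for the provider categories.
type ProviderType string

const (
	CoreProvider           ProviderType = "CoreProvider"
	BootstrapProvider      ProviderType = "BootstrapProvider"
	ControlPlaneProvider   ProviderType = "ControlPlaneProvider"
	InfrastructureProvider ProviderType = "InfrastructureProvider"
)

// Provider is a hypothetical, simplified view of an installed provider.
type Provider struct {
	Name string
	Type ProviderType
}

// priority maps a provider type to its position in the upgrade order.
// Only "core first" comes from the issue; the rest is an assumption.
func priority(t ProviderType) int {
	switch t {
	case CoreProvider:
		return 0
	case BootstrapProvider:
		return 1
	case ControlPlaneProvider:
		return 2
	case InfrastructureProvider:
		return 3
	default:
		return 4
	}
}

// SortForUpgrade returns a copy of providers ordered by priority.
// The sort is stable, so providers of the same type keep their listed order.
func SortForUpgrade(providers []Provider) []Provider {
	sorted := make([]Provider, len(providers))
	copy(sorted, providers)
	sort.SliceStable(sorted, func(i, j int) bool {
		return priority(sorted[i].Type) < priority(sorted[j].Type)
	})
	return sorted
}

func main() {
	// An arbitrary order, such as a list call might return.
	listed := []Provider{
		{Name: "aws", Type: InfrastructureProvider},
		{Name: "kubeadm", Type: BootstrapProvider},
		{Name: "cluster-api", Type: CoreProvider},
		{Name: "kubeadm", Type: ControlPlaneProvider},
	}
	for _, p := range SortForUpgrade(listed) {
		fmt.Printf("%s (%s)\n", p.Name, p.Type)
	}
}
```

With an explicit sort like this, the outcome no longer depends on the order in which the list call happens to return providers.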

While investigating this issue, we also identified some problems in the CAPA E2E tests, which are now being addressed.
However, it would be great to have more upgrade coverage on providers as well, which is something we will get for CAPA, CAPV, and CAPZ as part of the v1beta1 activities.

Anything else you would like to add:

A way to recover from this problem is to manually apply the core-provider manifests for the target CAPI version and then restart the upgrade.

The fix has been merged into main and has already been backported to the release-0.4 branch; this will require a release ASAP.

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 27, 2021
@fabriziopandini
Member Author

fix #5321
backport #5327

/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.


@fabriziopandini
Member Author

/milestone v1.0

@k8s-ci-robot k8s-ci-robot added this to the v1.0 milestone Sep 27, 2021
@fabriziopandini
Member Author

/kind release-blocking

@k8s-ci-robot k8s-ci-robot added the kind/release-blocking Issues or PRs that need to be closed before the next CAPI release label Sep 27, 2021
@sbueringer
Member

sbueringer commented Sep 27, 2021

Some additional context:

We actually also had this issue with CAPD, but it just never surfaced. In the CAPD case, the CAPD controller was upgraded first and then all the others, and it recovered automatically once the other providers had been deployed.

The difference with CAPA is that CAPA has a cluster-wide resource called AWSClusterControllerIdentity, and an instance of it already existed when the upgrade was run.

So what actually happened in the CAPA upgrade case:

  • CAPA provider was upgraded
  • CAPA didn't come up: its controller couldn't start because the new versions of the CAPI resources weren't deployed yet.
  • Upgrade of the next provider (I think it probably was kubeadm bootstrap) failed, because:
    • during deletion of the kubeadm bootstrap provider we're calling p.Proxy.ListResources
    • p.Proxy.ListResources lists all resources:
      • for namespaced resources only in the provider namespace
      • for cluster-wide resources cluster-wide
    • so it called list on the AWSClusterControllerIdentity resource and, because there actually was an instance of this resource, the conversion webhook was called
    • conversion webhook failed because the CAPA provider was down
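
The failure chain above can be sketched as a small simulation. Everything here is hypothetical (the `ResourceKind` and `Object` types, the `listResources` function, the namespace and object names) and only models the scope rule described in the list: namespaced kinds are listed in the provider namespace, cluster-wide kinds are listed across the cluster, and returning a stored object requires its owner's conversion webhook to be reachable. As a simplification, the sketch assumes every stored object needs conversion.

```go
package main

import (
	"errors"
	"fmt"
)

var errWebhookDown = errors.New("connection refused")

// ResourceKind is a hypothetical description of a CRD-backed kind.
type ResourceKind struct {
	Name       string
	Namespaced bool
	Owner      string // provider that serves the conversion webhook
}

// Object is a hypothetical stored instance; Namespace is empty for cluster-wide kinds.
type Object struct {
	Kind      string
	Namespace string
}

// listResources lists namespaced kinds only in providerNamespace and
// cluster-wide kinds across the whole cluster. Each returned object is
// assumed to go through its owner's conversion webhook.
func listResources(kinds []ResourceKind, stored []Object, providerNamespace string, webhookUp map[string]bool) ([]Object, error) {
	var result []Object
	for _, k := range kinds {
		for _, o := range stored {
			if o.Kind != k.Name {
				continue
			}
			if k.Namespaced && o.Namespace != providerNamespace {
				continue // namespaced objects outside the provider namespace are skipped
			}
			if !webhookUp[k.Owner] {
				return nil, fmt.Errorf("conversion webhook for %s failed: %w", k.Name, errWebhookDown)
			}
			result = append(result, o)
		}
	}
	return result, nil
}

func main() {
	kinds := []ResourceKind{
		{Name: "KubeadmConfig", Namespaced: true, Owner: "kubeadm-bootstrap"},
		{Name: "AWSClusterControllerIdentity", Namespaced: false, Owner: "capa"},
	}
	// CAPA was upgraded first and its controller is down.
	webhookUp := map[string]bool{"kubeadm-bootstrap": true, "capa": false}

	// An existing cluster-wide instance triggers the conversion webhook.
	stored := []Object{{Kind: "AWSClusterControllerIdentity"}}
	_, err := listResources(kinds, stored, "bootstrap-ns", webhookUp)
	fmt.Println("with existing instance:", err)

	// With no stored instance, no conversion is needed and the list succeeds.
	_, err = listResources(kinds, nil, "bootstrap-ns", webhookUp)
	fmt.Println("without instance:", err)
}
```

The second call models the CAPD case, where no existing cluster-wide instance forced a conversion and the problem never surfaced.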
