Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clusterctl can fail idempotency when installing cert-manager during upgrade #10389

Closed
neolit123 opened this issue Apr 5, 2024 · 4 comments · Fixed by #10407
Closed

clusterctl can fail idempotency when installing cert-manager during upgrade #10389

neolit123 opened this issue Apr 5, 2024 · 4 comments · Fixed by #10407
Assignees
Labels
area/clusterctl Issues or PRs related to clusterctl kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@neolit123
Copy link
Member

neolit123 commented Apr 5, 2024

What steps did you take and what happened?

not necessary a blocking bug, but IIUC there is something that can be improved for the idempotency of clusterctl upgrades of certmanager. appreciated feedback on this as i was not able to reproduce it personally (downstream report) and i'm not too familiar with cert-manager objects.

reading the logic of clusterctl upgrade.

ApplyUpgrade is called:

func (c *clusterctlClient) ApplyUpgrade(ctx context.Context, options ApplyUpgradeOptions) error {

EnsureLatestVersion for certmanager is called:

if err := certManager.EnsureLatestVersion(ctx); err != nil {

then shouldUpgrade() is called:

currentVersion, _, shouldUpgrade, err := cm.shouldUpgrade(objs)

which determines if an upgrade is required, by looking at objects that have the version annotation

objVersion, ok := obj.GetAnnotations()[clusterctlv1.CertManagerVersionAnnotation]

if not it returns, if yes it calls install()

// Install all cert-manager manifests
createCertManagerBackoff := newWriteBackoff()
objs = utilresource.SortForCreate(objs)
for i := range objs {
o := objs[i]
// Create the Kubernetes object.
// Nb. The operation is wrapped in a retry loop to make ensureCerts more resilient to unexpected conditions.
if err := retryWithExponentialBackoff(ctx, createCertManagerBackoff, func(ctx context.Context) error {
return cm.createObj(ctx, o)
}); err != nil {
return err
}
}

if i'm understanding this right, if during the install() loop a certain creatObj fails, then EnsureLatestVersion() will exit in an error, which is OK.

however, this can leave the cert-manager install in a partial state, so if the user attempt to call ApplyUpgrade() again, shouldUpgrade() can return false because the version annotation could be there and up-to-date, however not all objects were installed.

before install() objects are deleted correctly
https://github.com/kubernetes-sigs/cluster-api/blob/main/cmd/clusterctl/client/cluster/cert_manager.go#L263-L269

proposed solution - check if all obj are installed before checking version.
not sure how feasible is that as it must "keep inventory" somehow

log.Info("Checking validity of cert-manager install...")
validInstall, err := cm.validateInstall()
if err != nil {
	return err
}
if validInstall {
	log.Info("Checking cert-manager version...")
	currentVersion, _, shouldUpgrade, err := cm.shouldUpgrade(objs)
	if err != nil {
		return err
	}
}

or other ideas are welcome.

What did you expect to happen?

to be possible for clusterctl to recover from partially installed cert-manager with idempotent retries.

Cluster API version

latest master

Kubernetes version

latest master

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
/area clusterctl

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/clusterctl Issues or PRs related to clusterctl labels Apr 5, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 5, 2024
@neolit123
Copy link
Member Author

i can work on this if we agree on a path forward.
/assign

@neolit123
Copy link
Member Author

after discussion with @chrischdi offline the code seems moslty idempotent.

however shouldUpgrade should ideally check the number of objects of the certManager version is the same as the version to be installed. this would prevent the case:

  • install was partial / there are missing objects
  • shouldInstall gets a version and returns false

@fabriziopandini
Copy link
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clusterctl Issues or PRs related to clusterctl kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants