[Bug]: Conversion Webhook for containerservice.azure.upbound.io/v1beta1 sometimes fails #784

Open
1 task done
b-deam opened this issue Jul 17, 2024 · 3 comments
Labels: bug, needs:triage

b-deam commented Jul 17, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

kubernetesclusters.containerservice.azure.upbound.io/v1beta1

Resource MRs required to reproduce the bug

No response

Steps to Reproduce

What happened?

Occasionally the XR Synced status switches to False temporarily due to conversion webhook error(s). This status does not propagate upwards to the claim.

Relevant Error Output Snippet

Warning  ComposeResources  2m15s (x9 over 16m)    defined/compositeresourcedefinition.apiextensions.crossplane.io  cannot compose resources: cannot apply composed resource "aks_cluster": failed to prune fields: failed add back owned items: failed to convert pruned object at version containerservice.azure.upbound.io/v1beta2: conversion webhook for containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster returned invalid metadata: invalid metadata of type <nil> in input object

And beta trace:

$ crossplane beta trace -n claim-namespace myaks dev-aks-cluster-1 -o wide
NAME                                                                                RESOURCE                                    SYNCED   READY   STATUS
AKS/dev-aks-cluster-1 (dev-azure-eastus2)                                                                                     True     True    Available
└─ XAKS/dev-aks-cluster-1-mnxjr                                                                                               False    True    ReconcileError: cannot compose resources: cannot apply composed resource "aks_cluster": failed to prune fields: failed add back owned items: failed to convert pruned object at version containerservice.azure.upbound.io/v1beta2: conversion webhook for containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster returned invalid metadata: invalid metadata of type <nil> in input object
   ├─ XAKSNodepoolSet/dev-aks-cluster-1-nodepool-set                              nodepool_set                                True     True    Available
   │  ├─ KubernetesClusterNodePool/dev-aks-cluster-1-generalnp                    ng_generalnp                              True     True    Available
   │  └─ NodePoolCalculation/dev-aks-cluster-1-calc                               node_pool_calculation                       True     True    Available
   ├─ XPermission/dev-aks-cluster-1-ca-permission                                 cluster_autoscaler_permission               True     True    Available
   │  ├─ RoleAssignment/dev-aks-cluster-1-cluster-autoscaler-assignment           role_assignment                             True     True    Available
   │  ├─ RoleDefinition/dev-aks-cluster-1-cluster-autoscaler-definition           role_definition                             True     True    Available
   │  ├─ FederatedIdentityCredential/dev-aks-cluster-1-cluster-autoscaler-fedid   federated_identity                          True     True    Available
   │  └─ UserAssignedIdentity/dev-aks-cluster-1-cluster-autoscaler-identity       identity                                    True     True    Available
   ├─ KubernetesCluster/dev-aks-cluster-1-mnxjr                                   aks_cluster                                 True     True    Available
   ├─ EventHubNamespace/dev-aks-cluster-1-test                                     queue-dev-aks-cluster-1-test               True     True    Available
   ├─ EventHubNamespace/dev-aks-cluster-1-test-data                              queue-dev-aks-cluster-1-test-data        True     True    Available

Crossplane Version

1.14.3

Provider Version

1.3.0

Kubernetes Version

v1.28.9

Kubernetes Distribution

AKS

Additional Info

Hi, we were previously hitting #645 and the workaround/fix was to upgrade our MR:

- kubernetesclusters.containerservice.azure.upbound.io/v1beta1
+ kubernetesclusters.containerservice.azure.upbound.io/v1beta2

Since upgrading, we're noticing that occasionally (and seemingly randomly) the XR Synced status flips to False due to a conversion webhook failure. The condition lasts only tens of seconds before Synced reverts to True, and it can sometimes take over an hour to recur. This status change does not propagate upwards to the claim (as shown in the crossplane beta trace output).
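Because the flip lasts only tens of seconds, it is easy to miss by eye. The following is a minimal sketch of how the condition could be detected from an XR dumped as JSON (for example with `kubectl get <xr> -o json`). The helper names and the sample object are hypothetical; only the `status.conditions` layout and the error text follow the output quoted in this issue.

```python
def synced_condition(resource: dict):
    """Return the Synced condition of a Crossplane resource, or None if absent."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Synced":
            return cond
    return None


def is_conversion_failure(resource: dict) -> bool:
    """True when Synced is False and the message mentions a conversion webhook."""
    cond = synced_condition(resource)
    if cond is None or cond.get("status") != "False":
        return False
    return "conversion webhook" in cond.get("message", "")


# Hypothetical XR fragment; the message is the tail of the error reported above.
SAMPLE_XR = {
    "status": {
        "conditions": [
            {
                "type": "Synced",
                "status": "False",
                "reason": "ReconcileError",
                "message": (
                    "conversion webhook for containerservice.azure.upbound.io/v1beta1, "
                    "Kind=KubernetesCluster returned invalid metadata: "
                    "invalid metadata of type <nil> in input object"
                ),
            },
            {"type": "Ready", "status": "True", "reason": "Available"},
        ]
    }
}

if __name__ == "__main__":
    print(is_conversion_failure(SAMPLE_XR))
```

Running a check like this in a polling loop against the XR (rather than the claim) would be needed, since the claim stays Synced throughout.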

Unfortunately I don't have a minimal reproduction available. The environment and configuration displaying this behaviour are very complex (multiple pipeline steps, etc.), and I haven't had much luck reproducing it in an environment running in kind.

There are no interesting logs in the Crossplane, Provider, or Function pods (even with --debug enabled), or the AKS control plane.

My limited debugging led me to this Kubernetes bug: kubernetes/kubernetes#117356 - which seems plausible as I can see that the CRD only stores v1beta1:
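As a rough sketch of that check, the snippet below summarises which versions a CRD serves and which one it stores. It assumes the CRD has been dumped with `kubectl get crd kubernetesclusters.containerservice.azure.upbound.io -o json`; the function names are hypothetical, and `SAMPLE_CRD` is an illustrative fragment that mirrors the situation described here, not a copy of the provider's actual CRD.

```python
import json


def storage_summary(crd: dict) -> dict:
    """Summarise served, storage, and stored versions of a CRD object."""
    versions = crd["spec"]["versions"]
    return {
        "served": [v["name"] for v in versions if v.get("served")],
        "storage": [v["name"] for v in versions if v.get("storage")],
        "stored": crd.get("status", {}).get("storedVersions", []),
    }


def needs_conversion(crd: dict, requested_version: str) -> bool:
    """True when the requested version differs from the storage version,
    so reads and writes at that version go through the conversion webhook."""
    return requested_version not in storage_summary(crd)["storage"]


# Illustrative fragment only: v1beta1 is the storage version, v1beta2 is served.
SAMPLE_CRD = json.loads("""
{
  "spec": {
    "versions": [
      {"name": "v1beta1", "served": true, "storage": true},
      {"name": "v1beta2", "served": true, "storage": false}
    ]
  },
  "status": {"storedVersions": ["v1beta1"]}
}
""")

if __name__ == "__main__":
    print(storage_summary(SAMPLE_CRD))
    print(needs_conversion(SAMPLE_CRD, "v1beta2"))
```

With a layout like the sample, every request made at v1beta2 is converted to and from v1beta1, which is the path on which the webhook error is reported.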

Any ideas on where to look next?

b-deam added the bug and needs:triage labels on Jul 17, 2024

b-deam commented Jul 18, 2024

FWIW, we saw this exact same bug with the Kubernetes provider's Objects. It was solved by moving our resources from v1alpha1 to v1alpha2 (which is the stored version):
https://github.com/crossplane-contrib/provider-kubernetes/blob/5bfb71a932d71ada6e29b7bce4f2b4b8162f8ef9/package/crds/kubernetes.crossplane.io_objects.yaml#L865

To me, that suggests that we are indeed hitting kubernetes/kubernetes#117356.

I'd say that if the KubernetesCluster stored version were moved to v1beta2, we wouldn't see this issue.

b-deam commented Jul 22, 2024

We recreated our claim (by deleting it and all associated XRs, MRs etc.) and the conversion webhook errors haven't reappeared for a number of days.

This chart shows the count of webhook conversion failures over a ~7 day period. You can see that the failures have stopped entirely around the time we deleted/recreated the claim.
[image: chart of webhook conversion failure counts over a ~7 day period]
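For anyone wanting a similar count without a metrics pipeline, here is a minimal sketch that buckets failures per day from Kubernetes Event objects (for example the `items` of `kubectl get events -o json`). This is not how the chart above was produced. The function name and the sample events, including their timestamps and counts, are made up for illustration.

```python
from collections import Counter


def failures_per_day(events: list) -> dict:
    """Sum conversion webhook failure events per calendar day (UTC date prefix)."""
    per_day = Counter()
    for ev in events:
        if "conversion webhook" not in ev.get("message", ""):
            continue
        ts = ev.get("lastTimestamp")
        if not ts:
            continue  # some events carry no lastTimestamp; skip them
        day = ts[:10]  # RFC 3339 timestamps start with YYYY-MM-DD
        per_day[day] += ev.get("count") or 1  # "count" is the repeat counter
    return dict(per_day)


# Made-up sample events, not data from this issue.
SAMPLE_EVENTS = [
    {"lastTimestamp": "2024-07-15T01:00:00Z", "count": 9,
     "message": "cannot compose resources: conversion webhook returned invalid metadata"},
    {"lastTimestamp": "2024-07-15T09:30:00Z",
     "message": "cannot compose resources: conversion webhook returned invalid metadata"},
    {"lastTimestamp": "2024-07-16T12:00:00Z", "count": 2,
     "message": "cannot compose resources: conversion webhook returned invalid metadata"},
    {"lastTimestamp": "2024-07-16T12:05:00Z", "count": 5,
     "message": "Successfully composed resources"},
]

if __name__ == "__main__":
    print(failures_per_day(SAMPLE_EVENTS))
```

Note that cluster events are retained only for a limited time, so a count like this covers a shorter window than a metrics-backed chart would.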

It's worth noting that we had originally upgraded the MR from v1beta1 to v1beta2, so I'm not sure if recreating it at v1beta2 has anything to do with the lack of failures, but that seems unlikely to me as this error only appeared weeks after the initial claim creation.

If we see the error reappear, I'll update the issue.

b-deam commented Aug 13, 2024

Just adding here that we've seen this reappear ~3 weeks later. No obvious correlation between the errors reappearing and changes to our composition etc.
