
[Bug]: UserAssignedIdentities and FederatedIdentityCredentials are not able to sync since v1.0.0 #740

Open
gravufo opened this issue May 14, 2024 · 2 comments

Comments

@gravufo
Contributor

gravufo commented May 14, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

  • managedidentity.azure.upbound.io/v1beta1 - UserAssignedIdentity
  • managedidentity.azure.upbound.io/v1beta1 - FederatedIdentityCredentials

Resource MRs required to reproduce the bug

apiVersion: managedidentity.azure.upbound.io/v1beta1
kind: UserAssignedIdentity
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/<redacted>/resourceGroups/rg-dev/providers/Microsoft.ManagedIdentity/userAssignedIdentities/msi-dev
  name: msi-dev
spec:
  forProvider:
    location: eastus
    name: msi-dev
    resourceGroupName: rg-dev
  managementPolicies:
  - Observe
  providerConfigRef:
    name: default

Steps to Reproduce

Apply >1000 UserAssignedIdentities in Observe mode and let them get synced and ready using version v0.42.0.
Then, upgrade the provider to v1.0.0 (or later) and watch the objects start becoming unsynced.

What happened?

We are getting a lot of context deadline exceeded errors, such as this:
[screenshot: provider logs showing "context deadline exceeded" errors]

We can also see the Synced state of the objects dropping heavily and not recovering:
[screenshot: Synced state of the objects dropping]

Note that the FederatedIdentityCredentials also seem to be affected.
We did not see this behavior at small scale (<10 objects), but it happens consistently when the scale is in the thousands.

Relevant Error Output Snippet

No response

Crossplane Version

v1.15.2

Provider Version

v1.1.0

Kubernetes Version

v1.28.5

Kubernetes Distribution

AKS

Additional Info

I had created a thread in Slack here: https://crossplane.slack.com/archives/C019VE11LJJ/p1711905230102149
It may disappear if message retention kicks in.

@gravufo gravufo added bug Something isn't working needs:triage labels May 14, 2024
@gravufo
Contributor Author

gravufo commented Jul 25, 2024

More information:

  • We tried tuning --max-reconcile-rate, but ultimately could not find a value that worked properly. Setting it too low makes it impossible to get through all objects (given the sheer number of resources to reconcile), while setting it too high just makes all resources fail faster.
  • Digging further, the context deadline exceeded error we see in the logs appears to be tied to the reconcileTimeout and reconcileGracePeriod constants, as can be seen here: https://github.com/crossplane/crossplane-runtime/blob/1e7193e9c065f7f5ceef465a824e111174464687/pkg/reconciler/managed/reconciler.go#L47C2-L47C40
    • We think that when the issue happens we are hitting rate limiting on the Azure API (hard to prove, since we never see the real error). The underlying azurerm Terraform provider uses the official Azure SDK for Go, which handles rate limiting by respecting the Retry-After header of 429 responses and retrying the call after the specified time. Those waits push the reconcile past the hardcoded reconcileTimeout and reconcileGracePeriod limits, so a context deadline exceeded error bubbles up and the Synced state turns to False. A simplified sketch of this failure mode follows this list.
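For illustration only (this is not the upstream crossplane-runtime code, and the constant values are approximate; check them against the linked reconciler.go), a minimal Go sketch of how a throttled Azure call interacts with the hardcoded reconcile deadline:

```go
// Minimal sketch of the failure mode described above. The constants and the
// deadline arithmetic only approximate crossplane-runtime's managed reconciler.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	reconcileTimeout     = 1 * time.Minute  // hardcoded default reconcile timeout (approximate)
	reconcileGracePeriod = 30 * time.Second // extra headroom added on top of the timeout (approximate)
)

// observeExternal stands in for the Terraform/Azure SDK call. When Azure
// answers 429, the SDK waits for the Retry-After duration before retrying,
// so a long Retry-After simply blocks until the reconcile deadline expires.
func observeExternal(ctx context.Context, retryAfter time.Duration) error {
	select {
	case <-time.After(retryAfter):
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	// Every reconcile runs under a hard deadline of roughly timeout + grace period.
	ctx, cancel := context.WithTimeout(context.Background(), reconcileTimeout+reconcileGracePeriod)
	defer cancel()

	// A Retry-After longer than the remaining deadline turns a throttled call
	// into a failed reconcile, which is what flips the Synced condition to False.
	if err := observeExternal(ctx, 2*time.Minute); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("cannot observe external resource:", err)
	}
}
```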

On our side, we created our own custom provider that uses the Azure SDK for Go directly and implemented an optimisation (spec hash + last external reconcile time) to reduce external calls to a strict minimum and lower the chance of hitting external rate limits; a sketch of the idea follows the screenshots below.
We can see the results here:

MR states (left is our custom provider, right is provider-upjet-azure)
We can see that the new provider reconciles everything from scratch quite fast, whereas the upjet provider drops the Synced state almost immediately on a pod restart and then struggles to recover.
[screenshot: MR Synced/Ready state comparison between the custom provider and provider-upjet-azure]

The first screenshot below shows the work queue depth of provider-upjet-azure and the second shows the same metric for our custom provider. Our custom provider gets through its queue quickly, while the upjet provider constantly struggles to drain its queue, especially after the pod restart (around 12).
[screenshot: provider-upjet-azure work queue depth]
[screenshot: custom provider work queue depth]
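For illustration only (this is not our actual code; the annotation keys and the 1h resync interval below are placeholders), the gist of the spec hash + last external reconcile time check is something like:

```go
// Sketch of the optimisation described above: skip the external API call when
// the spec is unchanged and the last external reconcile is recent enough.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

const (
	annoSpecHash      = "example.org/last-reconciled-spec-hash" // hypothetical annotation keys
	annoLastReconcile = "example.org/last-external-reconcile"
	resyncInterval    = 1 * time.Hour // how stale an observation we tolerate before calling Azure again
)

// hashSpec produces a stable fingerprint of spec.forProvider.
func hashSpec(spec any) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// shouldCallExternalAPI decides whether Observe needs to hit the Azure API at all.
// Skipping the call for unchanged, recently observed resources keeps the request
// volume (and the chance of hitting 429s) to a minimum.
func shouldCallExternalAPI(annotations map[string]string, spec any, now time.Time) (bool, error) {
	h, err := hashSpec(spec)
	if err != nil {
		return true, err
	}
	if annotations[annoSpecHash] != h {
		return true, nil // spec changed since the last successful reconcile
	}
	last, err := time.Parse(time.RFC3339, annotations[annoLastReconcile])
	if err != nil {
		return true, nil // no (or unparsable) timestamp: reconcile to be safe
	}
	return now.Sub(last) >= resyncInterval, nil
}

func main() {
	annotations := map[string]string{}
	spec := map[string]string{"name": "msi-dev", "resourceGroupName": "rg-dev", "location": "eastus"}

	call, _ := shouldCallExternalAPI(annotations, spec, time.Now())
	fmt.Println("first pass, must call Azure:", call) // true

	// After a successful external reconcile, record the hash and the timestamp.
	h, _ := hashSpec(spec)
	annotations[annoSpecHash] = h
	annotations[annoLastReconcile] = time.Now().Format(time.RFC3339)

	call, _ = shouldCallExternalAPI(annotations, spec, time.Now())
	fmt.Println("unchanged spec shortly after, must call Azure:", call) // false
}
```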

Overall, I think the main point here would be to figure out how external rate limiting is handled in this provider and/or upjet, and to see whether there is a better way of handling it.

Hope this helps pinpoint the issue a little more.

@jbw976
Member

jbw976 commented Jul 25, 2024

Related crossplane-runtime issue: crossplane/crossplane-runtime#696

Thanks again for all this data and insight @gravufo! 🙇‍♂️
