AKS clusters fail to create - they now need a verified domain of the organization #298

Closed
vfarcic opened this issue Oct 20, 2021 · 11 comments · Fixed by #314
Labels
bug Something isn't working

Comments

@vfarcic

vfarcic commented Oct 20, 2021

What happened?

When creating an AKS cluster, the following events are recorded in the akscluster description.

  Warning  CannotCreateExternalResource  89s (x12 over 97s)  managed/akscluster.compute.azure.crossplane.io  (combined from similar events): cannot create AKSCluster: graphrbac.ApplicationsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="Unknown" Message="Unknown service error" Details=[{"odata.error":{"code":"Request_BadRequest","date":"2021-10-20T00:26:05","message":{"lang":"en","value":"Values of identifierUris property must use a verified domain of the organization or its subdomain: 'https://xyz.aks.crossplane.io'"},"requestId":"017725b3-d03e-4e26-9a3c-04b67ae2b619","values":[{"item":"PropertyName","value":"identifierUris"},{"item":"PropertyErrorCode","value":"HostNameNotOnVerifiedDomain"},{"item":"HostName","value":"https://xyz.aks.crossplane.io"}]}}]
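For anyone who wants to detect this failure programmatically, here is a minimal sketch (an illustration only, not provider code) that pulls the PropertyErrorCode out of one Details entry of the error above. The type and function names are hypothetical, and it assumes the entry has already been extracted as a standalone JSON object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// odataDetail models only the fields of a Details entry that are visible in
// the error message above. The type name is hypothetical.
type odataDetail struct {
	Error struct {
		Code   string `json:"code"`
		Values []struct {
			Item  string `json:"item"`
			Value string `json:"value"`
		} `json:"values"`
	} `json:"odata.error"`
}

// propertyErrorCode returns the value of the "PropertyErrorCode" item,
// or an empty string when the entry does not carry one.
func propertyErrorCode(detail []byte) (string, error) {
	var d odataDetail
	if err := json.Unmarshal(detail, &d); err != nil {
		return "", err
	}
	for _, v := range d.Error.Values {
		if v.Item == "PropertyErrorCode" {
			return v.Value, nil
		}
	}
	return "", nil
}

func main() {
	// Shortened copy of the Details entry from the event above.
	detail := []byte(`{"odata.error":{"code":"Request_BadRequest","values":[` +
		`{"item":"PropertyName","value":"identifierUris"},` +
		`{"item":"PropertyErrorCode","value":"HostNameNotOnVerifiedDomain"}]}}`)
	code, err := propertyErrorCode(detail)
	fmt.Println(code, err)
}
```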

How can we reproduce it?

The output that follows is from kubectl get akscluster xyz -o yaml. It can be used to reproduce the issue.

apiVersion: compute.azure.crossplane.io/v1alpha3
kind: AKSCluster
metadata:
  annotations:
    crossplane.io/composition-resource-name: aks
    crossplane.io/external-name: xyz
  creationTimestamp: "2021-10-20T00:24:20Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generateName: xyz-
  generation: 2
  labels:
    crossplane.io/claim-name: ""
    crossplane.io/claim-namespace: ""
    crossplane.io/composite: xyz
  name: xyz
  ownerReferences:
  - apiVersion: devopstoolkitseries.com/v1alpha1
    controller: true
    kind: CompositeCluster
    name: xyz
    uid: 297cca23-106c-43c1-982b-d80b3dc1cea5
  resourceVersion: "58359"
  uid: e9b7fbfb-567f-4ede-8538-04a2eca89dbb
spec:
  dnsNamePrefix: dot
  location: eastus
  nodeCount: 3
  nodeVMSize: Standard_D2_v2
  providerConfigRef:
    name: default
  resourceGroupName: xyz
  version: 1.20.7
status:
  conditions:
  - lastTransitionTime: "2021-10-20T00:24:21Z"
    reason: Creating
    status: "False"
    type: Ready
  - lastTransitionTime: "2021-10-20T00:28:57Z"
    message: 'create failed: cannot create AKSCluster: graphrbac.ApplicationsClient#Create:
      Failure responding to request: StatusCode=400 -- Original Error: autorest/azure:
      Service returned an error. Status=400 Code="Unknown" Message="Unknown service
      error" Details=[{"odata.error":{"code":"Request_BadRequest","date":"2021-10-20T00:30:25","message":{"lang":"en","value":"Values
      of identifierUris property must use a verified domain of the organization or
      its subdomain: ''https://xyz.aks.crossplane.io''"},"requestId":"0a42de07-5487-42be-b951-831e3e57f10b","values":[{"item":"PropertyName","value":"identifierUris"},{"item":"PropertyErrorCode","value":"HostNameNotOnVerifiedDomain"},{"item":"HostName","value":"https://xyz.aks.crossplane.io"}]}}]'
    reason: ReconcileError
    status: "False"
    type: Synced
  endpoint: ""

What environment did it happen in?

  • Cloud provider or hardware configuration

Azure

  • Kubernetes version (use kubectl version)
clientVersion:
  buildDate: "2021-09-15T21:31:32Z"
  compiler: gc
  gitCommit: 8b5a19147530eaac9476b0ab82980b4088bbc1b2
  gitTreeState: clean
  gitVersion: v1.22.2
  goVersion: go1.16.8
  major: "1"
  minor: "22"
  platform: darwin/amd64
serverVersion:
  buildDate: "2021-10-05T19:59:14Z"
  compiler: gc
  gitCommit: 724ef700bab896ff252a75e2be996d5f4ff1b842
  gitTreeState: clean
  gitVersion: v1.21.5+k3s2
  goVersion: go1.16.8
  major: "1"
  minor: "21"
  platform: linux/amd64
  • Kubernetes distribution (e.g. Tectonic, GKE, OpenShift)

AKS

  • OS (e.g. from /etc/os-release)

macOS

  • Kernel (e.g. uname -a)
Darwin Viktors-iMac.local 20.6.0 Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64 x86_64
@vfarcic added the "bug" (Something isn't working) label Oct 20, 2021
@dougsanders

I am running into this issue too. It may be important to note that, currently, this only affects Single Tenant app registrations. However, Azure will start restricting Multi-Tenant app registrations on Nov. 9th.

@jbdell

jbdell commented Nov 15, 2021

I too am running into this issue. It does seem related to potential changes in Azure's Graph API. We've been running an older version of Crossplane (v.0.14.0). Even though all processes (Crossplane and the Azure provider) are still on that version, we receive the same error, which halts the creation of the AKSCluster resource.

I also confirmed this issue with the latest versions of Crossplane and the Azure provider.

@ulucinar
Collaborator

Looks like the identifier URI that we pass here is no longer a valid one due to the recent changes mentioned above.

@ulucinar
Collaborator

ulucinar commented Nov 18, 2021

According to the Azure AD application registration security best practices docs, if we use the https scheme for an App ID URI with a domain other than onmicrosoft.com, it needs to be a verified custom domain. My understanding is that, in that case, you need to register a custom domain name and have Azure verify it using DNS TXT records.

However, the api scheme does not require such validation, so I made a simple change that switches the scheme of the App ID URI we use from https to api. This resolves the initial issue of HTTP 400 - Values of identifierUris property must use a verified domain of the organization or its subdomain.
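To make the rule as described here concrete, a minimal sketch follows. It encodes only what is stated in this thread (https outside onmicrosoft.com needs a verified domain, api does not) and is not Azure's actual validation logic. The function name is hypothetical, and treating every non-https scheme as exempt is a simplifying assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// needsVerifiedDomain reports whether an App ID URI would be subject to the
// verified-domain check as described in this thread. Assumption: only the
// https scheme triggers the check; any other scheme is treated as exempt.
func needsVerifiedDomain(appIDURI string) (bool, error) {
	u, err := url.Parse(appIDURI)
	if err != nil {
		return false, err
	}
	if u.Scheme != "https" {
		return false, nil
	}
	host := strings.ToLower(u.Hostname())
	if host == "onmicrosoft.com" || strings.HasSuffix(host, ".onmicrosoft.com") {
		return false, nil
	}
	return true, nil
}

func main() {
	for _, uri := range []string{
		"https://xyz.aks.crossplane.io", // the URI rejected in this issue
		"api://xyz.aks.crossplane.io",   // the same name with the api scheme
	} {
		needs, err := needsVerifiedDomain(uri)
		fmt.Println(uri, needs, err)
	}
}
```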

However, my observation is that the Azure API now behaves differently in ways beyond domain name verification on App ID URIs. While debugging, I have had success in deploying an AKS cluster with that change, but when I run the provider, the initial role assignment attempts for the provisioned service principal fail, stating that the target service principal does not exist. That principal is provisioned by the provider here, and the Create call returns no errors. I think it takes some time for the principal to become available, and during that period, role assignment calls fail. Eventually, role assignment succeeds, but then the AKS cluster creation call may fail, stating that the service principal profile's secret specified here is not valid. During each reconciliation, if we deduce that the cluster is not in the "Creating" state, we update the associated application's shared secret.

If I properly time application/service principal/role assignment/cluster provisioning via debugging, the cluster can successfully be provisioned. I assume this eliminates any issues with the api scheme I'm using. If that were an invalid configuration, I would not expect to be able to successfully provision clusters with that configuration.

This looks more like a race condition in the provider to me. However, given the high frequency I observed in a small (4-5) number of trials, I do suspect that, apart from the domain validation of identifier URIs, there are some further behavior changes on the Azure side (specifically in the Microsoft graph/authorization services) that made the current implementation susceptible to race conditions.
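If the service principal really does take a while to become visible, one generic way to tolerate that window is to retry the role assignment with a growing delay. The following is only a sketch of that idea under stated assumptions: errPrincipalNotFound and retryWhileNotFound are hypothetical names, and the simulated call stands in for the real Azure request.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPrincipalNotFound is a hypothetical sentinel standing in for the
// "target service principal does not exist" failure described above.
var errPrincipalNotFound = errors.New("service principal not found")

// retryWhileNotFound calls op until it succeeds, fails with a different
// error, or the attempts are exhausted. The delay doubles after each try.
func retryWhileNotFound(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !errors.Is(err, errPrincipalNotFound) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated role assignment that only succeeds on the third call.
	assign := func() error {
		calls++
		if calls < 3 {
			return errPrincipalNotFound
		}
		return nil
	}
	err := retryWhileNotFound(5, 10*time.Millisecond, assign)
	fmt.Println(calls, err)
}
```

In a controller, returning the error and letting the reconciler requeue may serve the same purpose without blocking; the loop above just makes the timing dependency explicit.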

One theory is that we update the application's secret after a successful AKS cluster Create call in which we specified that secret. If the secret is shared with the AKS cluster in a successful Create call but is then updated on the application, the cluster may no longer be able to use the specified service principal.

I have also observed some cases in which the provider is stuck trying to provision a role assignment that has in fact already been provisioned. The provider first checks for the role assignment and, as far as I can tell, does not attempt to assign the role to the service principal if the assignment already exists, so these errors were not expected. Does this mean that the List call returns an empty set but subsequent Create calls for the role assignment fail with HTTP 409s (already exists errors)?
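If that is indeed what happens, one possible mitigation would be to treat a 409 on Create as "already assigned" rather than as a failure. A minimal sketch of that idea follows. The apiError type and both function names are hypothetical, and it assumes a 409 always means the same assignment already exists, which would need to be confirmed against the real API.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical error type carrying the HTTP status of a
// failed call. The real SDK error type is different.
type apiError struct {
	StatusCode int
	Msg        string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("status %d: %s", e.StatusCode, e.Msg)
}

// isConflict reports whether err wraps an apiError with HTTP 409.
func isConflict(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.StatusCode == http.StatusConflict
}

// ensureRoleAssignment runs create and treats "already exists" as success,
// so a stale empty List result no longer blocks reconciliation.
func ensureRoleAssignment(create func() error) error {
	if err := create(); err != nil && !isConflict(err) {
		return err
	}
	return nil
}

func main() {
	// Simulated Create call that reports the assignment as already present.
	err := ensureRoleAssignment(func() error {
		return &apiError{StatusCode: http.StatusConflict, Msg: "role assignment already exists"}
	})
	fmt.Println(err)
}
```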

@ulucinar
Collaborator

ulucinar commented Nov 18, 2021

I suspect that, to prevent the seemingly more frequent race condition on the application secret, we could store the generated shared secret in a K8s secret and reuse it in subsequent reconciliations. We have done something similar to address a PostgreSQL issue previously. I'm not sure about the HTTP 409s we get for the role assignments (described above), though. Maybe that's a temporary condition, meaning that eventually the RoleAssignments.ListForScopeComplete call returns a non-empty set.

cc. @negz, @sergenyalcin
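The "store it once and reuse it" idea could look roughly like the sketch below. It is an illustration under stated assumptions, not the provider's implementation: SecretStore is a hypothetical stand-in for reading and writing a K8s secret, and memStore is an in-memory substitute so the example runs on its own.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// SecretStore is a hypothetical abstraction over a K8s secret.
type SecretStore interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// memStore is an in-memory SecretStore used only for this example.
type memStore map[string]string

func (m memStore) Get(key string) (string, bool) { v, ok := m[key]; return v, ok }
func (m memStore) Set(key, value string)         { m[key] = value }

// getOrCreateSecret returns the stored secret for key. It generates and
// persists a new random value only when none exists yet, so repeated
// reconciliations keep using the same secret.
func getOrCreateSecret(s SecretStore, key string) (string, error) {
	if v, ok := s.Get(key); ok && v != "" {
		return v, nil
	}
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	v := base64.RawURLEncoding.EncodeToString(b)
	s.Set(key, v)
	return v, nil
}

func main() {
	store := memStore{}
	first, err := getOrCreateSecret(store, "xyz")
	if err != nil {
		panic(err)
	}
	second, _ := getOrCreateSecret(store, "xyz")
	// The second reconciliation reuses the secret from the first one.
	fmt.Println(first == second)
}
```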

@e9169

e9169 commented Nov 18, 2021

@ulucinar could you please share how we can make the change to api instead of https as a temporary workaround? I'm stuck with the creation of new clusters. Thanks!

@jbw976 changed the title from "Verified domain of the organization" to "AKS clusters fail to create - they now need a verified domain of the organization" Nov 18, 2021
@jbw976
Member

jbw976 commented Nov 18, 2021

@haarchri shared this link in the community meeting that may be useful about the breaking Azure change that is causing this:

https://thewindowsupdate.com/2021/10/14/azure-active-directory-breaking-change-impacting-azure-cli-and-azure-powershell/

@ulucinar
Collaborator

ulucinar commented Nov 18, 2021

@ulucinar could you please share how we can make the change to api instead of https as a temporary workaround? I'm stuck with the creation of new clusters. Thanks!

Hi @e9169,
Changing the scheme from https (which requires custom domain verification) to api (which does not) means replacing this line with the following and then rebuilding the provider:

url := fmt.Sprintf("api://%s.aks.crossplane.io", name)

As discussed in the above comment, this addresses the initial problem of the REST API returning an HTTP 400 stating that custom domain verification fails. However, it looks like we have some other issues to be addressed as discussed above.

If you get a chance to try it, I'd love to hear about your results. With that change, I was able to provision clusters while debugging the provider but when running it, I hit other issues, which I'd like to investigate.

@e9169

e9169 commented Nov 19, 2021

Would it be possible to enable custom domains in the call that aks.go makes? The problem is that the URI being sent is something like https://cluster.aks.azure.crossplane.io, and obviously that doesn't exist. But for those who have a custom domain in place, it could be as simple as editing a field in the YAML if this were enabled.

@e9169

e9169 commented Nov 19, 2021

I made the change and built the provider, but the build generates nothing inside cluster/local apart from a testing script. The Helm chart is not created, although the Docker images are. I pushed the image to minikube and tried to modify the deployment to use my build, but the image keeps being replaced with the stable one. I guess that's because of this package:

NAME                                        INSTALLED   HEALTHY   PACKAGE                            AGE
provider.pkg.crossplane.io/provider-azure   True        True      crossplane/provider-azure:master   43h

@ulucinar
Collaborator

Would it be possible to enable custom domains in the call that aks.go makes? The problem is that the URI being sent is something like https://cluster.aks.azure.crossplane.io, and obviously that doesn't exist. But for those who have a custom domain in place, it could be as simple as editing a field in the YAML if this were enabled.

Hi @e9169,
I think we should capture this with a feature request in this repo. I believe, because the service principal associated with the graph Application we are provisioning is intended to be only used by the AKS cluster, we have hardcoded an App ID URI in the code.
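For the feature request, the shape of the change might be something like the sketch below. It is purely illustrative: customDomain is a hypothetical optional setting that does not exist in the provider today, and falling back to the api scheme with the hardcoded aks.crossplane.io suffix is an assumption based on the workaround discussed earlier in this thread.

```go
package main

import "fmt"

// defaultDomain matches the hardcoded suffix used in the workaround above.
const defaultDomain = "aks.crossplane.io"

// appIDURI builds the App ID URI for a cluster name. customDomain is a
// hypothetical optional setting: when it is set, an https URI on that
// (presumably verified) domain is used; otherwise the api scheme is used.
func appIDURI(name, customDomain string) string {
	if customDomain != "" {
		return fmt.Sprintf("https://%s.%s", name, customDomain)
	}
	return fmt.Sprintf("api://%s.%s", name, defaultDomain)
}

func main() {
	fmt.Println(appIDURI("xyz", ""))
	fmt.Println(appIDURI("xyz", "clusters.example.com"))
}
```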

ulucinar added a commit to ulucinar/provider-azure that referenced this issue Dec 5, 2021
- Fixes crossplane-contrib#298

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
ulucinar added a commit to ulucinar/provider-azure that referenced this issue Dec 8, 2021
- Fixes crossplane-contrib#298

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
github-actions bot pushed a commit that referenced this issue Dec 10, 2021
- Fixes #298

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
(cherry picked from commit 15af7ee)
ulucinar added a commit that referenced this issue Dec 13, 2021
- Fixes #298

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
(cherry picked from commit 15af7ee)

6 participants