
Nodes created by Karpenter are unable to pull images from a private Azure Container Registry (ACR), resulting in a 401 Unauthorized error #411

ATymus opened this issue Jun 19, 2024 · 8 comments · Fixed by #456

ATymus commented Jun 19, 2024

Version

Karpenter Version: v0.5.0

Kubernetes Version: v1.29.4

Expected Behavior

The expected behavior is that the nodes can access the private ACR using the configured managed identity.

Actual Behavior

Nodes created by Karpenter and regular Kubernetes nodes both have the same managed identity configured. This managed identity has been granted both AcrPull and AcrPush roles on the ACR. However, while pods on regular Kubernetes nodes can successfully pull images from the private ACR, pods on nodes created by Karpenter fail with the following error: 401 Unauthorized
[Screenshot attached (2024-06-19) showing the 401 Unauthorized pull error]

Steps to Reproduce the Problem

az aks update -n aks-dev -g rg-dev --attach-acr myregistry

Resource Specs and Logs

Events:
Type Reason Age From Message


Warning FailedScheduling 39m default-scheduler 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Normal Scheduled 37m default-scheduler Successfully assigned default/test-779d54dfd-djk7d to aks-general-purpose-zfxqd
Normal Nominated 39m karpenter Pod should schedule on: nodeclaim/general-purpose-zfxqd
Warning FailedCreatePodSandBox 37m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1d4cfc55733b95293627a57ffb7a20de269debdf1b2afc2116aeb103042afeb4": plugin type="cilium-cni" failed (add): failed to invoke delegated plugin ADD for IPAM: http request failed: Post "http://localhost:10090/network/requestipconfigs": dial tcp 127.0.0.1:10090: connect: connection refused; failed to request IP address from CNS
Normal SandboxChanged 36m (x5 over 37m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 36m (x3 over 36m) kubelet Pulling image "myregistry.azurecr.io/test-image:latest"
Warning Failed 36m (x3 over 36m) kubelet Failed to pull image "myregistry.azurecr.io/test-image:latest": failed to pull and unpack image "myregistry.azurecr.io/test-image:latest": failed to resolve reference "myregistry.azurecr.io/test-image:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Atest-image%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
Warning Failed 36m (x3 over 36m) kubelet Error: ErrImagePull
Warning Failed 35m (x5 over 36m) kubelet Error: ImagePullBackOff
Normal BackOff 2m29s (x152 over 36m) kubelet Back-off pulling image "myregistry.azurecr.io/test-image:latest"

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
danielhamelberg commented Jul 4, 2024

@ATymus I recommend enabling the debug log level in Karpenter, redeploying and sharing more Resource Specs and Logs:
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
az aks show --resource-group <resource-group> --name <aks-cluster> --query "identity"
az role assignment list --assignee <managed-identity-id> --scope <acr-id>
Also double-check the secret the pod is using to access the ACR.
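One more check that may help (a sketch; the resource names are placeholders): image pulls on the nodes are authorized by the kubelet identity rather than the cluster's control-plane identity, so it is worth listing the role assignments for that identity specifically:

# Sketch with placeholder names: check the kubelet identity's role assignments on the ACR
kubeletObjID=$(az aks show -g <resource-group> -n <aks-cluster> --query identityProfile.kubeletidentity.objectId -o tsv)
acrResID=$(az acr show -n <acr-name> --query id -o tsv)
az role assignment list --assignee ${kubeletObjID} --scope ${acrResID} -o table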

ATymus commented Jul 11, 2024

I have debug mode enabled in Karpenter, but there are no errors related to this problem.
kubectl describe pod <pod-name> -n <namespace>

Events:
Type Reason Age From Message


Normal SandboxChanged 60m (x5 over 60m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 59m (x3 over 60m) kubelet Pulling image "myregistry.azurecr.io/test-image:latest"
Warning Failed 59m (x3 over 60m) kubelet Failed to pull image "myregistry.azurecr.io/test-image:latest": failed to pull and unpack image "myregistry.azurecr.io/test-image:latest": failed to resolve reference "myregistry.azurecr.io/test-image:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://myregistry.azurecr.io/oauth2/token?scope=repository%3Atest-image%3Apull&service=myregistry.azurecr.io: 401 Unauthorized
Warning Failed 59m (x3 over 60m) kubelet Error: ErrImagePull
Warning Failed 58m (x5 over 60m) kubelet Error: ImagePullBackOff
Normal BackOff 45s (x261 over 60m) kubelet Back-off pulling image "myregistry.azurecr.io/test-image:latest"

kubectl describe node <node-name>

Events:
Type Reason Age From Message


Normal Starting 59m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 59m (x8 over 59m) kubelet Node aks-general-purpose-jgmcf status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 59m (x8 over 59m) kubelet Node aks-general-purpose-jgmcf status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 59m (x7 over 59m) kubelet Node aks-general-purpose-jgmcf status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 59m kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 59m node-controller Node aks-general-purpose-jgmcf event: Registered Node aks-general-purpose-jgmcf in Controller
Normal CreatedNNC 59m (x2 over 59m) dnc-rc/node-reconciler Created NodeNetworkConfig aks-general-purpose-jgmcf
Normal Unconsolidatable 10m (x4 over 57m) karpenter Can't replace with a cheaper node

az aks show --resource-group <resource-group> --name <aks-cluster> --query "identity"

> {
>   "delegatedResources": null,
>   "principalId": "26***34",
>   "tenantId": "91***9",
>   "type": "SystemAssigned",
>   "userAssignedIdentities": null
> }
> 

az role assignment list --assignee <managed-identity-id> --scope <acr-id>


>  [
>   {
>     "condition": null,
>     "conditionVersion": null,
>     "createdBy": "a5***71",
>     "createdOn": "2024-05-24T07:52:02.216406+00:00",
>     "delegatedManagedIdentityResourceId": null,
>     "description": "",
>     "id": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry/providers/Microsoft.Authorization/roleAssignments/61***1b",
>     "name": "61***1b",
>     "principalId": "3b***1f",
>     "principalName": "49***39",
>     "principalType": "ServicePrincipal",
>     "resourceGroup": "test-dev",
>     "roleDefinitionId": "/subscriptions/3a***53/providers/Microsoft.Authorization/roleDefinitions/83***ec",
>     "roleDefinitionName": "AcrPush",
>     "scope": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry",
>     "type": "Microsoft.Authorization/roleAssignments",
>     "updatedBy": "a5***71",
>     "updatedOn": "2024-05-24T07:52:02.216406+00:00"
>   },
>   {
>     "condition": null,
>     "conditionVersion": null,
>     "createdBy": "a5***71",
>     "createdOn": "2024-05-24T07:52:02.577232+00:00",
>     "delegatedManagedIdentityResourceId": null,
>     "description": "",
>     "id": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry/providers/Microsoft.Authorization/roleAssignments/85***d5",
>     "name": "85***5",
>     "principalId": "3b***1f",
>     "principalName": "49***39",
>     "principalType": "ServicePrincipal",
>     "resourceGroup": "test-dev",
>     "roleDefinitionId": "/subscriptions/3a***53/providers/Microsoft.Authorization/roleDefinitions/7f***8d",
>     "roleDefinitionName": "AcrPull",
>     "scope": "/subscriptions/3a***53/resourceGroups/test-dev/providers/Microsoft.ContainerRegistry/registries/myregistry",
>     "type": "Microsoft.Authorization/roleAssignments",
>     "updatedBy": "a5***71",
>     "updatedOn": "2024-05-24T07:52:02.577232+00:00"
>   }
> ]

JoeyC-Dev commented Aug 7, 2024

@danielhamelberg Looks like the issue is reproducible. Both granting the permission manually and using the --attach-acr method fail.

Permission is definitely there:

Using an image pull secret should work as a workaround, but it relies on a password-like credential and should not become the intended approach.
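For completeness, a minimal sketch of that workaround (assuming the registry's admin user is enabled; names are placeholders), using a docker-registry secret attached to the default service account:

# Workaround sketch only; this uses a password-like credential rather than the managed identity
az acr update -n <acr-name> --admin-enabled true
acrPassword=$(az acr credential show -n <acr-name> --query "passwords[0].value" -o tsv)
kubectl create secret docker-registry acr-pull-secret \
  --docker-server=<acr-name>.azurecr.io \
  --docker-username=<acr-name> \
  --docker-password=${acrPassword}
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "acr-pull-secret"}]}'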

The full demo setup for the issue is below. (Please execute the commands one by one; I did not include the command to grant "Azure Kubernetes Service RBAC Cluster Admin" to the logged-in user.)

ranNum=$(echo $RANDOM)
rG=aks-auto-${ranNum}
aks=aks-auto-${ranNum}
acr=acrauto${ranNum}
location=southeastasia

az extension add --name aks-preview

az group create -n ${rG} -l ${location} -o none

# Specify "Standard_D8pds_v5" as this is the size in my subscription that can be created across all 3 availability zones
az aks create -n ${aks} -g ${rG} --node-vm-size Standard_D8pds_v5 \
--sku automatic --no-ssh-key

az acr create --resource-group ${rG} --name ${acr} --sku Basic
az acr login --name ${acr}
docker pull nginx
docker tag nginx ${acr}.azurecr.io/nginx
docker push ${acr}.azurecr.io/nginx

kubeletObjID=$(az aks show -n ${aks} -g ${rG} --query identityProfile.kubeletidentity.objectId -o tsv)

acrResID=$(az resource show -n ${acr} -g ${rG} \
--namespace Microsoft.ContainerRegistry --resource-type registries --query id -o tsv)

az role assignment create --assignee-object-id ${kubeletObjID} \
--assignee-principal-type ServicePrincipal --role "AcrPull" --scope ${acrResID}

# Grant your own user "Azure Kubernetes Service RBAC Cluster Admin"; the CLI command for that is omitted here.

az aks get-credentials -n ${aks} -g ${rG} 

# Deploy test Pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: ${acr}.azurecr.io/nginx
    imagePullPolicy: IfNotPresent
EOF

# Wait ~3 minutes for the new node to be provisioned, then check the result
sleep 180;
kubectl describe po nginx
# Result: ErrImagePull

kubectl delete po nginx

# Try the `--attach-acr` method, which is the intended approach
az aks update -n ${aks} -g ${rG} --attach-acr ${acr}

# Deploy Pod again
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: ${acr}.azurecr.io/nginx
    imagePullPolicy: IfNotPresent
EOF

# Wait ~3 minutes for the new node to be provisioned, then check the result
sleep 180;
kubectl describe po nginx

# Still failed

Debug:
Go inside the node:

root [ / ]# crictl pull acrauto2462.azurecr.io/nginx
E0807 18:18:39.889144   21228 remote_image.go:180] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"acrauto2462.azurecr.io/nginx:latest\": failed to resolve reference \"acrauto2462.azurecr.io/nginx:latest\": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://acrauto2462.azurecr.io/oauth2/token?scope=repository%3Anginx%3Apull&service=acrauto2462.azurecr.io: 401 Unauthorized" image="acrauto2462.azurecr.io/nginx"
FATA[0000] pulling image: failed to pull and unpack image "acrauto2462.azurecr.io/nginx:latest": failed to resolve reference "acrauto2462.azurecr.io/nginx:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://acrauto2462.azurecr.io/oauth2/token?scope=repository%3Anginx%3Apull&service=acrauto2462.azurecr.io: 401 Unauthorized 

We can see that pulling the image directly with crictl fails as well.
I also tried the Ubuntu SKU, but it does not work either.

At this point I realized something, so I avoided the node created by Karpenter and used the system nodepool instead:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
spec:
  nodeSelector:
    kubernetes.azure.com/agentpool: nodepool1
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
  containers:
  - name: nginx
    image: ${acr}.azurecr.io/nginx
    imagePullPolicy: IfNotPresent
EOF

I don't know why the error below occurs. I also tried busybox, but it still cannot be created for the same reason shown below (it looks like deploying to the system nodepool is somehow not supported). The key point, though, is that the image can now be pulled, but only on the system nodepool.

kubectl logs nginx-test
exec /docker-entrypoint.sh: exec format error

comtalyst (Collaborator) commented
We have just discovered a suspect: the DisableKubeletCloudCredentialProviders feature gate for kubelet is set to true by default beginning with 1.29, and it looks like we haven't made the appropriate response to that yet.

This overall issue also does not seem to be present on 1.28 (from my reproduction attempt, at least), which further backs that claim.

Will give updates on the potential fix for this.
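If anyone wants to verify this on an affected node, a rough sketch (the flag and gate names are the upstream kubelet ones; file paths on AKS node images may differ):

# Inspect the running kubelet for the feature gate and any out-of-tree credential provider flags
pgrep -a kubelet | tr ' ' '\n' | grep -Ei 'DisableKubeletCloudCredentialProviders|image-credential-provider'
# The gate and provider settings may also live in the kubelet defaults/config files
grep -Ri 'credentialprovider' /etc/default/kubelet /var/lib/kubelet/config.yaml 2>/dev/null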

@comtalyst comtalyst added area/bootstrap Issues or PRs related to bootstrap area/security Issues or PRs related to security kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Aug 9, 2024
Bryce-Soghigian (Collaborator) commented
https://github.com/Azure/karpenter-provider-azure/blob/main/pkg/providers/imagefamily/bootstrap/aksbootstrap.go#L425

// CredentialProviderURL returns the URL for OOT credential provider,
// or an empty string if OOT provider is not to be used
func CredentialProviderURL(kubernetesVersion, arch string) string {
  minorVersion := semver.MustParse(kubernetesVersion).Minor
  if minorVersion < 30 { 
    return ""
  }

Looking at this code for the out-of-tree provider, we default to not including the out-of-tree provider settings when CredentialProviderURL returns an empty string.

DisableKubeletCloudCredentialProviders: default false (Alpha, 1.23 to 1.28)
DisableKubeletCloudCredentialProviders: default true (Beta, since 1.29)

We don't have that logic conditionally enabled for 1.29, so authenticated pulls will not work for that specific Kubernetes version.

This can be easily fixed by:

A) Switching 1.29 to use the out-of-tree credential provider (having Karpenter pass in the rest of the required OOT provider kubelet flags).
B) Defaulting the DisableKubeletCloudCredentialProviders feature gate to false for 1.29 clusters.

I believe it's best we use option A.
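For context, option A essentially means bootstrapping Karpenter nodes with the standard out-of-tree (exec plugin) credential provider wiring. A sketch of what that generally looks like at the kubelet level (the paths, provider name, and match patterns here are illustrative, not necessarily what the actual fix passes):

# Illustrative kubelet flags for an out-of-tree image credential provider
kubelet ... \
  --image-credential-provider-config=/var/lib/kubelet/credential-provider-config.yaml \
  --image-credential-provider-bin-dir=/var/lib/kubelet/credential-provider

# Illustrative CredentialProviderConfig written at bootstrap; real values come from aksbootstrap.go
cat <<'EOF' > /var/lib/kubelet/credential-provider-config.yaml
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: acr-credential-provider
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    matchImages:
      - "*.azurecr.io"
    defaultCacheDuration: "10m"
EOF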

Bryce-Soghigian (Collaborator) commented
The fix has been merged; it still needs to be released, so I'm keeping this open for tracking.

vikas-rajvanshy commented
@Bryce-Soghigian - do you know when this will be released? I am trying to triangulate if I should wait or go back to standard node pools for now.

Bryce-Soghigian (Collaborator) commented Sep 3, 2024

@vikas-rajvanshy It's in the current release that's rolling out. I believe you can track it via the AKS release tracker: https://releases.aks.azure.com/. The fix is part of the v20240827 release.
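If it helps others watching this, one quick way to confirm your nodes have picked up the new node image once the release reaches your region (a small sketch; the label is the standard AKS node-image label):

# Show each node's AKS node image version; images from the fixed release onward include the bootstrap change
kubectl get nodes -L kubernetes.azure.com/node-image-version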
