
ClusterBuilder never becomes ready on AWS ECR private registry #1511

Closed
georgethebeatle opened this issue Feb 1, 2024 · 2 comments · Fixed by #1512

@georgethebeatle
Contributor

In Korifi we are trying to bump kpack to 0.13.1.

After the bump we noticed that the ClusterBuilder our Helm chart creates never becomes ready when deploying against a private Amazon ECR registry. With kpack 0.12.3 we do not see this problem.

The ClusterBuilder status below indicates that the registry is denying access to the kpack controller:

status:
  conditions:
  - lastTransitionTime: "2024-02-01T09:58:11Z"
    message: Builder has no latestImage
    reason: NoLatestImage
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-02-01T09:58:11Z"
    message: 'HEAD https://007801690126.dkr.ecr.eu-west-1.amazonaws.com/v2/eks-e2e-kpack-builder/manifests/latest:
      unexpected status code 401 Unauthorized (HEAD responses have no body, use GET
      for details)'
    reason: ReconcileFailed
    status: "False"
    type: UpToDate
  observedGeneration: 1
  stack: {}

Here is the related kpack-controller log:

{"level":"error","ts":"2024-02-01T09:58:11.770660587Z","logger":"controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"843bfcd","knative.dev/kind":"clusterbuilders.kpack.io","knative.dev/traceid":"f54d4504-c2ec-4657-a626-77dc7977af73","knative.dev/key":"cf-kpack-cluster-builder","duration":0.885850232,"error":"HEAD https://007801690126.dkr.ecr.eu-west-1.amazonaws.com/v2/eks-e2e-kpack-builder/manifests/latest: unexpected status code 401 Unauthorized (HEAD responses have no body, use GET for details)","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20230821102121-81e4ee140363/controller/controller.go:566\n
knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20230821102121-81e4ee140363/controller/controller.go:543\nknative.dev/pkg/controller.(*Im
pl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20230821102121-81e4ee140363/controller/controller.go:491"}

We are running the kpack controller with a service account that is mapped to an EKS IAM role with the following policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Action": [
				"ecr:BatchCheckLayerAvailability",
				"ecr:BatchDeleteImage",
				"ecr:BatchGetImage",
				"ecr:CompleteLayerUpload",
				"ecr:CreateRepository",
				"ecr:GetAuthorizationToken",
				"ecr:GetDownloadUrlForLayer",
				"ecr:InitiateLayerUpload",
				"ecr:ListImages",
				"ecr:PutImage",
				"ecr:UploadLayerPart"
			],
			"Effect": "Allow",
			"Resource": "*"
		}
	]
}

We have had no issues with this role so far (we double-checked that it works with kpack 0.12.3). We also tried giving the service account full ECR access by granting ecr:*, but it made no difference. This made us think that the credentials are somehow not being picked up by the code.

On AWS the credentials are injected into the pod environment by an AWS mutating webhook: it inspects the service account and, if it carries the eks.amazonaws.com/role-arn annotation, injects the role ARN and the web identity token file location as environment variables into every pod running with that service account. In our case this is the kpack controller pod, and we can see that the injection does happen:

spec:
  containers:
  - env:
    ...
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: AWS_REGION
      value: eu-west-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::007801690126:role/eks-e2e-ecr_access
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
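
For completeness, the only thing that triggers this injection is the annotation on the service account the controller runs with. A minimal sketch of what that looks like (the service account name and namespace below are placeholders; the role ARN is the one from the env vars above):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: controller        # placeholder: whichever service account the kpack controller pod uses
  namespace: kpack        # placeholder
  annotations:
    # this annotation is what the AWS webhook looks for before injecting the env vars shown above
    eks.amazonaws.com/role-arn: arn:aws:iam::007801690126:role/eks-e2e-ecr_access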

We suspected that the go-containerregistry dependency of kpack, which got bumped as part of 0.13.1, might somehow be failing to propagate this information to ECR, so we tried downgrading it from 0.17.0 to 0.16.1 and rebuilding the kpack images. Unfortunately, after replacing the controller image with our patched one, we observed the same behaviour.
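
(For the record, the downgrade attempt amounted to roughly the following, run from a kpack checkout; rebuilding and redeploying the controller image is omitted here:)

# pin go-containerregistry back to v0.16.1, then rebuild the controller image
go get github.com/google/go-containerregistry@v0.16.1
go mod tidy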

@chenbh
Contributor

chenbh commented Feb 1, 2024

Surprisingly, the true culprit is 1fcfca6 (hurrah git bisect). Still looking into why this broke it, but at least it gives us a starting place
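
(For anyone retracing this, the bisect was the standard workflow, sketched below; the good/bad refs are the releases from this issue and are illustrative, and each step needs the controller rebuilt and redeployed before marking the commit:)

git bisect start
git bisect bad v0.13.1    # ClusterBuilder never becomes ready against ECR
git bisect good v0.12.3   # last known good release
# build + deploy the controller at the commit git checks out, test against ECR,
# then run `git bisect good` or `git bisect bad` until git reports the first bad commit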

@chenbh
Contributor

chenbh commented Feb 2, 2024

This whole thing is caused by how AWS versions their SDKs: see aws/aws-sdk-go-v2#2370 (comment).

Because they use a different version per service, when one of their core libraries makes a backwards-incompatible change, the other libs need to be bumped to interop with it. Also, for whatever reason, their repo is named v2, but the release git tags are v1. Oh, and when they do make backwards-incompatible changes, they don't bump the major version as required by semver.
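
To make the version skew concrete: a consumer's go.mod ends up pulling one module per AWS service, all tagged v1.x even though the repo is aws-sdk-go-v2. The fragment below is only an illustrative sketch (the module paths are real, the version numbers are made up), showing how the core module and the per-service modules have to be kept mutually compatible by hand:

require (
	github.com/aws/aws-sdk-go-v2 v1.24.0                  // core runtime: "v2" repo, "v1" tags
	github.com/aws/aws-sdk-go-v2/config v1.26.0           // each of these must interop with the core above
	github.com/aws/aws-sdk-go-v2/credentials v1.16.0
	github.com/aws/aws-sdk-go-v2/service/ecr v1.24.0
	github.com/aws/aws-sdk-go-v2/service/ecrpublic v1.21.0
)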
