Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: supporting 1.30 bootstrap via enabling out of tree credential provider #370

Merged
merged 16 commits into from
Jun 4, 2024

Conversation

Bryce-Soghigian
Copy link
Collaborator

Fixes # NA

Description
Bootstrap for 1.30 will fail if we have the --azure-container-registry-config flag specified.

azureuser@aks-general-purpose-78r6l:~$ journalctl -u kubelet.service
---
May 10 08:31:26 aks-general-purpose-78r6l kubelet[10604]: E0510 08:29:13.285779   10604 run.go:74]"command failed" err="failed to parse kubelet flag: unknown flag: --azure-container-registry-config"
May 10 08:31:26 aks-general-purpose-78r6l kubelet[13211]: E0510 08:31:26.037146   13211 run.go:74] "command failed" err="failed to parse kubelet flag: unknown flag: --azure-container-registry-config"
May 10 08:31:26 aks-general-purpose-78r6l systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
May 10 08:31:26 aks-general-purpose-78r6l systemd[1]: kubelet.service: Failed with result 'exit-code'.

By removing this flag we get nodes for k8s version 1.30 to bootstrap successfully.

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

support for 1.30 node bootstrapping with karpenter

Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test

@Bryce-Soghigian
Copy link
Collaborator Author

Copying feedback on the PR with messy git history
tallaxes left a comment •
This particular fix looks good to me, with some suggestions (left comments), and the following caveats:

There are likely more changes related to the switch to out-of-tree providers (such as setting DisableKubeletCloudCredentialProviders feature gate depending on Kubernetes version), would make sense to package and test this all together
Need to actually test it on 1.30 - including ACR image pull. (Ideally this would be covered by E2E)
Need evidence of E2E tests passing

@Bryce-Soghigian
Copy link
Collaborator Author

Bryce-Soghigian commented May 24, 2024

There are likely more changes related to the switch to out-of-tree providers (such as setting DisableKubeletCloudCredentialProviders feature gate depending on Kubernetes version), would make sense to package and test this all together

We discussed offline, seems the feature gate is removed and defaulted correctly in the kubelet binaries so no need for change here.

Need to actually test it on 1.30 - including ACR image pull. (Ideally this would be covered by E2E)
I ran the e2e test paired with setting the k8s version in the k8s version func in image family. IT fails with the following errors

Warning FailedCreatePodSandBox 2m42s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<sandbox_id>": plugin type="cilium-cni" failed (add): failed to invoke delegated plugin ADD for IPAM: http request failed: Post "http://<ipam_service_address>": dial tcp <local_ip>:: connect: connection refused; failed to request IP address from CNS
Normal SandboxChanged 98s (x6 over 2m41s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 44s (x3 over 98s) kubelet Pulling image "<image_registry>/pause:3.6"
Warning Failed 44s (x3 over 88s) kubelet Failed to pull image "<image_registry>/pause:3.6": failed to pull and unpack image "<image_registry>/pause:3.6": failed to resolve reference "<image_registry>/pause:3.6": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://<image_registry>/oauth2/token?scope=repository%3Apause%3Apull&service=<image_registry>: 401 Unauthorized
Warning Failed 44s (x3 over 88s) kubelet Error: ErrImagePull
Normal BackOff 20s (x4 over 87s) kubelet Back-off pulling image "<image_registry>/pause:3.6"
Warning Failed 20s (x4 over 87s) kubelet Error: ImagePullBackOff

I need to investigate why and if there is more needed to supporting the out of tree providers work.

Need evidence of E2E tests passing

I ran the e2e tests here https://github.com/Azure/karpenter-provider-azure/actions/runs/9228266167. Ideally we could run them against 1.30 but as its not out yet no real way to test on 1.30 for apiserver from our e2e tests. Best we can do is set the kubelet version to 1.30.

@coveralls
Copy link

coveralls commented May 24, 2024

Pull Request Test Coverage Report for Build 9324673471

Details

  • 10 of 13 (76.92%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 97.761%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/providers/imagefamily/bootstrap/aksbootstrap.go 7 10 70.0%
Files with Coverage Reduction New Missed Lines %
pkg/cache/unavailableofferings.go 2 95.45%
Totals Coverage Status
Change from base Build 9074172591: -0.01%
Covered Lines: 36284
Relevant Lines: 37115

💛 - Coveralls

@Bryce-Soghigian Bryce-Soghigian changed the title feat: supporting 1.30 bootstrap via removing kubelet flag --azure-container-registry-config for 1.30 onwards feat: supporting 1.30 bootstrap via enabling out of tree credential provider May 28, 2024
@Bryce-Soghigian Bryce-Soghigian self-assigned this May 28, 2024
@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review May 28, 2024 21:13
@Bryce-Soghigian Bryce-Soghigian marked this pull request as draft May 28, 2024 21:13
@Bryce-Soghigian
Copy link
Collaborator Author

Pasting this here for reference: https://github.com/Azure/AgentBaker/pull/4258/files @rakechill for Sov cloud support, we will need to pass in AKS_CUSTOM_CLOUD_CONTAINER_REGISTRY_DNS_SUFFIX="{{- if IsAKSCustomCloud}}{{AKSCustomCloudContainerRegistryDNSSuffix}}{{end}}" added in this pr. If we don't migrate to the contract in time for GA that is.

Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test

Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test

@Bryce-Soghigian
Copy link
Collaborator Author

I ran the end to end suite for ACR pull locally.

##[endgroup]
< Exit [AfterEach] TOP-LEVEL - /Users/sillygoose/dev/focus/karpenter-provider-azure/test/suites/acr/suite_test.go:43 @ 05/30/24 14:39:58.037 (45ms)
• [369.389 seconds]

Ran 1 of 1 Specs in 369.598 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAcr (369.60s)
PASS
ok github.com/Azure/karpenter-provider-azure/test/suites/acr 370.152s

We pass with nodes that are set to 1.30 for their k8s version. We should be good to get this one in if CI passes here.

@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review May 30, 2024 22:31
Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, with minor suggestions

hack/toolchain.sh Outdated Show resolved Hide resolved
pkg/providers/imagefamily/bootstrap/aksbootstrap.go Outdated Show resolved Hide resolved
pkg/providers/instancetype/suite_test.go Outdated Show resolved Hide resolved
Co-authored-by: Alex Leites <18728999+tallaxes@users.noreply.github.com>
Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test

@Bryce-Soghigian Bryce-Soghigian merged commit 45e0c39 into Azure:main Jun 4, 2024
10 checks passed
Bryce-Soghigian added a commit that referenced this pull request Sep 12, 2024
…rovider (#370)

* feat: supporting 1.30 bootstrap via removing kubelet flag --azure-container-registry-config for 1.30 onwards

* feat: adding support for using out of tree credentials in karpenter

* test: adding additional validation in our testing for the flags for out of tree provider

* fix: adding back in cgroupsv2 from rebase

* fix: removing flag from base flags again

* fix: fixing tests to properly validate the kubelet flags

* fix: ci

* fix: adding CSE VAR to contract that was not cherry picked properly from other pr

* fix: 1.30 ci

* test(unit): fixing client version we get for bootstrap to use the same source of truth

* fix: ci lint

* Update pkg/providers/imagefamily/bootstrap/aksbootstrap.go

Co-authored-by: Alex Leites <18728999+tallaxes@users.noreply.github.com>

* chore: adding util for checking if we need to use OOT Credential

* chore: moving the OOTCredential CSE var to only be populated for 1.30+

* Update hack/toolchain.sh

Co-authored-by: Alex Leites <18728999+tallaxes@users.noreply.github.com>

---------

Co-authored-by: Alex Leites <18728999+tallaxes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants