Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d #4746

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dlipovetsky
Copy link
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:

The boothook fetches sensitive user-data from an AWS service (Secrets Manager, or SSM Parameter Store). This PR changes the mechanism by the way this user-data is passed to cloud-init once it's fetched.

Previously, the boothook wrote the sensitive user-data to /etc/secret-userdata.txt, and cloud-init read it via an #include directive. Now, the boothook writes it to /etc/cloud/cloud.cfg.d/99_kubeadm_bootstrap.cfg. The directory is a well-documented configuration source used by cloud-init, and exists wherever cloud-init is installed. The file is given the prefix 99_ to give it high priority over other configuration in that directory.

Previously, cloud-init read sensitive user-data from /etc/secret-userdata.txt via an #include directive. Now, it reads the sensitive user-data simply because it is located in the /etc/cloud/cloud.cfg.d directory. Therefore, the #include directive is no longer used, and is removed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4745

Special notes for your reviewer:

If we merge this PR, we can revert the workaround introduced in kubernetes-sigs/image-builder#406.

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Changes the mechanism to pass sensitive user-data to cloud-init, making CAPA compatible with cloud-init v23.3 and newer.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 18, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from dlipovetsky. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 18, 2024
@dlipovetsky
Copy link
Contributor Author

This change must be validated e2e. I've already tested it using my own AWS account, so I'm confident it will pass e2e.

/test pull-cluster-api-provider-aws-e2e

@dlipovetsky
Copy link
Contributor Author

/cc @randomvariable You know this area. Tagging you, in case you have questions/concerns about this change.

@dlipovetsky
Copy link
Contributor Author

I'd like to backport this to supported release branches, too.

@richardcase
Copy link
Member

/milestone v2.4.0

@k8s-ci-robot k8s-ci-robot added this to the v2.4.0 milestone Jan 23, 2024
@AndiDog
Copy link
Contributor

AndiDog commented Jan 24, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2024
@faiq
Copy link
Contributor

faiq commented Jan 24, 2024

/retest

@faiq
Copy link
Contributor

faiq commented Jan 24, 2024

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

@faiq
Copy link
Contributor

faiq commented Jan 24, 2024

/lgtm

@faiq faiq removed their assignment Jan 24, 2024
@dlipovetsky
Copy link
Contributor Author

/retest

@Ankitasw
Copy link
Member

/test pull-cluster-api-provider-aws-e2e

@Ankitasw
Copy link
Member

Ankitasw commented Feb 1, 2024

@dlipovetsky looks like E2E tests needs to be fixed

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-apidiff-main

@dlipovetsky
Copy link
Contributor Author

Last e2e failure was due to reaching EventBridge resource quota. From the manager log:

E0204 12:33:36.396548       1 awscluster_controller.go:309] "non-fatal: failed to set up EventBridge" err="unable to create rule: LimitExceededException: The requested resource exceeds the maximum number allowed." controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8" namespace="functional-test-multi-az-nacdlz" name="functional-test-multi-az-k1r9x8" reconcileID="83e72c26-318f-4b1e-8575-eaddab4426f4" cluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8"

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e

1 similar comment
@nrb
Copy link
Contributor

nrb commented Feb 6, 2024

/test pull-cluster-api-provider-aws-e2e

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e

@dlipovetsky
Copy link
Contributor Author

I'm going to rebase on main, in case there have been some changes that affect e2e.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 15, 2024
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@dlipovetsky
Copy link
Contributor Author

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

This might be possible if we implement our own Part Handler that calls the Secrets or SSM service.

@dlipovetsky
Copy link
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@nrb
Copy link
Contributor

nrb commented Feb 20, 2024

/retest

@dlipovetsky
Copy link
Contributor Author

I still have no idea why the same 7 tests consistently fail.

@nrb
Copy link
Contributor

nrb commented Mar 4, 2024

/test pull-cluster-api-provider-aws-build-docker

@Ankitasw
Copy link
Member

Ankitasw commented Mar 7, 2024

@dlipovetsky maybe you could try rebasing the PR and then run the E2E tests?

This allows cloud-init to read the user-data without using an #include,
which always fails when cloud-init first runs.
@dlipovetsky dlipovetsky force-pushed the boothook-write-to-cloud.cfg.d branch from f96b742 to b21896e Compare March 8, 2024 18:37
@dlipovetsky
Copy link
Contributor Author

/retest

@dlipovetsky
Copy link
Contributor Author

/test pull-cluster-api-provider-aws-e2e

@k8s-ci-robot
Copy link
Contributor

@dlipovetsky: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-e2e b21896e link false /test pull-cluster-api-provider-aws-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dlipovetsky
Copy link
Contributor Author

I found the cause of the failing tests.

These tests use the images created for Kubernetes v1.25.3. They were created in October 2022. They have cloud-init v22.3.4. This version does not support Jinja templating for cloud-config sources in /etc/cloud/. This feature was added in v22.4.

The cloud-config created by CABPK happens to have a single instance of Jinja templating, for the kubeadm configuration:

name: '{{ ds.meta_data.local_hostname }}'

And because Jinja templating isn't supported, the kubeadm configuration is written out with the Jinja string, instead of the templated value, so the configuration is invalid:

# kubeadm init --config /run/kubeadm/kubeadm.yaml
nodeRegistration.name: Invalid value: "{{ ds.meta_data.local_hostname }}": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
To see the stack trace of this error execute with --v=5 or higher

I'm considering what to do.

@nrb
Copy link
Contributor

nrb commented Mar 15, 2024

@dlipovetsky Thanks for your persistence!

Could we create a new image with Kube 1.26 and cloud-init v22.4? Or maybe a 1.25 image with v22.4?

@dlipovetsky
Copy link
Contributor Author

@dlipovetsky Thanks for your persistence!

Could we create a new image with Kube 1.26 and cloud-init v22.4? Or maybe a 1.25 image with v22.4?

As soon as we're able to publish images into a CNCF-owned account (depends on kubernetes/k8s.io#6517).

--

I also want to take this opportunity to question why we have Jinja template strings in our cluster templates at all. If I remove that string, the AWS Cloud Provider fails, apparently

I0315 15:38:27.112256       1 node_controller.go:390] Initializing node ip-10-0-231-13 with cloud provider
E0315 15:38:27.204950       1 node_controller.go:212] error syncing 'ip-10-0-231-13': failed to get provider ID for node ip-10-0-231-13 at cloudprovider: fa
iled to get instance ID from cloud provider: instance not found, requeuing

And therefore the node never gets its .spec.ProviderID set, and the "uninitialized" taint is not removed from the node, blocking Pods from being scheduled...

@dlipovetsky
Copy link
Contributor Author

Turns out that defining a part handler will not help here. Cloud-init will fetch any "includes" before running part handlers. If the include is missing, cloud-init will fail.

@richardcase
Copy link
Member

/milestone v2.6.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v2.4.0, v2.6.0 Apr 26, 2024
@nrb
Copy link
Contributor

nrb commented Apr 29, 2024

We cannot merge this until #4746 (comment) is resolved.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2024
@richardcase
Copy link
Member

Based on the comment lets push this to the next milestone

/milestone v2.7.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v2.6.0, v2.7.0 Jul 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-priority release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Machine with cloud-init 23.3.0 or newer fails to join cluster
7 participants