🐛 Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d #4746

dlipovetsky · 2024-01-18T22:45:21Z

What type of PR is this?
/kind bug

What this PR does / why we need it:

The boothook fetches sensitive user-data from an AWS service (Secrets Manager or SSM Parameter Store). This PR changes the mechanism by which this user-data is passed to cloud-init once it has been fetched.

Previously, the boothook wrote the sensitive user-data to /etc/secret-userdata.txt, and cloud-init read it via an #include directive. Now, the boothook writes it to /etc/cloud/cloud.cfg.d/99_kubeadm_bootstrap.cfg. The directory is a well-documented configuration source used by cloud-init, and exists wherever cloud-init is installed. The file name carries the prefix 99_ so that it takes priority over other configuration in that directory.

Because cloud-init now picks up the sensitive user-data simply by virtue of the file being located in the /etc/cloud/cloud.cfg.d directory, the #include directive is no longer needed and is removed.
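
A minimal sketch of the write step described above, not the actual boothook (which is not shown in this thread). It assumes the user-data has already been fetched and is available as bytes; the helper name `write_dropin`, the 0600 file mode, and the write-then-rename approach are illustrative assumptions.

```python
import os
import tempfile


def write_dropin(config_dir: str, name: str, user_data: bytes) -> str:
    """Write user_data to config_dir/name so that only the owner can read it.

    Hypothetical helper: the data is written to a temporary file in the same
    directory and then renamed, so a reader never sees a partial file.
    """
    final_path = os.path.join(config_dir, name)
    # mkstemp creates the file with mode 0600; the chmod below makes that explicit.
    fd, tmp_path = tempfile.mkstemp(dir=config_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(user_data)
        os.chmod(tmp_path, 0o600)
        # Rename within the same directory, replacing any existing file.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


if __name__ == "__main__":
    # Demonstrated against a temporary directory rather than /etc/cloud/cloud.cfg.d.
    with tempfile.TemporaryDirectory() as d:
        print(write_dropin(d, "99_kubeadm_bootstrap.cfg", b"#cloud-config\n"))
```

On a real machine the target directory would be /etc/cloud/cloud.cfg.d and the file name 99_kubeadm_bootstrap.cfg, as stated in the description.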

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4745

Special notes for your reviewer:

If we merge this PR, we can revert the workaround introduced in kubernetes-sigs/image-builder#406.

Checklist:

Release note:

Changes the mechanism to pass sensitive user-data to cloud-init, making CAPA compatible with cloud-init v23.3 and newer.

k8s-ci-robot · 2024-01-18T22:45:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from dlipovetsky. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dlipovetsky · 2024-01-18T23:00:12Z

This change must be validated e2e. I've already tested it using my own AWS account, so I'm confident it will pass e2e.

/test pull-cluster-api-provider-aws-e2e

dlipovetsky · 2024-01-19T02:07:34Z

/cc @randomvariable You know this area. Tagging you, in case you have questions/concerns about this change.

dlipovetsky · 2024-01-19T18:52:12Z

I'd like to backport this to supported release branches, too.

richardcase · 2024-01-23T14:46:48Z

/milestone v2.4.0

AndiDog · 2024-01-24T07:07:41Z

/lgtm

faiq · 2024-01-24T14:32:34Z

/retest

faiq · 2024-01-24T14:36:03Z

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

faiq · 2024-01-24T14:36:23Z

/lgtm

dlipovetsky · 2024-01-25T19:42:55Z

/retest

Ankitasw · 2024-01-31T10:35:47Z

/test pull-cluster-api-provider-aws-e2e

Ankitasw · 2024-02-01T06:20:47Z

@dlipovetsky looks like the E2E tests need to be fixed

richardcase · 2024-02-04T12:12:13Z

/test pull-cluster-api-provider-aws-e2e

richardcase · 2024-02-04T12:13:05Z

/test pull-cluster-api-provider-aws-apidiff-main

dlipovetsky · 2024-02-05T20:20:51Z

Last e2e failure was due to reaching EventBridge resource quota. From the manager log:

E0204 12:33:36.396548       1 awscluster_controller.go:309] "non-fatal: failed to set up EventBridge" err="unable to create rule: LimitExceededException: The requested resource exceeds the maximum number allowed." controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8" namespace="functional-test-multi-az-nacdlz" name="functional-test-multi-az-k1r9x8" reconcileID="83e72c26-318f-4b1e-8575-eaddab4426f4" cluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8"

richardcase · 2024-02-06T13:45:10Z

/test pull-cluster-api-provider-aws-e2e

nrb · 2024-02-06T20:34:20Z

/test pull-cluster-api-provider-aws-e2e

richardcase · 2024-02-14T15:14:07Z

/test pull-cluster-api-provider-aws-e2e

dlipovetsky · 2024-02-15T17:33:51Z

I'm going to rebase on main, in case there have been some changes that affect e2e.

k8s-ci-robot · 2024-02-15T17:34:28Z

New changes are detected. LGTM label has been removed.

dlipovetsky · 2024-02-15T17:37:04Z

> This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

This might be possible if we implement our own Part Handler that calls the Secrets or SSM service.

dlipovetsky · 2024-02-15T17:44:36Z

/test pull-cluster-api-provider-aws-e2e

nrb · 2024-02-20T16:48:13Z

/retest

dlipovetsky · 2024-02-21T00:34:46Z

I still have no idea why the same 7 tests consistently fail.

nrb · 2024-03-04T17:27:05Z

/test pull-cluster-api-provider-aws-build-docker

Ankitasw · 2024-03-07T07:03:24Z

@dlipovetsky maybe you could try rebasing the PR and then run the E2E tests?

dlipovetsky · 2024-03-11T15:46:19Z

/retest

dlipovetsky · 2024-03-11T19:27:26Z

/test pull-cluster-api-provider-aws-e2e

k8s-ci-robot · 2024-03-11T21:07:36Z

@dlipovetsky: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-aws-e2e	`b21896e`	link	false	`/test pull-cluster-api-provider-aws-e2e`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

dlipovetsky · 2024-03-14T23:20:32Z

I found the cause of the failing tests.

These tests use the images created for Kubernetes v1.25.3. They were created in October 2022. They have cloud-init v22.3.4. This version does not support Jinja templating for cloud-config sources in /etc/cloud/. This feature was added in v22.4.
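
To illustrate the version cutoff mentioned above, here is a small sketch that compares a cloud-init version string against v22.4. The function names are hypothetical, the cutoff is taken from this comment rather than from cloud-init's changelog, and the parser only handles plain dotted numeric versions (no distro suffixes).

```python
# Cutoff as stated in this comment: Jinja templating for cloud-config
# sources in /etc/cloud/ was added in v22.4.
JINJA_IN_ETC_CLOUD_SINCE = (22, 4)


def parse_version(version: str) -> tuple:
    """Turn 'v22.3.4' or '22.3.4' into (22, 3, 4). Hypothetical helper."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def supports_jinja_in_etc_cloud(version: str) -> bool:
    """Return True if the version is at or above the assumed cutoff."""
    return parse_version(version) >= JINJA_IN_ETC_CLOUD_SINCE


if __name__ == "__main__":
    for v in ("22.3.4", "22.4", "23.3.0"):
        print(v, supports_jinja_in_etc_cloud(v))
```

Under these assumptions, the v22.3.4 found in the test images falls below the cutoff.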

The cloud-config created by CABPK happens to have a single instance of Jinja templating, for the kubeadm configuration:

name: '{{ ds.meta_data.local_hostname }}'

And because Jinja templating isn't supported, the kubeadm configuration is written out with the Jinja string, instead of the templated value, so the configuration is invalid:

# kubeadm init --config /run/kubeadm/kubeadm.yaml
nodeRegistration.name: Invalid value: "{{ ds.meta_data.local_hostname }}": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
To see the stack trace of this error execute with --v=5 or higher
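
The rejection can be reproduced with the regex quoted in the error message. This is only a sketch: the function name is hypothetical, the example hostname is illustrative, and kubeadm's real validation may apply further checks (such as a length limit) that are not modeled here.

```python
import re

# Regex copied from the kubeadm error message above.
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_node_name(name: str) -> bool:
    """Return True if the whole string matches the quoted regex."""
    return RFC1123_SUBDOMAIN.fullmatch(name) is not None


if __name__ == "__main__":
    # The unrendered template string is rejected; a rendered hostname passes.
    print(is_valid_node_name("{{ ds.meta_data.local_hostname }}"))
    print(is_valid_node_name("ip-10-0-231-13"))
```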

I'm considering what to do.

nrb · 2024-03-15T16:19:46Z

@dlipovetsky Thanks for your persistence!

Could we create a new image with Kube 1.26 and cloud-init v22.4? Or maybe a 1.25 image with v22.4?

dlipovetsky · 2024-03-15T17:30:07Z

> @dlipovetsky Thanks for your persistence!
>
> Could we create a new image with Kube 1.26 and cloud-init v22.4? Or maybe a 1.25 image with v22.4?

As soon as we're able to publish images into a CNCF-owned account (depends on kubernetes/k8s.io#6517).

--

I also want to take this opportunity to question why we have Jinja template strings in our cluster templates at all. If I remove that string, the AWS Cloud Provider apparently fails:

I0315 15:38:27.112256       1 node_controller.go:390] Initializing node ip-10-0-231-13 with cloud provider
E0315 15:38:27.204950       1 node_controller.go:212] error syncing 'ip-10-0-231-13': failed to get provider ID for node ip-10-0-231-13 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing

And therefore the node never gets its .spec.ProviderID set, and the "uninitialized" taint is not removed from the node, blocking Pods from being scheduled...

dlipovetsky · 2024-03-22T00:53:53Z

Turns out that defining a part handler will not help here. Cloud-init will fetch any "includes" before running part handlers. If the include is missing, cloud-init will fail.

richardcase · 2024-04-26T16:06:48Z

/milestone v2.6.0

nrb · 2024-04-29T16:49:34Z

We cannot merge this until #4746 (comment) is resolved.

/hold

richardcase · 2024-07-23T06:51:20Z

Based on the comment, let's push this to the next milestone.

/milestone v2.7.0

k8s-ci-robot added the release-note, kind/bug, and cncf-cla: yes labels Jan 18, 2024

k8s-ci-robot added the needs-priority label Jan 18, 2024

k8s-ci-robot requested review from cnmcavoy and faiq January 18, 2024 22:45

k8s-ci-robot added the size/S label Jan 18, 2024

k8s-ci-robot added this to the v2.4.0 milestone Jan 23, 2024

k8s-ci-robot assigned AndiDog Jan 24, 2024

k8s-ci-robot added the lgtm label Jan 24, 2024

faiq approved these changes Jan 24, 2024

k8s-ci-robot assigned faiq Jan 24, 2024

faiq removed their assignment Jan 24, 2024

dlipovetsky force-pushed the boothook-write-to-cloud.cfg.d branch from 11875a8 to f96b742 February 15, 2024 17:34

k8s-ci-robot removed the lgtm label Feb 15, 2024

fix: Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d

b21896e

This allows cloud-init to read the user-data without using an #include, which always fails when cloud-init first runs.

dlipovetsky force-pushed the boothook-write-to-cloud.cfg.d branch from f96b742 to b21896e March 8, 2024 18:37

dlipovetsky mentioned this pull request Mar 22, 2024

Machine with cloud-init 23.3.0 or newer fails to join cluster #4745

Open

k8s-ci-robot modified the milestones: v2.4.0, v2.6.0 Apr 26, 2024

k8s-ci-robot added the do-not-merge/hold label Apr 29, 2024

dlipovetsky mentioned this pull request Apr 29, 2024

📖 Update userdata privacy doc with details on cloud-init and ignition #4965

Open

5 tasks

k8s-ci-robot modified the milestones: v2.6.0, v2.7.0 Jul 23, 2024
