Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add e2e test for kubelet instance config #3126

Merged
merged 1 commit into from
Dec 16, 2024

Conversation

HirazawaUi
Copy link
Contributor

@HirazawaUi HirazawaUi commented Nov 18, 2024

Add e2e test for kubelet instance config.

relate: kubernetes/test-infra#33807

--- edit neolit123
xref

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kinder Issues to track work in the kinder tool labels Nov 18, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 18, 2024
@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from 0825734 to 2c7be5a Compare November 18, 2024 14:31
@HirazawaUi
Copy link
Contributor Author

/retest

@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch 3 times, most recently from 42dd92f to 7d1b378 Compare November 19, 2024 01:12
@HirazawaUi
Copy link
Contributor Author

/cc @neolit123

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should do the following (when the FG is enabled):

  • kubeadm init with "latest"
  • kubeadm join with "latest"
  • check that that instance config is on the nodes (it should be there)
  • delete the instance config from the nodes
  • kubeadm upgrade "latest" (it would be latest->latest technically)
  • check if the instance config is on the nodes (it should be there)

@HirazawaUi
Copy link
Contributor Author

i think we should do the following (when the FG is enabled):

  • kubeadm init with "latest"
  • kubeadm join with "latest"
  • check that that instance config is on the nodes (it should be there)
  • delete the instance config from the nodes
  • kubeadm upgrade "latest" (it would be latest->latest technically)
  • check if the instance config is on the nodes (it should be there)

Yes, this is more comprehensive. I'll make the revisions according to it later.

@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch 2 times, most recently from b390952 to 8dbb097 Compare November 28, 2024 13:52
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 28, 2024
@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from 8dbb097 to 00aebcc Compare November 28, 2024 13:57
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one comment. i think the test is fine, otherwise.

@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from 00aebcc to c47bde5 Compare November 29, 2024 15:13
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
thanks

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 29, 2024
@HirazawaUi
Copy link
Contributor Author

/hold

I just realized that we currently have no way to test this PR against the new tests added in kubernetes/test-infra#33807. I am conducting local tests, and once the testing is complete, the PR can be merged.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2024
@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from c47bde5 to 36b599d Compare November 30, 2024 12:25
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2024
@HirazawaUi
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2024
- |
set -x

for node in worker-{1..2}; do
Copy link
Contributor Author

@HirazawaUi HirazawaUi Nov 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to delete the instance kubelet configuration for worker nodes. For control-plane nodes, they must obtain the criSocket from either the annotation or the local instance kubelet configuration.

This behavior is expected: for control-plane nodes, the instance kubelet configuration file is generated only when kubeadm init is executed with the NodeLocalCRISocket feature gate enabled, and the kubeadm.alpha.kubernetes.io/cri-socket annotation will not added to the node object.

In this case, deleting the instance kubelet configuration file will result in an error state.

ref: https://github.com/kubernetes/kubernetes/blob/810e9e212ec5372d16b655f57b9231d8654a2179/cmd/kubeadm/app/cmd/upgrade/node.go#L193-L196

Copy link
Member

@neolit123 neolit123 Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't seem correct. the behavior should be the same for both workers and CP. if the FG is enabled the instance config should be written on all nodes. and in this step we want to remove it so that upgrade regenerates it if the FG is enabled. note workers fetch the ClusterConfiguration FGs from the kubeeadm-config CM.

why is there are difference here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both control plane nodes and worker nodes will write to the kubelet instance configuration. However, during the execution of kubeadm upgrade node, the control plane nodes will retrieve the CRI information either from the Node object or the kubelet instance configuration.

If the feature gate was enabled when the cluster was initialized using kubeadm init, the Node objects will not have the kubeadm.alpha.kubernetes.io/cri-socket annotation. If the kubelet instance configuration is deleted, an error will occur during kubeadm upgrade because the CRI information cannot be retrieved.

Copy link
Member

@neolit123 neolit123 Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, ok understand the situation.

i think we need two upgrade scenarios (tasks) in the workflow

  1. upgrading from FG enabled -> FG enabled
  • delete the annotation / make sure it's not there
  • instance-config is present
  • call upgrade
  • make sure instance-config is present
  1. upgrading from FG disabled -> FG enabled
  • delete instance config
  • apply the node annotations with kubectl
  • call upgrade
  • make sure instance-config is present

some other notes:

  • we should ideally also test that the kubeadm-flags.env file no longer has the containeruntime flag (that's still missing in k/k)
  • the same story should be the same for workers and control-plane nodes in the e2e
  • the tasks in the workflow should have unique names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. upgrading from FG disabled -> FG enabled

  • delete instance config
  • apply the node annotations with kubectl
  • call upgrade
  • make sure instance-config is present

It seems there is an issue. The instance config did not exist before the upgrade.

The correct process should be:

  • create the cluster with the feature gate disabled.
  • verify that the annotation exists and the instance config does not exist.
  • call the upgrade with the feature gate enabled.
  • verify that the annotation not exists and the instance config is present.

Copy link
Member

@neolit123 neolit123 Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to create the cluster with FG enabled to verify instance config will be created.
note both 1 and 2 are steps in the same kinder workflow file.

then my steps in 2 make sure that if the FG is enabled (in clusterconfiguration) but if the instance config is missing it would use the annotation or flags on disk values to migrate the user to start using the FG and instance config. is that not a supported case in the code in k/k?

kubeadm feature gates can also be enabled before upgrade. by the user manually editing the kubeadm-config CM and then "kubeadm upgrade" commands can execute differently based on the FG values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to create the cluster with FG enabled to verify instance config will be created.
note both 1 and 2 are steps in the same kinder workflow file.

Ah, sorry, I initially thought it needed two files, like this:

tasks:
- import: instance-config-enabled-upgrade-to-enabled-tasks.yaml
- import: instance-config-disabled-upgrade-to-enabled-tasks.yaml

I've got what you mean. Thank you for your patient explanation.

@pacoxu
Copy link
Member

pacoxu commented Dec 5, 2024

BTW, @neolit123 is it neccessary for us to add some CIs like https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features to enable all alpha or beta features during the upgrade and init? This is out of the scope of this PR.

- --name={{ .vars.clusterName }}
- --loglevel=debug
- --kubeadm-verbosity={{ .vars.kubeadmVerbosity }}
- --kubeadm-feature-gate="NodeLocalCRISocket=true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only init with the FG.

  • we can enable the FG before join and upgrade.

"kubeadm-config": func(c *status.Cluster, flags *RunOptions) error {
// Nb. this action is invoked automatically at kubeadm init/join time, but it is possible
// to invoke it separately as well
return KubeadmConfig(c, flags.kubeadmConfigVersion, flags.copyCertsMode, flags.discoveryMode, flags.featureGate, flags.encryptionAlgorithm, flags.upgradeVersion, c.K8sNodes().EligibleForActions()...)
},
"kubeadm-init": func(c *status.Cluster, flags *RunOptions) error {
return KubeadmInit(c, flags.usePhases, flags.copyCertsMode, flags.kubeadmConfigVersion, flags.patchesDir, flags.ignorePreflightErrors, flags.featureGate, flags.encryptionAlgorithm, flags.wait, flags.vLevel)
},
"kubeadm-join": func(c *status.Cluster, flags *RunOptions) error {
return KubeadmJoin(c, flags.usePhases, flags.copyCertsMode, flags.discoveryMode, flags.kubeadmConfigVersion, flags.patchesDir, flags.ignorePreflightErrors, flags.wait, flags.vLevel)
},
"kubeadm-upgrade": func(c *status.Cluster, flags *RunOptions) error {
return KubeadmUpgrade(c, flags.upgradeVersion, flags.patchesDir, flags.wait, flags.vLevel)
},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only calling the FG on init is fine, because it will be stored in the CM.
this plan will work #3126 (comment)

@neolit123
Copy link
Member

BTW, @neolit123 is it neccessary for us to add some CIs like https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features to enable all alpha or beta features during the upgrade and init? This is out of the scope of this PR.

i personally think that our current setup is better which is - one dedicated e2e for each FG.
if we have FGs that interact with each other we might want to allow kinder to enable >1 but that's more of a corner case.

@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from 36b599d to a021c22 Compare December 6, 2024 01:17

exit 0
timeout: 5m
- name: disable-features-gate
Copy link
Member

@neolit123 neolit123 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't disable the FG.

if the FG is enabled before upgrade manually by the user it should transform the cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't disable the FG.

I think whether the feature gate was enabled before the upgrade is not important to us.

For the new functionality we introduced, we won’t check whether the feature gate was enabled previously; we will only verify its annotation and instance-config.yaml.

Here, we simulate a cluster that has never used the NodeLocalCRISocket feature gate before. It has the kubeadm.alpha.kubernetes.io/cri-socket annotation and no instance-config.yaml. During the upgrade, we enable the feature gate and observe whether the subsequent behavior meets our expectations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I have misunderstood your meaning.

Copy link
Member

@neolit123 neolit123 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the test that we need explained again. some of these might need k/k code changes.

  • keep the FG enabled in the cluster
  • make sure the annotation is present
  • make sure that the instance config is not present
  • run upgrade
  • make sure that instance config was generated
  • make sure that the --container-runtime-endpoint flag is removed from the kubeadm-flags.env (needs k/k change)
  • make sure that the annotation was removed (needs k/k change)

this covers the scenario where users will enable the FG before upgrading and the same cluster did not have the FG enabled.
on upgrade kubeadm does behave differently if the FG is enabled.
https://github.com/kubernetes/kubernetes/pull/128031/files#diff-7bb5f1d4cb628a783d6b247da73a9ef0b93f793b62ed12478eb74d356f8809bfR128

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the test that we need explained again. some of these might need k/k code changes.

  • keep the FG enabled in the cluster
  • make sure the annotation is present
  • make sure that the instance config is not present
  • run upgrade
  • make sure that instance config was generated
  • make sure that the --container-runtime-endpoint flag is removed from the kubeadm-flags.env (needs k/k change)
  • make sure that the annotation was removed (needs k/k change)

I now understand the core of our disagreement. I should not disable the NodeLocalCRISocket feature gate in kubeadm-config before the kubeadm upgrade and then attempt to enable it during kubeadm upgrade using the --feature-gates parameter.

This is because kubeadm upgrade does not accept the --feature-gates parameter. For it, the only way to determine whether a feature gate is enabled is through the feature-gates defined in kubeadm-config.

I think the current test tasks already cover the relevant cases:

  • task-add-cri-socket-annotation: Ensure the annotation is present.
  • task-remove-config: Ensure the instance config is not present.
  • task-upgrade-again: Run the upgrade process.
  • task-check-config-after-second-upgrade: Ensure the instance config was generated.

But...., there is another issue to address. If we disable the NodeLocalCRISocket feature gate in kubeadm-config, then the task-check-config-after-second-upgrade should fail. This is because the --kubeadm-feature-gates="NodeLocalCRISocket=true" parameter added in task-upgrade-again is ineffective. I will investigate tomorrow why this is happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But...., there is another issue to address. If we disable the NodeLocalCRISocket feature gate in kubeadm-config, then the task-check-config-after-second-upgrade should fail. This is because the --kubeadm-feature-gates="NodeLocalCRISocket=true" parameter added in task-upgrade-again is ineffective. I will investigate tomorrow why this is happening.

The reason is that the scripts in check-config-after-second-upgrade and check-config always end with exit code 0 because I did not capture the return result of the commands executed inside the container, and the script always ends with exit 0.

I have removed the disable-features-gate task and modified all the scripts to capture the return values from the container. Please review it again to see if it now meets our expectations.

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, minus you don't need to pass the FG on anything but init.

also we need to add the two missing things after we update k/k:

  • on upgrade, remove --container-runtime-endpoint from kubeadm-flags.env check in this e2e check that it's not there anymore
  • after upgrade if the FG is enabled make sure that the annotation is also removed from all nodes

@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from a32e57c to 65235e1 Compare December 9, 2024 13:52
@HirazawaUi HirazawaUi force-pushed the add-instance-config-tests branch from 65235e1 to 9c4740c Compare December 9, 2024 13:56
@HirazawaUi
Copy link
Contributor Author

@neolit123 Do you think we can merge it now, or should we first fix the missing things in k/k?

@pacoxu
Copy link
Member

pacoxu commented Dec 16, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 16, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HirazawaUi, neolit123, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f56222e into kubernetes:main Dec 16, 2024
4 checks passed
@neolit123
Copy link
Member

@neolit123 Do you think we can merge it now, or should we first fix the missing things in k/k?

the ci is green. please find time to fix the two remaining issues in k/k. we should try to get them in 1.33 and update the e2e.

@HirazawaUi
Copy link
Contributor Author

@neolit123 Do you think we can merge it now, or should we first fix the missing things in k/k?

the ci is green. please find time to fix the two remaining issues in k/k. we should try to get them in 1.33 and update the e2e.

I checked the test logs, and everything matches expectations.

Many thanks to @neolit123 for the help! I will address the remaining issues right away.

@HirazawaUi HirazawaUi deleted the add-instance-config-tests branch December 29, 2024 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kinder Issues to track work in the kinder tool cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants