Add e2e test for kubelet instance config #3126
Conversation
Force-pushed 0825734 to 2c7be5a
/retest
Force-pushed 42dd92f to 7d1b378
/cc @neolit123
i think we should do the following (when the FG is enabled):
- kubeadm init with "latest"
- kubeadm join with "latest"
- check that the instance config is on the nodes (it should be there)
- delete the instance config from the nodes
- kubeadm upgrade "latest" (it would be latest->latest technically)
- check if the instance config is on the nodes (it should be there)
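Below is a minimal bash sketch of those per-node checks, looping over the worker nodes like the worker-{1..2} loop that appears later in the workflow diff. The cluster/container naming and the /var/lib/kubelet/instance-config.yaml path are assumptions for illustration, not taken from this PR:

#!/usr/bin/env bash
# Sketch only: CLUSTER_NAME and the instance-config path are placeholders.
set -euxo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
INSTANCE_CONFIG="/var/lib/kubelet/instance-config.yaml"  # assumed path

# after init/join: the instance config should be present on every node
for node in "${CLUSTER_NAME}"-worker-{1..2}; do
  docker exec "${node}" test -f "${INSTANCE_CONFIG}"
done

# delete the instance config so that the upgrade has to regenerate it
for node in "${CLUSTER_NAME}"-worker-{1..2}; do
  docker exec "${node}" rm -f "${INSTANCE_CONFIG}"
done

# after "kubeadm upgrade": the instance config should be present again
for node in "${CLUSTER_NAME}"-worker-{1..2}; do
  docker exec "${node}" test -f "${INSTANCE_CONFIG}"
done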
kinder/ci/tools/update-workflows/templates/testinfra/kubeadm-kinder-instance-config.yaml (outdated review thread, resolved)
kinder/ci/tools/update-workflows/templates/workflows/instance-config-tasks.yaml (outdated review thread, resolved)
Yes, this is more comprehensive. I'll make the revisions accordingly later.
Force-pushed b390952 to 8dbb097
Force-pushed 8dbb097 to 00aebcc
just one comment. i think the test is fine, otherwise.
kinder/ci/tools/update-workflows/templates/workflows/instance-config-tasks.yaml (outdated review thread, resolved)
Force-pushed 00aebcc to c47bde5
/lgtm
/approve
thanks
/hold I just realized that we currently have no way to test this PR against the new tests added in kubernetes/test-infra#33807. I am conducting local tests, and once the testing is complete, the PR can be merged.
Force-pushed c47bde5 to 36b599d
/hold cancel
- |
  set -x
  for node in worker-{1..2}; do
We only need to delete the instance kubelet configuration for worker nodes. Control-plane nodes must obtain the criSocket from either the annotation or the local instance kubelet configuration.
This behavior is expected: for control-plane nodes, the instance kubelet configuration file is generated only when kubeadm init is executed with the NodeLocalCRISocket feature gate enabled, and in that case the kubeadm.alpha.kubernetes.io/cri-socket annotation will not be added to the node object.
Deleting the instance kubelet configuration file in this situation will therefore result in an error state.
that doesn't seem correct. the behavior should be the same for both workers and CP. if the FG is enabled, the instance config should be written on all nodes, and in this step we want to remove it so that upgrade regenerates it if the FG is enabled. note that workers fetch the ClusterConfiguration FGs from the kubeadm-config CM.
why is there a difference here?
Yes, both control plane nodes and worker nodes will write to the kubelet instance configuration. However, during the execution of kubeadm upgrade node, the control plane nodes will retrieve the CRI information either from the Node object or the kubelet instance configuration.
If the feature gate was enabled when the cluster was initialized using kubeadm init, the Node objects will not have the kubeadm.alpha.kubernetes.io/cri-socket annotation. If the kubelet instance configuration is deleted, an error will occur during kubeadm upgrade because the CRI information cannot be retrieved.
hm, ok, I understand the situation.
i think we need two upgrade scenarios (tasks) in the workflow
- upgrading from FG enabled -> FG enabled
- delete the annotation / make sure it's not there
- instance-config is present
- call upgrade
- make sure instance-config is present
- upgrading from FG disabled -> FG enabled
- delete instance config
- apply the node annotations with kubectl
- call upgrade
- make sure instance-config is present
some other notes:
- we should ideally also test that the kubeadm-flags.env file no longer has the container runtime flag, --container-runtime-endpoint (that's still missing in k/k)
- the story should be the same for workers and control-plane nodes in the e2e
- the tasks in the workflow should have unique names
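A rough bash sketch of the second scenario above (FG disabled -> FG enabled), under the assumption that the kinder node containers are reachable with docker exec, that kubectl can be run on the first control-plane node with the admin kubeconfig, and that the instance config lives at /var/lib/kubelet/instance-config.yaml; all names, paths, and the socket value are placeholders, not taken from the PR:

#!/usr/bin/env bash
set -euxo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
INSTANCE_CONFIG="/var/lib/kubelet/instance-config.yaml"   # assumed path
CRI_SOCKET="unix:///run/containerd/containerd.sock"       # assumed socket value
KCTL="docker exec ${CLUSTER_NAME}-control-plane-1 kubectl --kubeconfig=/etc/kubernetes/admin.conf"

for node in "${CLUSTER_NAME}"-worker-{1..2}; do
  # simulate a cluster that never used the feature gate:
  # no instance config on disk, cri-socket annotation present on the Node object
  docker exec "${node}" rm -f "${INSTANCE_CONFIG}"
  ${KCTL} annotate node "${node}" --overwrite \
    "kubeadm.alpha.kubernetes.io/cri-socket=${CRI_SOCKET}"
done

# ... the workflow would run its kubeadm upgrade task here ...

for node in "${CLUSTER_NAME}"-worker-{1..2}; do
  # after upgrading with the FG enabled, the instance config must be regenerated
  docker exec "${node}" test -f "${INSTANCE_CONFIG}"
done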
3. upgrading from FG disabled -> FG enabled
- delete instance config
- apply the node annotations with kubectl
- call upgrade
- make sure instance-config is present
It seems there is an issue. The instance config did not exist before the upgrade.
The correct process should be:
- create the cluster with the feature gate disabled.
- verify that the annotation exists and the instance config does not exist.
- call the upgrade with the feature gate enabled.
- verify that the annotation does not exist and the instance config is present.
we need to create the cluster with FG enabled to verify instance config will be created.
note both 1 and 2 are steps in the same kinder workflow file.
then my steps in 2 make sure that if the FG is enabled (in ClusterConfiguration) but the instance config is missing, it would use the annotation or the flags-on-disk values to migrate the user to start using the FG and instance config. is that not a supported case in the code in k/k?
kubeadm feature gates can also be enabled before upgrade by the user manually editing the kubeadm-config CM, and then the "kubeadm upgrade" commands can execute differently based on the FG values.
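For example, the gate can be switched on before the upgrade by editing the ClusterConfiguration embedded in that ConfigMap (sketch; the ConfigMap name and namespace are the standard kubeadm ones, and the kubeconfig is assumed to point at the cluster):

# open the ClusterConfiguration for editing and add, under featureGates:
#   NodeLocalCRISocket: true
kubectl -n kube-system edit configmap kubeadm-config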
we need to create the cluster with FG enabled to verify instance config will be created.
note both 1 and 2 are steps in the same kinder workflow file.
Ah, sorry, I initially thought it needed two files, like this:
tasks:
- import: instance-config-enabled-upgrade-to-enabled-tasks.yaml
- import: instance-config-disabled-upgrade-to-enabled-tasks.yaml
I've got what you mean. Thank you for your patient explanation.
BTW, @neolit123 is it necessary for us to add some CI jobs like https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features to enable all alpha or beta features during init and upgrade? This is out of the scope of this PR.
- --name={{ .vars.clusterName }}
- --loglevel=debug
- --kubeadm-verbosity={{ .vars.kubeadmVerbosity }}
- --kubeadm-feature-gate="NodeLocalCRISocket=true"
only init with the FG.
- we can enable the FG before join and upgrade.
kubeadm/kinder/pkg/cluster/manager/actions/actions.go
Lines 40 to 53 in 33b7812
"kubeadm-config": func(c *status.Cluster, flags *RunOptions) error { | |
// Nb. this action is invoked automatically at kubeadm init/join time, but it is possible | |
// to invoke it separately as well | |
return KubeadmConfig(c, flags.kubeadmConfigVersion, flags.copyCertsMode, flags.discoveryMode, flags.featureGate, flags.encryptionAlgorithm, flags.upgradeVersion, c.K8sNodes().EligibleForActions()...) | |
}, | |
"kubeadm-init": func(c *status.Cluster, flags *RunOptions) error { | |
return KubeadmInit(c, flags.usePhases, flags.copyCertsMode, flags.kubeadmConfigVersion, flags.patchesDir, flags.ignorePreflightErrors, flags.featureGate, flags.encryptionAlgorithm, flags.wait, flags.vLevel) | |
}, | |
"kubeadm-join": func(c *status.Cluster, flags *RunOptions) error { | |
return KubeadmJoin(c, flags.usePhases, flags.copyCertsMode, flags.discoveryMode, flags.kubeadmConfigVersion, flags.patchesDir, flags.ignorePreflightErrors, flags.wait, flags.vLevel) | |
}, | |
"kubeadm-upgrade": func(c *status.Cluster, flags *RunOptions) error { | |
return KubeadmUpgrade(c, flags.upgradeVersion, flags.patchesDir, flags.wait, flags.vLevel) | |
}, |
only calling the FG on init is fine, because it will be stored in the CM.
this plan will work #3126 (comment)
i personally think that our current setup is better, which is one dedicated e2e for each FG.
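As noted above, the feature gate passed at init is persisted in the ClusterConfiguration stored in the kubeadm-config ConfigMap, which later join and upgrade operations read. A quick sketch for confirming that from a kinder-style control-plane node container (the container name and the admin.conf invocation are assumptions):

# print the featureGates recorded in the ClusterConfiguration
docker exec "${CLUSTER_NAME}-control-plane-1" \
  kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system \
  get configmap kubeadm-config -o yaml | grep -A2 "featureGates"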
Force-pushed 36b599d to a021c22
kinder/ci/tools/update-workflows/templates/workflows/instance-config-tasks.yaml (two outdated review threads, resolved)
      exit 0
  timeout: 5m
- name: disable-features-gate
we shouldn't disable the FG.
if the FG is enabled before upgrade manually by the user it should transform the cluster.
- removing the annotation from each node object (maybe we can do this for 1.33)
- writing the instance-config.yaml on each node
we shouldn't disable the FG.
I think whether the feature gate was enabled before the upgrade is not important to us. For the new functionality we introduced, we won't check whether the feature gate was enabled previously; we will only verify its annotation and instance-config.yaml.
Here, we simulate a cluster that has never used the NodeLocalCRISocket feature gate before. It has the kubeadm.alpha.kubernetes.io/cri-socket annotation and no instance-config.yaml. During the upgrade, we enable the feature gate and observe whether the subsequent behavior meets our expectations.
Or perhaps I have misunderstood your meaning.
here is the test that we need explained again. some of these might need k/k code changes.
- keep the FG enabled in the cluster
- make sure the annotation is present
- make sure that the instance config is not present
- run upgrade
- make sure that instance config was generated
- make sure that the --container-runtime-endpoint flag is removed from the kubeadm-flags.env (needs k/k change)
- make sure that the annotation was removed (needs k/k change)
this covers the scenario where users will enable the FG before upgrading and the same cluster did not have the FG enabled.
on upgrade kubeadm does behave differently if the FG is enabled.
https://github.com/kubernetes/kubernetes/pull/128031/files#diff-7bb5f1d4cb628a783d6b247da73a9ef0b93f793b62ed12478eb74d356f8809bfR128
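A hedged bash sketch of those post-upgrade checks, once the k/k changes land. The instance-config path, the container names, and the kubectl-via-docker-exec invocation are assumptions; /var/lib/kubelet/kubeadm-flags.env and the kubeadm.alpha.kubernetes.io/cri-socket annotation are the standard kubeadm ones:

#!/usr/bin/env bash
set -euxo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
KCTL="docker exec ${CLUSTER_NAME}-control-plane-1 kubectl --kubeconfig=/etc/kubernetes/admin.conf"

for node in "${CLUSTER_NAME}"-worker-{1..2}; do
  # the upgrade must have regenerated the instance config (path assumed)
  docker exec "${node}" test -f /var/lib/kubelet/instance-config.yaml

  # kubeadm-flags.env must no longer carry the CRI endpoint flag (needs the k/k change)
  if docker exec "${node}" grep -q -- "--container-runtime-endpoint" /var/lib/kubelet/kubeadm-flags.env; then
    echo "unexpected --container-runtime-endpoint on ${node}"
    exit 1
  fi

  # the cri-socket annotation must have been removed (needs the k/k change)
  annotation=$(${KCTL} get node "${node}" \
    -o jsonpath='{.metadata.annotations.kubeadm\.alpha\.kubernetes\.io/cri-socket}')
  if [ -n "${annotation}" ]; then
    echo "annotation still present on ${node}: ${annotation}"
    exit 1
  fi
done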
here is the test that we need explained again. some of these might need k/k code changes.
- keep the FG enabled in the cluster
- make sure the annotation is present
- make sure that the instance config is not present
- run upgrade
- make sure that instance config was generated
- make sure that the --container-runtime-endpoint flag is removed from the kubeadm-flags.env (needs k/k change)
- make sure that the annotation was removed (needs k/k change)
I now understand the core of our disagreement. I should not disable the NodeLocalCRISocket feature gate in kubeadm-config before the kubeadm upgrade and then attempt to enable it during kubeadm upgrade using the --feature-gates parameter.
This is because kubeadm upgrade does not accept the --feature-gates parameter. For it, the only way to determine whether a feature gate is enabled is through the feature-gates defined in kubeadm-config.
I think the current test tasks already cover the relevant cases:
- task-add-cri-socket-annotation: ensure the annotation is present.
- task-remove-config: ensure the instance config is not present.
- task-upgrade-again: run the upgrade process.
- task-check-config-after-second-upgrade: ensure the instance config was generated.
But there is another issue to address. If we disable the NodeLocalCRISocket feature gate in kubeadm-config, then task-check-config-after-second-upgrade should fail, because the --kubeadm-feature-gates="NodeLocalCRISocket=true" parameter added in task-upgrade-again is ineffective. I will investigate tomorrow why this is happening.
But there is another issue to address. If we disable the NodeLocalCRISocket feature gate in kubeadm-config, then task-check-config-after-second-upgrade should fail, because the --kubeadm-feature-gates="NodeLocalCRISocket=true" parameter added in task-upgrade-again is ineffective. I will investigate tomorrow why this is happening.

The reason is that the scripts in check-config-after-second-upgrade and check-config always ended with exit code 0: I did not capture the return status of the commands executed inside the container, so the script always reached its final exit 0.
I have removed the disable-features-gate task and modified all the scripts to capture the return values from the container. Please review it again to see if it now meets our expectations.
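For reference, a small sketch of the pitfall being described here: the exit status of the command run inside the node container has to be captured and propagated explicitly (or the script has to use set -e), otherwise the workflow step still ends with its final exit 0 even when the in-container check failed. The container name and file path are placeholders:

#!/usr/bin/env bash
node="my-cluster-worker-1"   # placeholder container name

# run the check inside the node container and capture its exit status
docker exec "${node}" test -f /var/lib/kubelet/instance-config.yaml
status=$?
if [ "${status}" -ne 0 ]; then
  echo "instance config missing on ${node}"
  exit "${status}"
fi

echo "instance config present on ${node}"
exit 0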
Force-pushed a50135b to a32e57c
kinder/ci/tools/update-workflows/templates/workflows/instance-config-tasks.yaml (three outdated review threads, resolved)
SGTM, minus you don't need to pass the FG on anything but init.
also we need to add the two missing things after we update k/k:
- on upgrade, remove --container-runtime-endpoint from kubeadm-flags.env; check in this e2e that it's not there anymore
- after upgrade if the FG is enabled make sure that the annotation is also removed from all nodes
Force-pushed a32e57c to 65235e1
Force-pushed 65235e1 to 9c4740c
@neolit123 Do you think we can merge it now, or should we first fix the missing things in k/k?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: HirazawaUi, neolit123, pacoxu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
the ci is green. please find time to fix the two remaining issues in k/k. we should try to get them in 1.33 and update the e2e.
I checked the test logs, and everything matches expectations. Many thanks to @neolit123 for the help! I will address the remaining issues right away.
Add e2e test for kubelet instance config.
related: kubernetes/test-infra#33807
--- edit neolit123
xref