Fix flaky test for GracefulNodeShutdown #120728
/test ?
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
Force-pushed from a81d057 to 9ef1a04.
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
Force-pushed from 9ef1a04 to 0a96609.
Force-pushed from 0a96609 to d167315.
Force-pushed from 4b5e40c to 568a1a2.
Force-pushed from 568a1a2 to d59444e.
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
So far, this PR only prints more information. /cc @kwilczynski @kannon92 @bobbypage Do you have any ideas on this?
@pacoxu: GitHub didn't allow me to request PR reviews from the following users: kwilczynski. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Force-pushed from d59444e to be085a7.
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
framework.Logf("Running systemd version %d", systemdVersion)

err = checkInhibit()
Skip when dbus is not working.
My main concern is: will this end up skipping the tests most of the time?
It would be nice to run this check in an Eventually loop, so that we give it a chance to become ready before we skip.
Do we know why Dbus is either not running or is somewhat broken?
We recently removed the test that restarted Dbus, which was causing some, if not most, of the systemd service issues.
The systemd-inhibit command would only ever fail if it cannot connect to Dbus (including a socket activation failure) or if the connection timed out.
I wonder what the root cause might be, especially if something we do is causing problems with Dbus.
> My main concern is: will this end up skipping the tests most of the time?
> It would be nice to run this check in an Eventually loop, so that we give it a chance to become ready before we skip.

Done
> Do we know why Dbus is either not running or is somewhat broken?
> We recently removed the test that restarted Dbus, which was causing some, if not most, of the systemd service issues.
> The systemd-inhibit command would only ever fail if it cannot connect to Dbus (including a socket activation failure) or if the connection timed out.
> I wonder what the root cause might be, especially if something we do is causing problems with Dbus.

Maybe the environment has changed as a result of the migration to community clusters? I'm not sure, and I'm still diagnosing the cause.
@wzshiming, I see. It might well be.
Could the sizing of the nodes or allotted resources be an issue here? I see a lot of complaints about tests running into CPU limits. That said, I don't know how significant this is here.
The limit updates (two most recent ones):
- Failure cluster: Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval #118491
- e2e tests of "[sig-node] regular resource usage tracking" are failed #67621
Not sure if the above has any significance.
However, I am not sure how and why Dbus would be affected. Does it crash? Is it not started? I am not sure about Google's COS, but at least on Fedora CoreOS and Ubuntu, it's a centrepiece for systemd, so a lot of things would break - the most obvious (aside from other services like the systemd-inhibit) would be SSH access and sudo, which would become terribly slow if not inaccessible.
I looked at the build logs, but they capture very little OS-level information. I am not sure where to find OS-level logs, if they are available at all.
Should we bump the systemd log level too? Thoughts?
/test pull-kubernetes-node-kubelet-serial-containerd
Do we still need this PR, since #121506 merged?
Yes. That only removes one case, and the other cases are still flaky. This PR finds a way to determine whether dbus is currently working, and skips the test if it is not.
Force-pushed from be085a7 to abbe4d0.
/test pull-kubernetes-node-kubelet-serial-containerd
/retest-required
@wzshiming thanks for this fix. I think there was a recent kernel fix that resolved these issues: https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd I think we can close this PR. @kwilczynski and I noticed that this change was really just skipping the tests, which isn't what we want.
/close
@wzshiming: Closed this PR.
What type of PR is this?
/kind flake
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #120726
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: