mantle/kola: add detection for a kernel soft lockup #3450
Conversation
Recently we saw a test with many soft lockup messages like:

```
[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]
[ 4159.780488] Modules linked in:
[ 4159.780787] CPU: 0 PID: 10 Comm: kworker/u2:0 Tainted: G L ------- --- 6.4.0-0.rc0.20230502git865fdb08197e.11.fc39.x86_64 #1
[ 4159.780787] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
[ 4159.780787] Workqueue: ftrace_check_wq ftrace_check_work_func
[ 4159.780787] RIP: 0010:get_symbol_pos+0x5d/0x140
[ 4159.780787] Code: 63 0c 8d 08 d0 6e a0 85 c9 79 0a 48 f7 d1 48 03 0d 50 c2 5a 01 48 39 cf 48 0f 42 f0 48 0f 43 d0 48 89 f0 48 29 d0 48 83 f8 01 <77> ca 48 85 d2 75 08 eb 4b 48 83 ea 01 74 45 8d 42 ff 48 98 48 63
[ 4159.780787] RSP: 0018:ffffb1970005bb80 EFLAGS: 00000202
[ 4159.780787] RAX: 0000000000000032 RBX: ffffffff9f22edb4 RCX: ffffffff9f22ed00
[ 4159.780787] RDX: 0000000000004436 RSI: 0000000000004468 RDI: ffffffff9f22edb4
[ 4159.780787] RBP: ffffb1970005bbce R08: 0000000000000000 R09: ffffb1970005bbc0
[ 4159.780787] R10: 0000000000000000 R11: 00000000000320a7 R12: 0000000000000000
[ 4159.780787] R13: 0000000000000000 R14: ffffb1970005bbc0 R15: 0000000000000000
[ 4159.780787] FS:  0000000000000000(0000) GS:ffff9fa07ec00000(0000) knlGS:0000000000000000
[ 4159.780787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4159.780787] CR2: ffff9fa04fe01000 CR3: 000000000e022000 CR4: 0000000000350ef0
[ 4159.780787] Call Trace:
[ 4159.780787]  <TASK>
[ 4159.780787]  kallsyms_lookup_buildid+0x4d/0x130
[ 4159.780787]  test_for_valid_rec+0x64/0xb0
[ 4159.780787]  ftrace_check_work_func+0x3b/0x60
[ 4159.780787]  process_one_work+0x1c7/0x3d0
[ 4159.780787]  worker_thread+0x51/0x390
[ 4159.780787]  ? __pfx_worker_thread+0x10/0x10
[ 4159.780787]  kthread+0xf7/0x130
[ 4159.780787]  ? __pfx_kthread+0x10/0x10
[ 4159.780787]  ret_from_fork+0x2c/0x50
[ 4159.780787]  </TASK>
```

Let's try to detect and report this.
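The detection could be a simple pattern match against the console output. Here is a minimal sketch in Go (the language kola is written in); `hasSoftLockup` and `softLockupRe` are hypothetical names for illustration, not kola's actual API:

```go
package main

import (
	"fmt"
	"regexp"
)

// softLockupRe matches the kernel watchdog soft lockup message, e.g.
// "watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]"
var softLockupRe = regexp.MustCompile(`watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s!`)

// hasSoftLockup reports whether the captured console output contains a
// soft lockup message. (Hypothetical helper; in kola such a pattern would
// be registered alongside the other console "bad line" checks.)
func hasSoftLockup(console string) bool {
	return softLockupRe.MatchString(console)
}

func main() {
	line := "[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]"
	fmt.Println(hasSoftLockup(line)) // true for the log above
}
```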
This seems to have notably increased the flake rate in at least rpm-ostree's CI. https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/rpm-ostree/detail/PR-4413/1/pipeline is one example. I think this can happen in general when a CI job is running on a heavily loaded system and qemu itself doesn't get scheduled quickly enough. It looks like kola runs default to 1 vCPU, so it'd be a matter of ensuring reserved capacity for the job.
After looking into the underlying issues for many flakes in the past few months, the kernel soft lockup was a common theme. You're right that a loaded system is likely the largest contributing factor to this showing up. I think knowing about it will help us realize that the system is loaded and perhaps try to make changes to either guarantee resources to the tests or run fewer tests in parallel.
This would allow you to look at fewer flakes, but has the drawback that an intermittent failure could go undetected.
Saw this flake in https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/fedora-coreos-config/detail/PR-2422/2/pipeline too. Would it make sense to demote this to a warning for now instead of failing the test outright? If the test times out but we see the warning, it's still useful information to help diagnose the flakiness. Re. resource allocation, note we're already declaring the number of vCPUs we need and basing the kola parallelism on that. I guess we could try something like requesting 1.5 vCPUs for every 1 vCPU we intend to use? I bet the larger issue though is I/O, for which there isn't any good k8s support yet (kubernetes/kubernetes#54923). We do all of the heavy I/O inside VMs (when building and testing), so we could use qemu's support for it (e.g.
Possibly, but I don't know if we have a mechanism for that. The only drawback I can think of is that waiting for the timeout means wasted time.
Exactly. In my mind we'd just need to somehow drop our parallelism if we were seeing this very often.
Another thing we could do is try to classify failures at runtime and then do something like |
To hopefully get fewer issues with soft lockups cc coreos/coreos-assembler#3450 (comment)
@cgwalters I've opened a revert for this in #3481
This patch enables us to have console checks that are non-fatal and will print a log message to the screen and nothing more, which was stated as desirable in [1]. This commit also re-works the implementation of the console/journal checks in runTest() to deduplicate code. It has the side-effect of making SkipConsoleWarnings apply to journal checks too, but I think that's actually a benefit and not a negative. [1] coreos#3450 (comment)
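The non-fatal behavior described in that commit could be sketched as a table of checks where each pattern carries a fatal flag; the `consoleCheck` type and `scanConsole` helper below are illustrative names under that assumption, not kola's real implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// consoleCheck pairs a pattern with whether a match should fail the test
// or only log a warning. (Illustrative struct; not kola's actual type.)
type consoleCheck struct {
	desc  string
	re    *regexp.Regexp
	fatal bool
}

var checks = []consoleCheck{
	{"kernel panic", regexp.MustCompile(`Kernel panic - not syncing`), true},
	// Soft lockups are demoted to a warning: report them, but don't fail.
	{"kernel soft lockup", regexp.MustCompile(`watchdog: BUG: soft lockup`), false},
}

// scanConsole returns an error only for fatal matches; non-fatal matches
// just print a log message and nothing more.
func scanConsole(console string) error {
	for _, c := range checks {
		if c.re.MatchString(console) {
			if c.fatal {
				return fmt.Errorf("console check failed: %s", c.desc)
			}
			fmt.Printf("warning: detected %s on console\n", c.desc)
		}
	}
	return nil
}

func main() {
	err := scanConsole("[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s!")
	fmt.Println("fatal:", err != nil) // fatal: false
}
```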
#3494 has a commit to re-enable this check, but smarter this time.