
mantle/kola: add detection for a kernel soft lockup #3450

Merged: 1 commit merged into coreos:main on May 3, 2023

Conversation

dustymabe
Member

Recently we saw a test with many soft lockup messages like:

```
[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]
[ 4159.780488] Modules linked in:
[ 4159.780787] CPU: 0 PID: 10 Comm: kworker/u2:0 Tainted: G             L    -------  ---  6.4.0-0.rc0.20230502git865fdb08197e.11.fc39.x86_64 #1
[ 4159.780787] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
[ 4159.780787] Workqueue: ftrace_check_wq ftrace_check_work_func
[ 4159.780787] RIP: 0010:get_symbol_pos+0x5d/0x140
[ 4159.780787] Code: 63 0c 8d 08 d0 6e a0 85 c9 79 0a 48 f7 d1 48 03 0d 50 c2 5a 01 48 39 cf 48 0f 42 f0 48 0f 43 d0 48 89 f0 48 29 d0 48 83 f8 01 <77> ca 48 85 d2 75 08 eb 4b 48 83 ea 01 74 45 8d 42 ff 48 98 48 63
[ 4159.780787] RSP: 0018:ffffb1970005bb80 EFLAGS: 00000202
[ 4159.780787] RAX: 0000000000000032 RBX: ffffffff9f22edb4 RCX: ffffffff9f22ed00
[ 4159.780787] RDX: 0000000000004436 RSI: 0000000000004468 RDI: ffffffff9f22edb4
[ 4159.780787] RBP: ffffb1970005bbce R08: 0000000000000000 R09: ffffb1970005bbc0
[ 4159.780787] R10: 0000000000000000 R11: 00000000000320a7 R12: 0000000000000000
[ 4159.780787] R13: 0000000000000000 R14: ffffb1970005bbc0 R15: 0000000000000000
[ 4159.780787] FS:  0000000000000000(0000) GS:ffff9fa07ec00000(0000) knlGS:0000000000000000
[ 4159.780787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4159.780787] CR2: ffff9fa04fe01000 CR3: 000000000e022000 CR4: 0000000000350ef0
[ 4159.780787] Call Trace:
[ 4159.780787]  <TASK>
[ 4159.780787]  kallsyms_lookup_buildid+0x4d/0x130
[ 4159.780787]  test_for_valid_rec+0x64/0xb0
[ 4159.780787]  ftrace_check_work_func+0x3b/0x60
[ 4159.780787]  process_one_work+0x1c7/0x3d0
[ 4159.780787]  worker_thread+0x51/0x390
[ 4159.780787]  ? __pfx_worker_thread+0x10/0x10
[ 4159.780787]  kthread+0xf7/0x130
[ 4159.780787]  ? __pfx_kthread+0x10/0x10
[ 4159.780787]  ret_from_fork+0x2c/0x50
[ 4159.780787]  </TASK>
```

Let's try to detect and report this.
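
As a rough sketch of what that detection amounts to (illustrative only; the function name and structure below are not the actual kola code), the watchdog message can be matched with a simple regular expression over the captured console output:

```go
package main

import (
	"fmt"
	"regexp"
)

// softLockupRE matches the watchdog message shown in the log above, e.g.
// "watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]".
var softLockupRE = regexp.MustCompile(`watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s!`)

// checkForSoftLockup returns every soft lockup line found in the captured
// console output (illustrative helper, not the kola API).
func checkForSoftLockup(console []byte) []string {
	return softLockupRE.FindAllString(string(console), -1)
}

func main() {
	sample := []byte("[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]\n")
	for _, m := range checkForSoftLockup(sample) {
		fmt.Println("kernel soft lockup detected:", m)
	}
}
```

The commit messages later in this thread refer to kola's console/journal checks in runTest(), so the real change essentially boils down to adding one more pattern to those checks.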
dustymabe enabled auto-merge (rebase) on May 3, 2023 16:31
dustymabe merged commit a487f53 into coreos:main on May 3, 2023
2 checks passed
dustymabe deleted the dusty-check-soft-lockup branch on May 3, 2023 18:41
@cgwalters
Member

This seems to have notably increased the flake rate in at least rpm-ostree's CI. https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/rpm-ostree/detail/PR-4413/1/pipeline is one example.

I think this can happen in general when a CI job is running on a heavily loaded system and qemu itself doesn't get scheduled quickly enough. It looks like kola runs default to 1 vCPU, so it'd be a matter of ensuring reserved capacity for the job.

@dustymabe
Member Author

> I think this can happen in general when a CI job is running on a heavily loaded system and qemu itself doesn't get scheduled quickly enough. It looks like kola runs default to 1 vCPU, so it'd be a matter of ensuring reserved capacity for the job.

After looking into the underlying issues for many flakes over the past few months, the kernel soft lockup was a common theme. You're right that a loaded system is likely the largest contributing factor to this showing up. I think knowing about it will help us realize that the system is loaded and perhaps make changes to either guarantee resources to the tests or run fewer tests in parallel.

For rpm-ostree you could set the CI to let any test that passes on re-run count as a success by updating this line to be:

kola(cosaDir: "${env.WORKSPACE}", extraArgs: "ext.rpm-ostree.*", parallel: nhosts, rerunSuccessArgs: "tags=all")

This would give you fewer flakes to look at, but it has the drawback that an intermittent failure could go undetected.

@jlebon
Member

jlebon commented May 16, 2023

Saw this flake in https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/fedora-coreos-config/detail/PR-2422/2/pipeline too. Would it make sense to demote this to a warning for now instead of failing the test outright? If the test times out but we see the warning, it's still useful information to help diagnose the flakiness.

Re. resource allocation, note we're already declaring the number of vCPUs we need and basing the kola parallelism on that. I guess we could try something like requesting 1.5 vCPUs for every 1 vCPU we intend to use?

I bet the larger issue though is I/O, for which there isn't any good k8s support yet (kubernetes/kubernetes#54923). We do all of the heavy I/O inside VMs (when building and testing), so we could use qemu's support for it (e.g. --qemu-mbps) ourselves. But without proper k8s support, it would be hard to share across namespaces.

@dustymabe
Member Author

> Would it make sense to demote this to a warning for now instead of failing the test outright? If the test times out but we see the warning, it's still useful information to help diagnose the flakiness.

Possibly, but I don't know if we have a mechanism for that. The only drawback I can think of is that waiting for the timeout means wasted time.

> I bet the larger issue though is I/O,

Exactly. In my mind we'd just need to somehow drop our parallelism if we were seeing this very often.

@dustymabe
Member Author

Another thing we could do is try to classify failures at runtime and then do something like `--allow-rerun-success failureClass=kernelsoftlockup`.
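
A minimal sketch of what runtime failure classification could look like (the class names and everything below are hypothetical, like the flag syntax above; this is not an existing kola interface):

```go
package main

import (
	"fmt"
	"regexp"
)

// failureClasses maps a console pattern to a hypothetical failure class name.
// The ordering matters: the first match wins.
var failureClasses = []struct {
	name string
	re   *regexp.Regexp
}{
	{"kernelsoftlockup", regexp.MustCompile(`watchdog: BUG: soft lockup`)},
	{"kernelpanic", regexp.MustCompile(`Kernel panic - not syncing`)},
}

// classifyFailure returns the first class whose pattern matches the console
// output, or "" if the failure does not match any known class.
func classifyFailure(console string) string {
	for _, fc := range failureClasses {
		if fc.re.MatchString(console) {
			return fc.name
		}
	}
	return ""
}

func main() {
	out := "[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s!"
	fmt.Println("failure class:", classifyFailure(out)) // -> kernelsoftlockup
}
```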

cgwalters added a commit to cgwalters/rpm-ostree that referenced this pull request May 18, 2023
To hopefully get fewer issues with soft lockups
cc coreos/coreos-assembler#3450 (comment)
@dustymabe
Member Author

@cgwalters I've opened a revert for this in #3481

dustymabe added a commit to dustymabe/coreos-assembler that referenced this pull request May 30, 2023
This patch enables us to have console checks that are non-fatal and
will print a log message to the screen and nothing more, which was
stated as desirable in [1].

This commit also re-works the implementation of the console/journal
checks in runTest() to deduplicate code. It has the side-effect of
making SkipConsoleWarnings apply to journal checks too, but I think
that's actually a benefit and not a negative.

[1] coreos#3450 (comment)
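
For illustration, a warning-only console check along the lines described above could be modeled roughly like this (a sketch under assumed names, not the actual kola types):

```go
package main

import (
	"fmt"
	"regexp"
)

// consoleCheck pairs a pattern with a severity; when fatal is false a match
// only produces a log line. These are sketch types, not kola's implementation.
type consoleCheck struct {
	desc  string
	match *regexp.Regexp
	fatal bool
}

var checks = []consoleCheck{
	{"kernel panic", regexp.MustCompile(`Kernel panic - not syncing`), true},
	{"kernel soft lockup", regexp.MustCompile(`watchdog: BUG: soft lockup`), false},
}

// scanConsole fails only on fatal matches; non-fatal matches are reported as
// warnings and nothing more.
func scanConsole(console string) error {
	for _, c := range checks {
		if !c.match.MatchString(console) {
			continue
		}
		if c.fatal {
			return fmt.Errorf("console check failed: %s", c.desc)
		}
		fmt.Printf("warning: %s found on console\n", c.desc)
	}
	return nil
}

func main() {
	err := scanConsole("[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s!")
	fmt.Println("fatal error:", err) // -> fatal error: <nil>
}
```

The point of the non-fatal path is that a match still leaves a trace in the log to help diagnose flakiness without failing the run outright.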
@dustymabe
Member Author

#3494 has a commit to re-enable this check, but smarter this time.

dustymabe added a commit that referenced this pull request May 31, 2023
This patch enables us to have console checks that are non-fatal and
will print a log message to the screen and nothing more, which was
stated as desirable in [1].

This commit also re-works the implementation of the console/journal
checks in runTest() to deduplicate code. It has the side-effect of
making SkipConsoleWarnings apply to journal checks too, but I think
that's actually a benefit and not a negative.

[1] #3450 (comment)