
mantle/kola: add detection for a kernel soft lockup #3450

Merged: 1 commit merged into coreos:main on May 3, 2023

Conversation

dustymabe
Member

Recently we saw a test with many soft lockup messages like:

```
[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]
[ 4159.780488] Modules linked in:
[ 4159.780787] CPU: 0 PID: 10 Comm: kworker/u2:0 Tainted: G             L    -------  ---  6.4.0-0.rc0.20230502git865fdb08197e.11.fc39.x86_64 #1
[ 4159.780787] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
[ 4159.780787] Workqueue: ftrace_check_wq ftrace_check_work_func
[ 4159.780787] RIP: 0010:get_symbol_pos+0x5d/0x140
[ 4159.780787] Code: 63 0c 8d 08 d0 6e a0 85 c9 79 0a 48 f7 d1 48 03 0d 50 c2 5a 01 48 39 cf 48 0f 42 f0 48 0f 43 d0 48 89 f0 48 29 d0 48 83 f8 01 <77> ca 48 85 d2 75 08 eb 4b 48 83 ea 01 74 45 8d 42 ff 48 98 48 63
[ 4159.780787] RSP: 0018:ffffb1970005bb80 EFLAGS: 00000202
[ 4159.780787] RAX: 0000000000000032 RBX: ffffffff9f22edb4 RCX: ffffffff9f22ed00
[ 4159.780787] RDX: 0000000000004436 RSI: 0000000000004468 RDI: ffffffff9f22edb4
[ 4159.780787] RBP: ffffb1970005bbce R08: 0000000000000000 R09: ffffb1970005bbc0
[ 4159.780787] R10: 0000000000000000 R11: 00000000000320a7 R12: 0000000000000000
[ 4159.780787] R13: 0000000000000000 R14: ffffb1970005bbc0 R15: 0000000000000000
[ 4159.780787] FS:  0000000000000000(0000) GS:ffff9fa07ec00000(0000) knlGS:0000000000000000
[ 4159.780787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4159.780787] CR2: ffff9fa04fe01000 CR3: 000000000e022000 CR4: 0000000000350ef0
[ 4159.780787] Call Trace:
[ 4159.780787]  <TASK>
[ 4159.780787]  kallsyms_lookup_buildid+0x4d/0x130
[ 4159.780787]  test_for_valid_rec+0x64/0xb0
[ 4159.780787]  ftrace_check_work_func+0x3b/0x60
[ 4159.780787]  process_one_work+0x1c7/0x3d0
[ 4159.780787]  worker_thread+0x51/0x390
[ 4159.780787]  ? __pfx_worker_thread+0x10/0x10
[ 4159.780787]  kthread+0xf7/0x130
[ 4159.780787]  ? __pfx_kthread+0x10/0x10
[ 4159.780787]  ret_from_fork+0x2c/0x50
[ 4159.780787]  </TASK>
```

Let's try to detect and report this.
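
As a rough sketch of what that detection amounts to (illustrative only; the function name and structure below are not the actual kola code), the watchdog message can be matched with a simple regular expression over the captured console output:

```go
package main

import (
	"fmt"
	"regexp"
)

// softLockupRE matches the watchdog message shown in the log above, e.g.
// "watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]".
var softLockupRE = regexp.MustCompile(`watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s!`)

// checkForSoftLockup returns every soft lockup line found in the captured
// console output (illustrative helper, not the kola API).
func checkForSoftLockup(console []byte) []string {
	return softLockupRE.FindAllString(string(console), -1)
}

func main() {
	sample := []byte("[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s! [kworker/u2:0:10]\n")
	for _, m := range checkForSoftLockup(sample) {
		fmt.Println("kernel soft lockup detected:", m)
	}
}
```

The commit messages later in this thread refer to kola's console/journal checks in runTest(), so the real change essentially boils down to adding one more pattern to those checks.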
dustymabe enabled auto-merge (rebase) on May 3, 2023 16:31
dustymabe merged commit a487f53 into coreos:main on May 3, 2023
2 checks passed
dustymabe deleted the dusty-check-soft-lockup branch on May 3, 2023 18:41
@cgwalters
Member

This seems to have notably increased the flake rate in at least rpm-ostree's CI. https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/rpm-ostree/detail/PR-4413/1/pipeline is one example.

I think this can happen in general when a CI job is running on a heavily loaded system and qemu itself doesn't get scheduled quickly enough. It looks like kola runs default to 1 vCPU, so it'd be a matter of ensuring reserved capacity for the job.

@dustymabe
Member Author

> I think this can happen in general when a CI job is running on a heavily loaded system and qemu itself doesn't get scheduled quickly enough. It looks like kola runs default to 1 vCPU, so it'd be a matter of ensuring reserved capacity for the job.

After looking into the underlying issues for many flakes over the past few months, the kernel soft lockup was a common theme. You're right that a loaded system is likely the largest contributing factor to this showing up. I think knowing about it will help us realize that the system is loaded and perhaps make changes to either guarantee resources to the tests or run fewer tests in parallel.

For rpm-ostree you could set the CI to let any test that passes on re-run count as a success by updating this line to be:

kola(cosaDir: "${env.WORKSPACE}", extraArgs: "ext.rpm-ostree.*", parallel: nhosts, rerunSuccessArgs: "tags=all")

This would give you fewer flakes to look at, but it has the drawback that an intermittent failure could go undetected.

@jlebon
Member

jlebon commented May 16, 2023

Saw this flake in https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/fedora-coreos-config/detail/PR-2422/2/pipeline too. Would it make sense to demote this to a warning for now instead of failing the test outright? If the test times out but we see the warning, it's still useful information to help diagnose the flakiness.

Re. resource allocation, note we're already declaring the number of vCPUs we need and basing the kola parallelism on that. I guess we could try something like requesting 1.5 vCPUs for every 1 vCPU we intend to use?

I bet the larger issue though is I/O, for which there isn't any good k8s support yet (kubernetes/kubernetes#54923). We do all of the heavy I/O inside VMs (when building and testing), so we could use qemu's support for it (e.g. --qemu-mbps) ourselves. But without proper k8s support, it would be hard to share across namespaces.

@dustymabe
Member Author

> Would it make sense to demote this to a warning for now instead of failing the test outright? If the test times out but we see the warning, it's still useful information to help diagnose the flakiness.

Possibly, but I don't know if we have a mechanism for that. The only drawback I can think of is that waiting for the timeout means wasted time.

> I bet the larger issue though is I/O,

Exactly. In my mind we'd just need to somehow drop our parallelism if we were seeing this very often.

@dustymabe
Member Author

Another thing we could do is try to classify failures at runtime and then do something like `--allow-rerun-success failureClass=kernelsoftlockup`.
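
A minimal sketch of what runtime failure classification could look like (the class names and everything below are hypothetical, like the flag syntax above; this is not an existing kola interface):

```go
package main

import (
	"fmt"
	"regexp"
)

// failureClasses maps a console pattern to a hypothetical failure class name.
// The ordering matters: the first match wins.
var failureClasses = []struct {
	name string
	re   *regexp.Regexp
}{
	{"kernelsoftlockup", regexp.MustCompile(`watchdog: BUG: soft lockup`)},
	{"kernelpanic", regexp.MustCompile(`Kernel panic - not syncing`)},
}

// classifyFailure returns the first class whose pattern matches the console
// output, or "" if the failure does not match any known class.
func classifyFailure(console string) string {
	for _, fc := range failureClasses {
		if fc.re.MatchString(console) {
			return fc.name
		}
	}
	return ""
}

func main() {
	out := "[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s!"
	fmt.Println("failure class:", classifyFailure(out)) // -> kernelsoftlockup
}
```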

cgwalters added a commit to cgwalters/rpm-ostree that referenced this pull request May 18, 2023
To hopefully get fewer issues with soft lockups
cc coreos/coreos-assembler#3450 (comment)
@dustymabe
Member Author

@cgwalters I've opened a revert for this in #3481

dustymabe added a commit to dustymabe/coreos-assembler that referenced this pull request May 30, 2023
This patch enables us to have console checks that are non-fatal and
will print a log message to the screen and nothing more, which was
stated as desirable in [1].

This commit also re-works the implementation of the console/journal
checks in runTest() to deduplicate code. It has the side-effect of
making SkipConsoleWarnings apply to journal checks too, but I think
that's actually a benefit and not a negative.

[1] coreos#3450 (comment)
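
For illustration, a warning-only console check along the lines described above could be modeled roughly like this (a sketch under assumed names, not the actual kola types):

```go
package main

import (
	"fmt"
	"regexp"
)

// consoleCheck pairs a pattern with a severity; when fatal is false a match
// only produces a log line. These are sketch types, not kola's implementation.
type consoleCheck struct {
	desc  string
	match *regexp.Regexp
	fatal bool
}

var checks = []consoleCheck{
	{"kernel panic", regexp.MustCompile(`Kernel panic - not syncing`), true},
	{"kernel soft lockup", regexp.MustCompile(`watchdog: BUG: soft lockup`), false},
}

// scanConsole fails only on fatal matches; non-fatal matches are reported as
// warnings and nothing more.
func scanConsole(console string) error {
	for _, c := range checks {
		if !c.match.MatchString(console) {
			continue
		}
		if c.fatal {
			return fmt.Errorf("console check failed: %s", c.desc)
		}
		fmt.Printf("warning: %s found on console\n", c.desc)
	}
	return nil
}

func main() {
	err := scanConsole("[ 4159.779792] watchdog: BUG: soft lockup - CPU#0 stuck for 883s!")
	fmt.Println("fatal error:", err) // -> fatal error: <nil>
}
```

The point of the non-fatal path is that a match still leaves a trace in the log to help diagnose flakiness without failing the run outright.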
@dustymabe
Member Author

#3494 has a commit to re-enable this check, but smarter this time.

dustymabe added a commit that referenced this pull request May 31, 2023
This patch enables us to have console checks that are non-fatal and
will print a log message to the screen and nothing more, which was
stated as desirable in [1].

This commit also re-works the implementation of the console/journal
checks in runTest() to deduplicate code. It has the side-effect of
making SkipConsoleWarnings apply to journal checks too, but I think
that's actually a benefit and not a negative.

[1] #3450 (comment)