Run CI tests with KASAN #12226
Comments
You can. It's straightforward, but slow like molasses in January. |
#4465 was discovered and fixed using KASAN. |
You can absolutely do this locally. All you need to do is build a KASAN-enabled kernel, then build ZFS as usual and run the test suite. The kernel documentation you linked to shows which CONFIG options need to be enabled. While you're at it, I'd also suggest enabling the kernel kmemleak checker. This is something I'd love to enable in the CI, but the last time we investigated it, the performance impact made it impractical. From what I've read, the performance is better with the latest kernels, but I don't know if that means it's fast enough to use in the CI environment. |
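Before kicking off a long test run, it can be worth confirming that the kernel you built really has the sanitizers on. What follows is only a rough sketch: the helper names are made up for illustration, and the `WANTED` list (`CONFIG_KASAN` plus `CONFIG_DEBUG_KMEMLEAK` for kmemleak) is an assumption, so check the kernel documentation for the options that apply to your kernel version.

```python
import re

# Assumed option list for illustration only; the kernel's KASAN and kmemleak
# documentation is the authority on what your kernel version needs.
WANTED = ("CONFIG_KASAN", "CONFIG_DEBUG_KMEMLEAK")


def parse_config(text):
    """Map option name -> value for a kernel .config; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_options(text, wanted=WANTED):
    """Return the wanted options that are not built in (=y)."""
    options = parse_config(text)
    return [name for name in wanted if options.get(name) != "y"]
```

Feeding it the contents of the config you built from (for example the file under /boot matching the running kernel) and getting an empty list back suggests both checkers are compiled in.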
Well, you could do that, but starting presumably with zstd's merging it will fail to compile unless you make dummy functions for (I ran into this with 4.19.194 and ffdf019, just for reference.) The exact patch I used is:
I'll probably eventually try getting a refined version of this merged, at a minimum with some edit to add: Interactively (..over SSH), with |
That's interesting, clearly I haven't tried this since we incorporated zstd! Thanks for posting the patch, it sounds like we'll want to incorporate some version of your change to sort the build out. It's also encouraging to hear your performance wasn't terrible. My recollection is that interactively it felt fine, but it at least doubled the total run time for the test suite. |
@rincebrain Are you saying you ran the ZTS (with ZFS version ffdf019) on a KASAN kernel, and had no memory corruption issues? |
Oh, no, I would definitely not say that... I just was looking for a specific problem when I tried building KASAN in (...yesterday), and hadn't tried running through ZTS at the time. I have gotten through a ZTS run, though indeed, with at least one KASAN complaint in syslog. I just haven't filed it yet. |
Could you give me that info? (Either email me directly, or just open the bug.) At a minimum, I'd like to sanity-check that I am able to reproduce it. (My ulterior motive here is that I believe that there is a memory corruption issue causing a bug I'm experiencing. That you're finding a memory corruption bug is a "good" sign that I can at least fix some bug of that type.) |
Sure, let me just identify which test(s) were involved and reproduce it on reboot... (I, too, started down this rabbit hole for such suspicions...) |
What about a ready-to-run automated test environment, instead of a continuous one? Every other test available in CI, but with KASAN enabled, that could be run manually or weekly. Not sure if the same applies to ASAN (#12216), or if it could be in continuous CI. In a probable evolution, the ready-to-run test would enable env vars to get more details from ASAN. |
I still think the overhead for KASAN when configured inline is probably low enough to permit CI usage, assuming A) enough runners for the rate of PR updates and B) increased runtime allowance for ZTS in it (because the overhead is, indeed, not zero, though I got sidetracked by non-KASAN tests before I measured a complete run with and without KASAN on the same commit). Though, I don't know what the thresholds for "too much" are, here - 1.5x runtime? Doubled? Tripled? Similar numbers for RAM on the runners? (In my limited experience, IIRC, using 4GB RAM with and without KASAN ended with the OOM killer murdering every process in the former case before finishing a ZTS run, though I would have expected ARC to be smaller and life to move on...) Though, since AFAICT none of {CentOS,Fedora,Debian,Ubuntu} ship a premade KASAN kernel package, this would require maintenance rebuilding that sometimes...though Linux makes custom kernel packages pretty simple, at least. |
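To make the "how much is too much" question concrete, here is a minimal sketch of comparing two run durations against a budget. The function names and the default `max_ratio` of 2.0 are hypothetical, and nobody in this thread has posted a measured baseline, so any numbers you plug in are your own.

```python
def parse_hms(text):
    """Convert an 'HH:MM:SS' duration into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def overhead_ratio(baseline, with_kasan):
    """Runtime of the KASAN run divided by the runtime of the plain run."""
    return parse_hms(with_kasan) / parse_hms(baseline)


def within_budget(baseline, with_kasan, max_ratio=2.0):
    """True if the KASAN run stays within the allowed slowdown (assumed 2x)."""
    return overhead_ratio(baseline, with_kasan) <= max_ratio
```

For example, `within_budget("01:00:00", "02:30:00")` is false under a 2x budget but true with `max_ratio=3.0`; the same comparison could be repeated for peak RAM on the runners.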
Apparently arm64 has a few features that make KASAN run better there. |
(Gonna move discussion from #12928 to stop flooding the poor PR.) So, I ran It took 05:06:03, came back with:
and logged three fun things in dmesg - one was #12230, the second was:
(whew, that was long, and I might have repeated a line or two that occurred 5+ times in a row) And the final one:
I can go find out which tests the latter two happened during if they're hard to repro for anyone. Some of the tests failed because I forgot to build scsi-debug into the kernel config. Whoops. |
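For matching kernel complaints to tests, a captured serial console log is probably the easiest thing to work from. The sketch below assumes the log interleaves the test suite's `Test: <path>` lines with kernel messages, that KASAN reports start with `BUG: KASAN:`, and that panics start with `Kernel panic - not syncing:`; the function name is made up, and the attribution is only as good as the interleaving.

```python
import re

TEST_LINE = re.compile(r"Test: (\S+)")
# Skip the closing "---[ end Kernel panic ... ]---" line via the lookbehind.
KERNEL_COMPLAINT = re.compile(
    r"(BUG: KASAN: .*|(?<!end )Kernel panic - not syncing: .*)"
)


def complaints_by_test(lines):
    """Pair each kernel complaint with the most recently started test."""
    current = None
    found = []
    for line in lines:
        test = TEST_LINE.search(line)
        if test:
            # Console mingling can leave a stray ']' glued to the path.
            current = test.group(1).rstrip("]")
        complaint = KERNEL_COMPLAINT.search(line)
        if complaint:
            found.append((current, complaint.group(1).strip()))
    return found
```

Calling `complaints_by_test(open("console.log"))` (file name hypothetical) returns `(test, message)` pairs; a `None` test means the complaint showed up before any test line was seen.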
It seems that I spoke too soon in #12928 (comment), because it got to
(The output mingling is as original from the console.) This seems to point to lua, which is as-expected (#12230), but reading through that it doesn't look like the kernel out-right panicked in that run? Here's the results (though, well, it panicked, so): zts-results.ecYPqF.gz |
I did have it panic once and say the stack was destroyed, though I didn't get a trace from why, when I gave it much less RAM than I thought I had; increasing it made it just complain. |
That's with |
Yeah, no kidding - I was using n=4 and 24 GB. |
Happened again (I filtered by
Results: zts-results.8XEhF3.gz (again, from the guest, which panicked; I guess I could network mount this, but that sounds like an amazing way to triple the run-time) This is my qemu cmdline (line-broken for your viewing pleasure; the smp configuration mimics the host, except the host is (a) NUMA, obviously, and (b) has twice as many cores/socket):
A pickle indeed. Maybe unballooning will help? (I doubt it from the trace, but it'd be fun. Otherwise I have no clue, since, well.) |
Novel. Last time I used the ballooning driver, it was with Xen 3, so I have no constructive input there. Here is the .config I used with my kASAN kernel, if you'd like to compare it to yours. |
Disabling the balloon seems to have no effect:
(It also hasn't changed QEMU's memory usage, so.); zts-results.lD6EIq.gz
In what is an ultimate basic bitch move, I just built the debian kernel packages but added CONFIG_KASAN=y where the original had "CONFIG_KASAN is unset", and installed them on a fresh sid strap: config-5.15.0-2-amd64.gz; I can upload the send of the image later, if there's interest.
Rudimentary analysis (git diff) reveals that they're almost entirely unrelated; grepping for KASAN shows this (-your kasan, +my debian):
 CONFIG_KASAN=y
 CONFIG_KASAN_GENERIC=y
-# CONFIG_KASAN_OUTLINE is not set
-CONFIG_KASAN_INLINE=y
+CONFIG_KASAN_OUTLINE=y
+# CONFIG_KASAN_INLINE is not set
 CONFIG_KASAN_STACK=y
 # CONFIG_KASAN_VMALLOC is not set
 # CONFIG_KASAN_MODULE_TEST is not set
I assume OUTLINE is the default, since I changed no other lines in the seed config. (It also seems prudent to note that I know jack squat about how these things would interact => not a clue what this realistically means.) |
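When two configs are otherwise unrelated, a KASAN-only comparison like the one above can be scripted instead of eyeballed. This is a sketch under simple assumptions: it looks only at lines containing `CONFIG_KASAN`, ignores ordering, and the function names are invented for illustration.

```python
def kasan_lines(config_text):
    """Return the set of .config lines that mention a CONFIG_KASAN option."""
    return {
        line.strip()
        for line in config_text.splitlines()
        if "CONFIG_KASAN" in line
    }


def kasan_diff(old_text, new_text):
    """List KASAN lines only in the old config ('-') or only in the new ('+')."""
    old, new = kasan_lines(old_text), kasan_lines(new_text)
    removed = [f"-{line}" for line in sorted(old - new)]
    added = [f"+{line}" for line in sorted(new - old)]
    return removed + added
```

Run against the two configs discussed here, it should report just the INLINE/OUTLINE swap, which makes it easier to see that this is the only KASAN-related difference between the two kernels.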
Yeah, mine was edited make defconfig, so unsurprising it didn't have much in common.
INLINE means, AIUI, what it says on the tin for KASAN - is it making actual calls for the kasan shims around everything, or is it inlining them and laughing at the bloat that ensues?
I could imagine actual calls everywhere would make a significant difference...
- Rich
…On Fri, Jan 7, 2022 at 12:05 PM наб ***@***.***> wrote:
Disabling the balloon seems to have no effect:
Test: /usr/local/share/zfs/zfs-tests/tests/functional/channel_program/lua_core/tst.return_recursive_table]
[ 1255.851278] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[ 1255.853384] CPU: 2 PID: 95047 Comm: txg_sync Tainted: P B OE 5.15.0-2-amd64 #1 Debian 5.11
[ 1255.855991] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 1255.858192] Call Trace:
[ 1255.858865] <TASK>
[ 1255.859455] dump_stack_lvl+0x46/0x5a
[ 1255.860457] panic+0x18b/0x389
[ 1255.861282] ? __warn_printk+0xf3/0xf3
[ 1255.862837] ? __schedule+0xca/0xf90
[ 1255.864394] ? schedule+0x30/0x120
[ 1255.865771] __schedule+0xf8b/0xf90
[ 1255.867219] ? trace_event_raw_event_hrtimer_start+0x1b0/0x1b0
[ 1255.869614] ? io_schedule_timeout+0xb0/0xb0
[ 1255.871356] ? x2apic_send_IPI+0x60/0x70
[ 1255.873033] schedule+0x6d/0x120
[ 1255.874413] schedule_timeout+0xe4/0x1f0
[ 1255.876030] ? usleep_range+0xe0/0xe0
[ 1255.877508] ? try_to_wake_up+0x392/0x910
[ 1255.879227] ? __bpf_trace_tick_stop+0xe0/0xe0
[ 1255.881019] ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
[ 1255.883442] ? __native_queued_spin_unlock+0x9/0x10
[ 1255.885475] ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
[ 1255.888179] __cv_timedwait_common+0x19e/0x2b0 [spl]
[ 1255.890329] ? __cv_wait_idle+0xd0/0xd0 [spl]
[ 1255.892265] ? recalc_sigpending+0x5a/0x70
[ 1255.893919] ? finish_wait+0x100/0x100
[ 1255.895497] ? mutex_unlock+0x80/0xd0
[ 1255.896864] ? bpobj_space+0x10c/0x120 [zfs]
[ 1255.900311] __cv_timedwait_idle+0x9a/0xe0 [spl]
[ 1255.902165] ? __cv_timedwait_sig+0x70/0x70 [spl]
[ 1255.903998] ? __bitmap_weight+0x71/0x90
[ 1255.905528] txg_sync_thread+0x24f/0x760 [zfs]
[ 1255.908229] ? kasan_set_track+0x1c/0x30
[ 1255.910077] ? txg_fini+0x300/0x300 [zfs]
[ 1255.913039] thread_generic_wrapper+0xa8/0xc0 [spl]
[ 1255.914773] ? __thread_exit+0x20/0x20 [spl]
[ 1255.916410] kthread+0x1d2/0x200
[ 1255.917541] ? set_kthread_struct+0x80/0x80
[ 1255.919008] ret_from_fork+0x22/0x30
[ 1255.920208] </TASK>
[ 1255.921244] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0x)
[ 1255.924842] ---[ end Kernel panic - not syncing: corrupted stack end detected inside scheduler ]---
(It also hasn't changed QEMU's memory usage, so.); zts-results.lD6EIq.gz
<https://github.com/openzfs/zfs/files/7830180/zts-results.lD6EIq.gz>
In what is an ultimate basic bitch move, I just built the debian kernel
packages but added CONFIG_KASAN=y where the original had "CONFIG_KASAN is
unset", and installed them on a fresh sid strap: config-5.15.0-2-amd64.gz
<https://github.com/openzfs/zfs/files/7830184/config-5.15.0-2-amd64.gz>;
I can upload the send of the image later, if there's interest.
Rudimentary analysis (git diff) reveals that they're almost entirely
unrelated; grepping for KASAN shows this (-your kasan, +my debian):
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
-# CONFIG_KASAN_OUTLINE is not set
-CONFIG_KASAN_INLINE=y
+CONFIG_KASAN_OUTLINE=y
+# CONFIG_KASAN_INLINE is not set
CONFIG_KASAN_STACK=y
# CONFIG_KASAN_VMALLOC is not set
# CONFIG_KASAN_MODULE_TEST is not set
I assume OUTLINE is the default, since I changed no other lines in the
seed config. (It also seems prudent to note that I know jack squat about
how these things would interact => not a clue what this realistically
means.)
|
Hm; quoth lib/Kconfig.kasan:
So, yes, OUTLINE is actual calls, and INLINE doubles .text. Although I wouldn't expect that to make a difference? |
Changed
And then this panic:
I'm running with 64G now but bumping to that made it decide that it's going to run like absolute shit; nevertheless:
and, indeed:
I can't necessarily give it any, uh, more RAM? (I mean I could, but I don't love the idea of swapping out my MX.) And the overall times don't seem to breach more than one CPU, anyway, so?
Here's a send of the image and qemu driver (the boot bundle needs extracted from /boot, or a bootloader installed; this also wants a scratch filesystem at /scratchpsko (i just did |
Curious. I'm wildly speculating that all the outline calls make it more vulnerable to something smashing it in ways it can't recover from? Or I keep getting lucky with my smashing not blowing up the world...I'll try the VM and see if it blows the same way for me, and if swapping the kernel around changes anything. |
Casual 2 cents from papa know-it-all. I just:
16 vCPU/10 GiB VM used, no memory problems (so far).
|
Describe the feature you would like to see added to OpenZFS
Can we run the zts/ztest CI with the Kernel Address Sanitizer (KASAN)?
How will this feature improve OpenZFS?
We are more likely to identify kernel memory corruption.
Additional context
#12216 does this for userland.
Can I do this myself by just compiling a kernel with KASAN enabled, and building ZFS as usual? Is there any documentation I should look into for this?