
HIGHMEM violation by kmap in zfs_uiomove_bvec_impl #15668

Open
sempervictus opened this issue Dec 14, 2023 · 7 comments
Labels: Component: Memory Management (kernel memory management) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@sempervictus
Contributor

System information

Type                  Version/Name
Distribution Name     arch
Distribution Version  current
Kernel Version        6.6 (grsec)
Architecture          x64
OpenZFS Version       2.2.2

Describe the problem you're observing

Upstream Linux doesn't guard memory nearly as well as grsecurity does, "allowing" a kmap where bv_offset + skip > PAGE_SIZE on x86_64; the same access would likely crash on x86. The "hacked together in a few minutes" fix (by much better hackers) currently keeping my system alive changes the kmap and memcpy calls to use PAGE_MASK arithmetic so we never cross the page boundary, but there should probably be a ZFS-correct solution here: we're occasionally performing very low-level memory operations "illegally" on kernels that don't enforce the constraint, the violation is bound to bite eventually, and it may have impacts on other architectures.
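
Roughly, the boundary-respecting pattern looks like the sketch below. This is a hypothetical illustration, not the actual patch; the helper name copy_from_bvec_page_safe, the read-only shape, and the assumption that a multi-page bvec's struct pages are contiguous are all mine (MIN is the stock SPL macro):

    static void
    copy_from_bvec_page_safe(void *dst, const struct bio_vec *bv,
        size_t skip, size_t len)
    {
        while (len > 0) {
            size_t off = bv->bv_offset + skip;
            /*
             * Page containing the next byte; assumes the bvec's
             * struct pages are contiguous (as for compound pages).
             */
            struct page *pg = bv->bv_page + (off >> PAGE_SHIFT);
            size_t poff = off & ~PAGE_MASK;  /* offset within that page */
            /* Clamp so the copy never runs off the one mapped page. */
            size_t chunk = MIN(len, PAGE_SIZE - poff);
            void *va = kmap_local_page(pg);

            memcpy(dst, (char *)va + poff, chunk);
            kunmap_local(va);

            dst = (char *)dst + chunk;
            skip += chunk;
            len -= chunk;
        }
    }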

Given that this bug was found in UIO memory management/movement code, I'm curious whether it poses any risk to data integrity under other runtime conditions, where tripping the fault wouldn't crash but would lose an IO (or a segment thereof), since the read-back could similarly kmap incorrectly.

Describe how to reproduce the problem

  1. Use a kernel which enforces the single-page kmap constraint (or write an assertion into the zfs_kmap* macros to do so; a sketch of such an assertion follows this list).
  2. Perform a large send/recv on the same system with heavily fragmented ZVOLs (that seems to be what tripped mine), or run a number of import/export passes on a pool containing such ZVOLs.
  3. Observe the violation.
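
One possible shape for that assertion (hypothetical: the existing zfs_kmap* wrappers only take a page, so the caller would have to pass the intended offset and length through something like the macro below; ASSERT3U is the stock ZFS assertion macro):

    /*
     * Hypothetical debug check: fail if an access of `len' bytes
     * starting at `off' would cross out of the single mapped page.
     */
    #define zfs_kmap_access_ok(off, len) \
        ASSERT3U(((off) & (PAGE_SIZE - 1)) + (len), <=, PAGE_SIZE)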

Include any warning/errors/backtraces from the system logs

[root@svl-arch00 ~]# zpool import dpool -t tpool -R /mnt/dst -N
[  155.381372][ T2338] BUG: unable to handle page fault for address: ffffffffff53d000 (fixmap[slot 706 local idx #31 cpu 1]+0x0/0x1000)
[  155.382255][ T2338] #PF: supervisor write access in kernel mode
[  155.382668][ T2338] #PF: error_code(0x0002) - not-present page
[  155.383092][ T2338] PGD 0xffff888002898ff8 0000000002ad1063 
[  155.383092][ T2338] P4D 0xffff888002898ff8 0000000002ad1063 
[  155.383092][ T2338] PUD 0xffff888002ad1ff8 0000000002bd5063 
[  155.383092][ T2338] PMD 0xffff888002bd5fd0 8000000002de7063 
[  155.383092][ T2338] PTE 0xffff888002de79e8 0000000000000000 0xffff888000000000
[  155.384988][ T2338] Oops: 0002 [#1] PREEMPT SMP
[  155.385281][ T2338] CPU: 2 PID: 2338 Comm: zvol Tainted: G                TN  6.6.6 #1 49413849afb8da965cdd4351c9c93e35cb1207f4
[  155.386001][ T2338] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  155.386636][ T2338] RIP: 0010:[<ffffffff824ebae9>] memcpy+0x9/0x20
[  155.387046][ T2338] Code: 00 c3 cc 0f 1f 40 00 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 48 b8 00 00 00 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 d1 <f3> a4 48 0f ba 2c 24 3f c3 cc 0f 1f 40 00 66 0f 1f 84 00 00 00 00
[  155.388274][ T2338] RSP: 0000:ffffc90002843c68 EFLAGS: 00010246
[  155.388659][ T2338] RAX: ffffffffff53c000 RBX: ffffc90002843dd8 RCX: 0000000000001000
[  155.389161][ T2338] RDX: 0000000000002000 RSI: ffff888160833000 RDI: ffffffffff53d000
[  155.389667][ T2338] RBP: ffffc90002843d40 R08: 0000000000000000 R09: ffff888160832000
[  155.390166][ T2338] R10: ffffc90002843d58 R11: ffff888162152728 R12: c0fa228c307599a0
[  155.390693][ T2338] R13: 0000000000002000 R14: 0000000000000000 R15: ffff8881591ca040
[  155.391194][ T2338] RAX: fixmap[slot 707 local idx #0 cpu 2]+0x0/0x1000
[  155.391626][ T2338] RBX: vmalloc[copy_process]+0x683/0x3880
[  155.391999][ T2338] RSI: zio_buf_comb_8192+0x1000/0x2000 [slab object]
[  155.392418][ T2338] RDI: fixmap[slot 706 local idx #31 cpu 1]+0x0/0x1000
[  155.392844][ T2338] RBP: vmalloc[copy_process]+0x683/0x3880
[  155.393200][ T2338] RSP: vmalloc[copy_process]+0x683/0x3880
[  155.393547][ T2338] R09: zio_buf_comb_8192+0x0/0x2000 [slab object]
[  155.393955][ T2338] R10: vmalloc[copy_process]+0x683/0x3880
[  155.394328][ T2338] R11: zio_cache+0x0/0x500 [slab object]
[  155.394687][ T2338] R15: autoslab_size_M_fops_P_block_fops_62_10_62_10_S_512_A_8_n_11+0x40/0x200 [slab object]
[  155.395331][ T2338] FS:  0000000000000000(0000) GS:ffff88823d400000(0000) knlGS:0000000000000000
[  155.395893][ T2338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  155.396289][ T2338] CR2: ffffffffff53d000 CR3: 0000000002898004 CR4: 0000000000760ef0 shadow CR4: 0000000000760ef0
[  155.396948][ T2338] CR2: fixmap[slot 706 local idx #31 cpu 1]+0x0/0x1000
[  155.397365][ T2338] PKRU: 55555554
[  155.397583][ T2338] ASID: 0003
[  155.397786][ T2338] Stack:
[  155.397963][ T2338]  ffffffff81942409 3f05dd73b10c3321 3f05dd73b10c257c 0000000000000000
[  155.398476][ T2338]  0000000000000010 0000000000002000 0000000000000000 ffff888120eb91c0
[  155.398982][ T2338]  ffff888120eb9288 0000000000002000 0000000000000000 0000000000002000
[  155.399504][ T2338] Call Trace:
[  155.399711][ T2338]  <TASK>
[  155.400045][ T2338]  [<ffffffff81942409>] zfs_uiomove_bvec_impl+0x159/0x430 ffffc90002843c68
[  155.400704][ T2338]  [<ffffffff8179bcdc>] ? dmu_read_uio_dnode+0xdc/0x170 ffffc90002843d48
[  155.401340][ T2338]  [<ffffffff8179bcdc>] dmu_read_uio_dnode+0xdc/0x170 ffffc90002843d50
[  155.401950][ T2338]  [<ffffffff8197147d>] zvol_read+0x14d/0x550 ffffc90002843da8
[  155.402514][ T2338]  [<ffffffff819718a1>] zvol_read_task+0x21/0x80 ffffc90002843e40
[  155.403099][ T2338]  [<ffffffff81668e90>] taskq_thread+0x3c0/0x780 ffffc90002843e58
[  155.403689][ T2338]  [<ffffffff8121da30>] ? try_to_wake_up+0x580/0x580 ffffc90002843eb8
[  155.404295][ T2338]  [<ffffffff81971880>] ? zvol_read+0x550/0x550 ffffc90002843f30
[  155.404891][ T2338]  [<ffffffff812037cd>] kthread+0x17d/0x200 ffffc90002843f80
[  155.405479][ T2338]  [<ffffffff81668ad0>] ? taskq_thread_spawn+0xa0/0xa0 ffffc90002843f88
[  155.406094][ T2338]  [<ffffffff81203650>] ? kthread_complete_and_exit+0x40/0x40 ffffc90002843f90
[  155.406740][ T2338]  [<ffffffff8114b238>] ret_from_fork+0x98/0xc0 ffffc90002843fb8
[  155.407313][ T2338]  [<ffffffff81203650>] ? kthread_complete_and_exit+0x40/0x40 ffffc90002843fc0
[  155.407965][ T2338]  [<ffffffff8110737f>] ret_from_fork_asm+0x3d/0xbe ffffc90002843fe0
[  155.408864][ T2338]  </TASK>
@sempervictus added the Type: Defect label Dec 14, 2023
@ryao
Contributor

ryao commented Dec 14, 2023

Nice find. At a glance, this looks serious and might explain some other bug reports.

@ryao added the Component: Memory Management label Dec 14, 2023
@sempervictus
Contributor Author

Thanks, though it's really the OSS team who "caught it" with their most recent tightening of memory access in the kernel. So far I haven't come up with anything smarter than having that macro test the sizes, at least in debug builds, so we can find out which callers are doing this. I don't know whether there's any safer/smarter way than chunking those kmap ops iteratively, but I'm guessing the various C wizards around these parts do.

@bspengler-oss

For some more context, the kmap_* API only maps a single page at a time, but the offsets being used for some accesses to those maps exceeded the page boundary. On x64 and others that don't use HIGHMEM, you simply get back the lowmem address, so as long as the memory is physically contiguous, things work even if it's wrong per the API. On i386 and other archs making use of HIGHMEM, it would instead be accessing some other kmap and use the wrong data for reading/writing.
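
In code terms, the failure mode looks roughly like this (an illustration, not the literal ZFS code; dst, skip, and len stand in for the uiomove arguments):

    void *va = kmap_atomic(bv->bv_page);    /* maps exactly one page */

    /*
     * If bv->bv_offset + skip pushes the source past PAGE_SIZE, this
     * reads beyond the single mapped page. Without HIGHMEM, va is just
     * the lowmem alias, so a physically contiguous buffer makes it
     * "work"; with HIGHMEM, the bytes past the boundary belong to
     * whatever occupies the adjacent kmap slot.
     */
    memcpy(dst, (char *)va + bv->bv_offset + skip, len);
    kunmap_atomic(va);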

@ryao
Contributor

ryao commented Dec 14, 2023

> For some more context, the kmap_* API only maps a single page at a time, but the offsets being used for some accesses to those maps exceeded the page boundary. On x64 and others that don't use HIGHMEM, you simply get back the lowmem address, so as long as the memory is physically contiguous, things work even if it's wrong per the API. On i386 and other archs making use of HIGHMEM, it would instead be accessing some other kmap and use the wrong data for reading/writing.

That explains why things are not horribly broken on amd64, despite this making things look horribly broken to me.

@robn
Member

robn commented Dec 17, 2023

I've been looking into this a bit today. It's not been easy to confirm that kmap_* is limited to single pages, but it seems plausible enough, so I'm prepared to accept it. I'm still not sure what it means on systems with non-4K pages (but where PAGE_SIZE is correct), or for compound pages (comments in the kernel suggest it's one real memory page per map, which, again, makes sense).

@sempervictus can you tell me more about this "hacked together in a few minutes fix"? Since I don't have any hardware that would naturally run into this (or I'd know about it!) and I don't have access to the grsec patches, any change I put in is going to be guesswork.

I've got a small patch that adds asserts to catch uses of mapped areas that run past a single page, and they are trivially easy to trip, which suggests to me that they're not really representative of the problem. I'd be surprised if it were that easy, because there aren't widespread reports of corruption. Unless it's actually "wrong" per the API but also basically never happens in practice.

Any and all info appreciated.

@bspengler-oss

bspengler-oss commented Dec 17, 2023

It's possible that if you convert the ZFS aliases for kmap_atomic/kunmap_atomic to kmap_local_page/kunmap_local and then enable CONFIG_DEBUG_KMAP_LOCAL, CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP, and CONFIG_DEBUG_HIGHMEM, the issue will be reproducible on a vanilla kernel.

The second option in particular will force the mapping just as on real highmem systems, so the lowmem alias can't mask the issue.
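
The conversion is mechanical if the aliases are centralized; assuming a compat header shaped roughly like this (exact names and location are from memory, so treat them as an assumption):

    /* Before: the ZFS wrappers alias the old atomic API. */
    #define zfs_kmap_atomic(page)      kmap_atomic(page)
    #define zfs_kunmap_atomic(addr)    kunmap_atomic(addr)

    /*
     * After: kmap_local_page() participates in the kernel's kmap-local
     * debugging, so CONFIG_DEBUG_KMAP_LOCAL(_FORCE_MAP) and
     * CONFIG_DEBUG_HIGHMEM can catch out-of-page accesses.
     */
    #define zfs_kmap_atomic(page)      kmap_local_page(page)
    #define zfs_kunmap_atomic(addr)    kunmap_local(addr)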

BTW, the relevant documentation is in Documentation/mm/highmem.rst:

[Legacy documentation]

> This permits a very short duration mapping of a single page. Since the
> mapping is restricted to the CPU that issued it, it performs well, but
> the issuing task is therefore required to stay on that CPU until it has
> finished, lest some other task displace its mappings.

@robn
Member

robn commented Dec 18, 2023

> It's possible that if you convert the ZFS aliases for kmap_atomic/kunmap_atomic to kmap_local_page/kunmap_local and then enable CONFIG_DEBUG_KMAP_LOCAL, CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP, and CONFIG_DEBUG_HIGHMEM, the issue will be reproducible on a vanilla kernel.

@bspengler-oss thanks, I'll give that a try sometime. A quick check suggests it all requires CONFIG_HIGHMEM, which I can't have on x86_64, only i386, so I'll have to spend a little more time getting a userspace up for that. No big deal, just more time. I appreciate the tips though!

> This permits a very short duration mapping of a single page.

Yeah, this sort of thing is why I said "hard to confirm but probably": "page" is an overloaded term. But I believe you all, so it's good enough for me at this point.
