
HIGHMEM violation by kmap in zfs_uiomove_bvec_impl #15668

Open
sempervictus opened this issue Dec 14, 2023 · 7 comments
Labels: Component: Memory Management (kernel memory management) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@sempervictus
Contributor

System information

Type                  Version/Name
Distribution Name     arch
Distribution Version  current
Kernel Version        6.6 (grsec)
Architecture          x64
OpenZFS Version       2.2.2

Describe the problem you're observing

Upstream Linux doesn't guard memory nearly as well as grsecurity does, "allowing" a kmap where bv_offset + skip > PAGE_SIZE on x86_64; the same access would likely crash on x86. The "hacked together in a few minutes" fix (by much better hackers) currently keeping my system alive changes the kmap and memcpy calls to use PAGE_MASK arithmetic so we never cross the page boundary, but there should probably be a ZFS-correct solution here: we're occasionally performing very low-level memory operations "illegally" on kernels that don't enforce the constraint, the violation is bound to bite eventually, and it may have impacts on other architectures.
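
Roughly, the boundary-respecting pattern looks like the sketch below. This is a hypothetical illustration, not the actual patch; the helper name copy_from_bvec_page_safe, the read-only shape, and the assumption that a multi-page bvec's struct pages are contiguous are all mine (MIN is the stock SPL macro):

    static void
    copy_from_bvec_page_safe(void *dst, const struct bio_vec *bv,
        size_t skip, size_t len)
    {
        while (len > 0) {
            size_t off = bv->bv_offset + skip;
            /*
             * Page containing the next byte; assumes the bvec's
             * struct pages are contiguous (as for compound pages).
             */
            struct page *pg = bv->bv_page + (off >> PAGE_SHIFT);
            size_t poff = off & ~PAGE_MASK;  /* offset within that page */
            /* Clamp so the copy never runs off the one mapped page. */
            size_t chunk = MIN(len, PAGE_SIZE - poff);
            void *va = kmap_local_page(pg);

            memcpy(dst, (char *)va + poff, chunk);
            kunmap_local(va);

            dst = (char *)dst + chunk;
            skip += chunk;
            len -= chunk;
        }
    }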

Given that this bug was found in UIO memory management/movement code, I'm curious whether it poses any risk to data integrity under other runtime conditions, where tripping the fault wouldn't crash but would lose an IO (or a segment thereof), since the read-back could similarly kmap incorrectly.

Describe how to reproduce the problem

  1. Use a kernel which enforces the single-page kmap constraint (or write an assertion into the zfs_kmap* macros to do so; a sketch of such an assertion follows this list).
  2. Perform a large send/recv on the same system with heavily fragmented ZVOLs (that seems to be what tripped mine), or run a number of import/export passes on a pool containing such ZVOLs.
  3. Observe the violation.
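
One possible shape for that assertion (hypothetical: the existing zfs_kmap* wrappers only take a page, so the caller would have to pass the intended offset and length through something like the macro below; ASSERT3U is the stock ZFS assertion macro):

    /*
     * Hypothetical debug check: fail if an access of `len' bytes
     * starting at `off' would cross out of the single mapped page.
     */
    #define zfs_kmap_access_ok(off, len) \
        ASSERT3U(((off) & (PAGE_SIZE - 1)) + (len), <=, PAGE_SIZE)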

Include any warning/errors/backtraces from the system logs

[root@svl-arch00 ~]# zpool import dpool -t tpool -R /mnt/dst -N
[  155.381372][ T2338] BUG: unable to handle page fault for address: ffffffffff53d000 (fixmap[slot 706 local idx #31 cpu 1]+0x0/0x1000)
[  155.382255][ T2338] #PF: supervisor write access in kernel mode
[  155.382668][ T2338] #PF: error_code(0x0002) - not-present page
[  155.383092][ T2338] PGD 0xffff888002898ff8 0000000002ad1063 
[  155.383092][ T2338] P4D 0xffff888002898ff8 0000000002ad1063 
[  155.383092][ T2338] PUD 0xffff888002ad1ff8 0000000002bd5063 
[  155.383092][ T2338] PMD 0xffff888002bd5fd0 8000000002de7063 
[  155.383092][ T2338] PTE 0xffff888002de79e8 0000000000000000 0xffff888000000000
[  155.384988][ T2338] Oops: 0002 [#1] PREEMPT SMP
[  155.385281][ T2338] CPU: 2 PID: 2338 Comm: zvol Tainted: G                TN  6.6.6 #1 49413849afb8da965cdd4351c9c93e35cb1207f4
[  155.386001][ T2338] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  155.386636][ T2338] RIP: 0010:[<ffffffff824ebae9>] memcpy+0x9/0x20
[  155.387046][ T2338] Code: 00 c3 cc 0f 1f 40 00 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 48 b8 00 00 00 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 d1 <f3> a4 48 0f ba 2c 24 3f c3 cc 0f 1f 40 00 66 0f 1f 84 00 00 00 00
[  155.388274][ T2338] RSP: 0000:ffffc90002843c68 EFLAGS: 00010246
[  155.388659][ T2338] RAX: ffffffffff53c000 RBX: ffffc90002843dd8 RCX: 0000000000001000
[  155.389161][ T2338] RDX: 0000000000002000 RSI: ffff888160833000 RDI: ffffffffff53d000
[  155.389667][ T2338] RBP: ffffc90002843d40 R08: 0000000000000000 R09: ffff888160832000
[  155.390166][ T2338] R10: ffffc90002843d58 R11: ffff888162152728 R12: c0fa228c307599a0
[  155.390693][ T2338] R13: 0000000000002000 R14: 0000000000000000 R15: ffff8881591ca040
[  155.391194][ T2338] RAX: fixmap[slot 707 local idx #0 cpu 2]+0x0/0x1000
[  155.391626][ T2338] RBX: vmalloc[copy_process]+0x683/0x3880
[  155.391999][ T2338] RSI: zio_buf_comb_8192+0x1000/0x2000 [slab object]
[  155.392418][ T2338] RDI: fixmap[slot 706 local idx #31 cpu 1]+0x0/0x1000
[  155.392844][ T2338] RBP: vmalloc[copy_process]+0x683/0x3880
[  155.393200][ T2338] RSP: vmalloc[copy_process]+0x683/0x3880
[  155.393547][ T2338] R09: zio_buf_comb_8192+0x0/0x2000 [slab object]
[  155.393955][ T2338] R10: vmalloc[copy_process]+0x683/0x3880
[  155.394328][ T2338] R11: zio_cache+0x0/0x500 [slab object]
[  155.394687][ T2338] R15: autoslab_size_M_fops_P_block_fops_62_10_62_10_S_512_A_8_n_11+0x40/0x200 [slab object]
[  155.395331][ T2338] FS:  0000000000000000(0000) GS:ffff88823d400000(0000) knlGS:0000000000000000
[  155.395893][ T2338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  155.396289][ T2338] CR2: ffffffffff53d000 CR3: 0000000002898004 CR4: 0000000000760ef0 shadow CR4: 0000000000760ef0
[  155.396948][ T2338] CR2: fixmap[slot 706 local idx #31 cpu 1]+0x0/0x1000
[  155.397365][ T2338] PKRU: 55555554
[  155.397583][ T2338] ASID: 0003
[  155.397786][ T2338] Stack:
[  155.397963][ T2338]  ffffffff81942409 3f05dd73b10c3321 3f05dd73b10c257c 0000000000000000
[  155.398476][ T2338]  0000000000000010 0000000000002000 0000000000000000 ffff888120eb91c0
[  155.398982][ T2338]  ffff888120eb9288 0000000000002000 0000000000000000 0000000000002000
[  155.399504][ T2338] Call Trace:
[  155.399711][ T2338]  <TASK>
[  155.400045][ T2338]  [<ffffffff81942409>] zfs_uiomove_bvec_impl+0x159/0x430 ffffc90002843c68
[  155.400704][ T2338]  [<ffffffff8179bcdc>] ? dmu_read_uio_dnode+0xdc/0x170 ffffc90002843d48
[  155.401340][ T2338]  [<ffffffff8179bcdc>] dmu_read_uio_dnode+0xdc/0x170 ffffc90002843d50
[  155.401950][ T2338]  [<ffffffff8197147d>] zvol_read+0x14d/0x550 ffffc90002843da8
[  155.402514][ T2338]  [<ffffffff819718a1>] zvol_read_task+0x21/0x80 ffffc90002843e40
[  155.403099][ T2338]  [<ffffffff81668e90>] taskq_thread+0x3c0/0x780 ffffc90002843e58
[  155.403689][ T2338]  [<ffffffff8121da30>] ? try_to_wake_up+0x580/0x580 ffffc90002843eb8
[  155.404295][ T2338]  [<ffffffff81971880>] ? zvol_read+0x550/0x550 ffffc90002843f30
[  155.404891][ T2338]  [<ffffffff812037cd>] kthread+0x17d/0x200 ffffc90002843f80
[  155.405479][ T2338]  [<ffffffff81668ad0>] ? taskq_thread_spawn+0xa0/0xa0 ffffc90002843f88
[  155.406094][ T2338]  [<ffffffff81203650>] ? kthread_complete_and_exit+0x40/0x40 ffffc90002843f90
[  155.406740][ T2338]  [<ffffffff8114b238>] ret_from_fork+0x98/0xc0 ffffc90002843fb8
[  155.407313][ T2338]  [<ffffffff81203650>] ? kthread_complete_and_exit+0x40/0x40 ffffc90002843fc0
[  155.407965][ T2338]  [<ffffffff8110737f>] ret_from_fork_asm+0x3d/0xbe ffffc90002843fe0
[  155.408864][ T2338]  </TASK>
@sempervictus added the Type: Defect label Dec 14, 2023
@ryao
Contributor

ryao commented Dec 14, 2023

Nice find. At a glance, this looks serious and might explain some other bug reports.

@ryao added the Component: Memory Management label Dec 14, 2023
@sempervictus
Contributor Author

Thanks, though it's really the OSS team who "caught it" with their most recent tightening of memory access in the kernel. So far I haven't come up with anything smarter than having that macro test the sizes, at least in debug builds, so we can find out which callers are doing this. I don't know whether there's any safer/smarter way than chunking those kmap ops iteratively, but I'm guessing the various C wizards around these parts do.

@bspengler-oss

For some more context, the kmap_* API only maps a single page at a time, but the offsets being used for some accesses to those maps exceeded the page boundary. On x64 and others that don't use HIGHMEM, you simply get back the lowmem address, so as long as the memory is physically contiguous, things work even if it's wrong per the API. On i386 and other archs making use of HIGHMEM, it would instead be accessing some other kmap and use the wrong data for reading/writing.
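
In code terms, the failure mode looks roughly like this (an illustration, not the literal ZFS code; dst, skip, and len stand in for the uiomove arguments):

    void *va = kmap_atomic(bv->bv_page);    /* maps exactly one page */

    /*
     * If bv->bv_offset + skip pushes the source past PAGE_SIZE, this
     * reads beyond the single mapped page. Without HIGHMEM, va is just
     * the lowmem alias, so a physically contiguous buffer makes it
     * "work"; with HIGHMEM, the bytes past the boundary belong to
     * whatever occupies the adjacent kmap slot.
     */
    memcpy(dst, (char *)va + bv->bv_offset + skip, len);
    kunmap_atomic(va);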

@ryao
Contributor

ryao commented Dec 14, 2023

> For some more context, the kmap_* API only maps a single page at a time, but the offsets being used for some accesses to those maps exceeded the page boundary. On x64 and others that don't use HIGHMEM, you simply get back the lowmem address, so as long as the memory is physically contiguous, things work even if it's wrong per the API. On i386 and other archs making use of HIGHMEM, it would instead be accessing some other kmap and use the wrong data for reading/writing.

That explains why things are not horribly broken on amd64, despite this making things look horribly broken to me.

@robn
Member

robn commented Dec 17, 2023

I've been looking into this a bit today. It's not been easy to confirm that kmap_* is limited to single pages, but it seems plausible enough, so I'm prepared to accept it. I'm still not sure what it means on systems with non-4K pages (but where PAGE_SIZE is correct), or for compound pages (comments in the kernel suggest it's one real memory page per map, which, again, makes sense).

@sempervictus can you tell me more about this "hacked together in a few minutes fix"? Since I don't have any hardware that would naturally run into this (or I'd know about it!) and I don't have access to the grsec patches, any change I put in is going to be guesswork.

I've got a small patch that adds asserts to catch uses of mapped areas that run past a single page, and they are trivially easy to trip, which suggests to me that they're not really representative of the problem. I'd be surprised if it were that easy, because there aren't widespread reports of corruption. Unless it's actually "wrong" per the API but also basically never happens in practice.

Any and all info appreciated.

@bspengler-oss

bspengler-oss commented Dec 17, 2023

It's possible that if you convert the ZFS aliases for kmap_atomic/kunmap_atomic to kmap_local_page/kunmap_local and then enable CONFIG_DEBUG_KMAP_LOCAL, CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP, and CONFIG_DEBUG_HIGHMEM, the issue will be reproducible on a vanilla kernel.

The second option in particular will force the mapping just as on real highmem systems, so the lowmem alias can't mask the issue.
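
The conversion is mechanical if the aliases are centralized; assuming a compat header shaped roughly like this (exact names and location are from memory, so treat them as an assumption):

    /* Before: the ZFS wrappers alias the old atomic API. */
    #define zfs_kmap_atomic(page)      kmap_atomic(page)
    #define zfs_kunmap_atomic(addr)    kunmap_atomic(addr)

    /*
     * After: kmap_local_page() participates in the kernel's kmap-local
     * debugging, so CONFIG_DEBUG_KMAP_LOCAL(_FORCE_MAP) and
     * CONFIG_DEBUG_HIGHMEM can catch out-of-page accesses.
     */
    #define zfs_kmap_atomic(page)      kmap_local_page(page)
    #define zfs_kunmap_atomic(addr)    kunmap_local(addr)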

BTW, the relevant documentation is in Documentation/mm/highmem.rst:

[Legacy documentation]

> This permits a very short duration mapping of a single page. Since the
> mapping is restricted to the CPU that issued it, it performs well, but
> the issuing task is therefore required to stay on that CPU until it has
> finished, lest some other task displace its mappings.

@robn
Member

robn commented Dec 18, 2023

> It's possible that if you convert the ZFS aliases for kmap_atomic/kunmap_atomic to kmap_local_page/kunmap_local and then enable CONFIG_DEBUG_KMAP_LOCAL, CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP, and CONFIG_DEBUG_HIGHMEM, the issue will be reproducible on a vanilla kernel.

@bspengler-oss thanks, I'll give that a try sometime. A quick check suggests it all requires CONFIG_HIGHMEM, which I can't have on x86_64, only i386, so I'll have to spend a little more time getting a userspace up for that. No big deal, just more time. I appreciate the tips though!

> This permits a very short duration mapping of a single page.

Yeah, this sort of thing is why I said "hard to confirm but probably": "page" is an overloaded term. But I believe you all, so it's good enough for me at this point.
