linux: avoid splitting abd pages across bios #15414
Conversation
When we allocate a bio to submit to the Linux block layer, we divide the total ABD size into 4K "pages" and request a bio large enough to hold that many segments, up to the maximum number of segments permitted by the kernel block layer. However, a single "page" in an ABD does not necessarily correspond to one segment in a bio, because of how OpenZFS breaks ABD pages up when adding them to a bio based on the structure of their underlying memory allocation. These "split pages" are added separately, consuming two segments in the bio.

In the best case, the bio will simply have too few segments in it, requiring additional bios to be allocated. This adds a little interrupt and wait time overhead, and possibly limits further optimisation by the kernel, but is otherwise fine.

The worst case is much worse. The max segments per bio permitted by the block layer is usually 256, however individual device queues are usually significantly less than this (eg, most SCSI 4K devices present max 128). If the incoming ABD is extremely large (eg a write aggregation), it can result in a bio that the block layer will accept but that is too big for a device queue. When this happens, the block layer will split the bio. It does this by bisecting the bio's segment array and producing two bios, one for each side. If we're unlucky, the split point is in the middle of two segments that together formed a 4K chunk, which results in two bios that are not properly 4K-aligned. These will be rejected by the SCSI driver:

Jul 11 17:55:17 A2-U40 kernel: sd 7:0:90:0: [sdcl] tag#8750 request not aligned to the logical block size
Jul 11 17:55:17 A2-U40 kernel: blk_update_request: I/O error, dev sdcl, sector 16416600664 op 0x1:(WRITE) flags 0x4700 phys_seg 128 prio class 0
Jul 11 17:55:17 A2-U40 kernel: sd 7:0:90:0: [sdcl] tag#8760 request not aligned to the logical block size
Jul 11 17:55:17 A2-U40 kernel: blk_update_request: I/O error, dev sdcl, sector 16416602503 op 0x1:(WRITE) flags 0x700 phys_seg 23 prio class 0
Jul 11 17:55:17 A2-U40 kernel: zio pool=pool-13395907298487379724 vdev=/dev/disk/by-id/wwn-0x5000cca28408a8ac-part1 error=5 type=2 offset=8405291151360 size=1032192 flags=40080c80

The right solution would be to never add pages to a bio that are not complete and properly aligned, however doing this requires some serious work in the SPL memory allocator and a rethink of how ABDs are constructed and iterated over. Instead, this commit works around the problem by ensuring the bio is never big enough to need to be split. If we never allocate more segments than the device maximum, then the bio won't need to be split, so the "split pages" will always remain together and the overall bio will stay properly aligned.

This also adds a tuneable, vdev_disk_max_segs, to allow this value to be set by the operator. This is very useful for debugging.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
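To make the approach above concrete, here is a minimal hedged sketch of how the requested segment count might be clamped to the device queue limit and the new tuneable. The helper name `vdev_bio_segs_wanted` is hypothetical; this is not the actual vdev_disk.c code, just an illustration of the clamping described in the commit message.

```c
#include <linux/blkdev.h>

/* tuneable from the commit message; 0 means "use the device limit" */
static unsigned long vdev_disk_max_segs = 0;

/*
 * Hedged sketch, hypothetical name: pick a segment count for a new bio.
 * One segment per 4K "page" of data, but never more than the device
 * queue will accept, so the block layer never has to split the bio and
 * a "split page" pair can never be separated.
 */
static unsigned int
vdev_bio_segs_wanted(struct request_queue *q, size_t io_size)
{
	unsigned int want = (io_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
	unsigned int limit = queue_max_segments(q);

	if (vdev_disk_max_segs > 0 && vdev_disk_max_segs < limit)
		limit = vdev_disk_max_segs;	/* operator override */

	return (want < limit ? want : limit);
}
```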
In the previous commit we prevent the kernel from inadvertently splitting a bio in the middle of a "split page". However, there are situations where OpenZFS itself can cause this same situation. If we encounter a split page in the ABD, we need to add two chunks to the bio, one for each "half" of the split. However, that requires the bio to have two segments remaining: if there's only one, we must not add any more to this bio; the caller (`__vdev_disk_physio()`) will then create a new bio and we can continue to add chunks to it.

A linear ABD actually has multiple pages in a single long run. A "split" is seen as a short "left" fragment, then a run of proper 4K pages, then a "right" part. These are all "split", but we don't want to submit two segments for every 4K chunk. So instead, we track when we see the "left" part, and its size, and if we get to the end of the bio and we haven't yet added a matching "right" part of the correct size to fix the alignment, we compute the right amount to add to close out the bio properly aligned.

We also make this change for scatter ABDs. Scatter chunks can be split in the same way, though they are always added one 4K chunk at a time, never in long runs, so the "left" and "right" parts will always be next to one another. However, it's theoretically possible for a gang ABD composed of a linear and a scatter ABD to finish the linear part at an "odd" segment number; the scatter part will then be adding pages in pairs, and could end up splitting the last page across two bios.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
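A minimal sketch of the "two segments remaining" rule described above, with a hypothetical helper name (the real logic lives in abd_bio_map_off() and is more involved): a chunk that is the "left" half of a split page is only added if the bio still has room for the matching "right" half.

```c
#include <linux/bio.h>

/*
 * Hedged sketch, hypothetical helper: both halves of a split page need a
 * bio segment each.  If fewer than two segments remain, stop filling
 * this bio; __vdev_disk_physio() will allocate another and we continue
 * there, so the two halves always land in the same bio.
 */
static bool
bio_can_take_split_pair(const struct bio *bio)
{
	return ((bio->bi_max_vecs - bio->bi_vcnt) >= 2);
}
```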
With the max segments for the bio now generally much smaller, and varying depending on the device/driver characteristics, the initial array size of 16 is totally inadequate. The main reason for getting this right is so we don't have to throw away all the bios built so far and restart. However there's no particular reason to do it this way; the bios constructed so far are perfectly fine, we just ran out of room to track them. So this commit changes it so that if we run out of room in the bio array, instead of throwing away the contents entirely, we just create a larger one, copy the existing bios into it, and keep going.

Finally, to decide the right initial size for the bio array, we take a guess at the likely size by dividing the IO size by the page size by the max device segments. This is likely right most of the time, and if it turns out to not be enough, we simply expand the array and keep going.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
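For illustration only, here is a hedged sketch of the "grow and copy" behaviour this commit describes, using the SPL kmem interfaces and a plain pointer array with a hypothetical helper name (not the actual structure or code). The initial capacity would be estimated as roughly the IO size divided by the page size divided by the device's max segments, and simply doubled when it runs out.

```c
#include <linux/bio.h>
#include <linux/string.h>
#include <sys/kmem.h>

/*
 * Hedged sketch: when the bio pointer array fills up, allocate a larger
 * one, copy the bios built so far into it, and keep going, rather than
 * throwing the existing bios away and restarting.
 */
static void
vbio_array_grow(struct bio ***biosp, int *capp)
{
	int oldcap = *capp;
	int newcap = oldcap * 2;
	struct bio **newarr =
	    kmem_zalloc(newcap * sizeof (struct bio *), KM_SLEEP);

	memcpy(newarr, *biosp, oldcap * sizeof (struct bio *));
	kmem_free(*biosp, oldcap * sizeof (struct bio *));

	*biosp = newarr;
	*capp = newcap;
}
```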
Notes on memory allocation and ABD construction

At its heart, the problem stems from the fact that memory allocations backing ABDs are not always from aligned, contiguous memory ranges, which is why OpenZFS splits them when adding them to a bio.

It appears that most of the time it would actually be safe to submit regions that cross page boundaries, provided they are contiguous. However, I am not sure about this. The kernel appears to have a lot of different ways to think about a "page", and it's not always clear to me where the rules lie. It gets harder in later kernels with "page folios" and such. So I'm loath to tackle this without a much clearer understanding.

It's smaller allocations that seem to be the real problem. If I'm understanding it correctly, the SPL allocator may service "small" allocations by carving up larger regions. For example, a 512b allocation and a 4K allocation might be served from a contiguous run, the first allocation at off=0 len=512, the second at off=512 len=4096. It's this second kind of allocation that is the problem here, as it is not page-aligned and so crosses a page boundary.

Linear ABDs appear to always be fine, because they're rarely small enough to use shared slabs like this, instead getting their own large and isolated allocation (I did think for a while that it might be possible in very rare situations for a linear allocation to be stitched together from bits of multiple different slabs, but I can't remember why I thought this. If it is possible, it must be exceedingly rare). Scatter ABDs appear to always be backed by 4K pages, except in the … Gang ABDs are simply backed by other ABDs, which are naturally non-contiguous. If the bio is not aligned after the last chunk of a child ABD is consumed, then the next one will begin by being artificially split to try to balance out the entire page, even if the child ABD itself is properly aligned.
This is why even if we switch to not splitting contiguous memory ranges, we still have a problem if a gang ABD passes an unaligned ending position on one child as an unaligned offset on the next child.

I looked into all of this and decided that to create a complete fix that I could feel confident in would require a detailed study of Linux memory allocation and page construction in all its versions up to the very latest, then gaining a solid understanding of the SPL allocators, and a rewrite or significant restructure of ABDs, possibly even replacing them with OS-specific constructions (eg scatterlists). That's miles above my pay grade at the moment, hence this far more modest change to try to sidestep the conditions that lead to the actual problem.
Notes on FreeBSD

I did look into what FreeBSD does here to see if it might be affected by the same issue. I didn't look very closely, but at the least, it scans the ABD up front to see if it looks like a virtually-contiguous region. If it does, it appears to map page pointers directly into the GEOM bio; if not, it linearises the ABD and goes from there. If there are similar concepts inside GEOM around device segment limits and bio splitting, then FreeBSD might have a similar issue; I did not look that deeply. If there is, however, I suspect that any fix is going to be a separate unit of work, so I'm not going to investigate further for this PR.
@robn The FreeBSD block layer has no limit on the number of segments. It has limits only on total I/O size, and each disk driver must be able to handle the worst case of segmentation for a misaligned virtually-contiguous buffer, that is: ((size + PAGE_SIZE - 1) / PAGE_SIZE) + 1 segments. So while the code in vdev_geom is quite aggressive in linearizing the ABDs, it should be completely safe after that. Since the beginning and the end of the virtually-contiguous buffer do not require any alignment, the block layer may split large I/Os on any sector boundary it likes without any problems or complications.
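For readers working through the worst-case formula above, a trivial standalone sketch (not FreeBSD code) shows how it plays out for a buffer that starts mid-page:

```c
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Worst-case segment count for a misaligned virtually-contiguous buffer. */
static size_t
worst_case_segs(size_t size)
{
	return (((size + PAGE_SIZE - 1) / PAGE_SIZE) + 1);
}

int
main(void)
{
	/* eg a 128K buffer starting mid-page can touch up to 33 pages */
	printf("%zu\n", worst_case_segs(128 * 1024));	/* prints 33 */
	return (0);
}
```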
I haven't looked at your code yet, sorry if I am wrong, but your explanation about the use of small buffers and blaming memory allocation for it sounds wrong to me. The problem I see is in attempting to use non-virtually-contiguous buffers with segments that are not a multiple of the sector size (ashift). Those segments may appear either as a result of aggregation of I/Os that are not a multiple of PAGE_SIZE, even if they are a multiple of a smaller sector size, or because a single I/O itself includes several ABDs not a multiple of PAGE_SIZE (which I understand is the case for gang blocks, but I may be wrong, I need to look at it closer). There are still plenty of storage devices that physically cannot handle segments not aligned to PAGE_SIZE, aside from the beginning and the end. Sure, there are some, like very new NVMe and other controllers, that can do arbitrary byte-level scatter/gather, but until those are 100%, we cannot rely on that. Also I suppose an IOMMU could be able to handle that, but we cannot rely on that either. I don't know whether Linux does anything clever in that case, allowing ZFS to understand the segmentation requirements of a specific storage device, but considering this problem is reported, it either doesn't, or it is just not used properly. For example, if ZFS tries to write a bunch of random 512 byte blocks, aggregated by the vdev code into one 128KB I/O, while the underlying device can only do page-aligned segments, then something must do the linearization, or there will be an error or data corruption. In the case of FreeBSD I've done that explicitly in vdev_geom -- it is primitive and expensive, but it works right. Whether Linux can do better, either by supplying ZFS with more data about the device segmentation requirements or by doing linearization in some cool way like an IOMMU with automatic software fallback, is the main question for me here.
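For reference, a minimal sketch of the linearization approach described here, assuming the OpenZFS abd_borrow_buf_copy()/abd_return_buf_copy() helpers. This is not the actual vdev_geom code; error handling and the copy-back path for reads are omitted, and the wrapper function name is hypothetical.

```c
#include <sys/abd.h>

/*
 * Hedged sketch: copy a possibly non-contiguous ABD into a single
 * virtually-contiguous buffer before handing it to the block layer,
 * then return the buffer afterwards.
 */
static void
submit_linearized(abd_t *abd, size_t size)
{
	void *buf = abd_borrow_buf_copy(abd, size);

	/* ... build and submit the I/O from 'buf' ... */

	abd_return_buf_copy(abd, buf, size);
}
```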
I don't think much can be done in general from the ABD or allocation side. If the pool ashift is smaller than PAGE_SIZE, then when doing I/O aggregation we by definition get segments not aligned to PAGE_SIZE, whether that is caused by gang blocks, compression, or just small objects/blocks. I took a quick look at the gang code and the only "crime" I can guess at so far is that it splits a large I/O into ashift-aligned chunks, which would be just fine if not for the aggregation. Am I missing something? The FreeBSD block layer does not allow non-PAGE_SIZE-aligned segments aside from the first and the last, which partially grows from the hardware limitations, so the choice is only to either linearize the buffer, or not aggregate I/Os that are not a multiple of PAGE_SIZE and hope ZFS never otherwise produces gang ABDs with chunks not aligned to PAGE_SIZE except at the beginning and the end. Short sub-page I/Os by themselves are not a problem; it is the aggregation of them that is. But disks may be not very happy to receive a stream of unaggregated 512 byte writes, which is why I preferred linearization. But maybe we could do both, trading between extra disk I/Os and memory copying. I am really curious whether modern Linux allows us to do something more than that.
I may have gone in too hard on blaming the memory allocator as such, but it's definitely true that the initial problem comes from the way the memory backing the ABD is laid out. I'm sorry I'm not explaining it very well; it's a complicated (for me) thing that was hard to track down, it's hard to describe in text, and I don't fully understand all of it anyway.

Consider a slab of PAGE_SIZE=4K aligned memory. This can and does happen: a small allocation A is carved off the front of the first page (say 512 bytes at offset 0), and a 4K allocation B follows it (at offset 512), so B starts partway through one page and finishes partway through the next.

If B is backing an ABD (LINEAR or LINEAR_PAGE scatter), then when it is mapped into a bio it gets added as two segments, one for the part of B in each page it touches, roughly as sketched below.
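A hedged illustration of that mapping, using the offsets from the example above (this is not the real abd_bio_map_off() code, and the function name is made up):

```c
#include <linux/bio.h>

/*
 * Hedged illustration: allocation B starts 512 bytes into page 0 and
 * spills 512 bytes into page 1, so mapping its 4K of data into a bio
 * takes two segments, one per physical page touched.
 */
static void
map_split_alloc(struct bio *bio, struct page *page0, struct page *page1)
{
	/* "left" part: 3584 bytes starting at offset 512 in the first page */
	(void) bio_add_page(bio, page0, PAGE_SIZE - 512, 512);
	/* "right" part: the remaining 512 bytes at the start of the next page */
	(void) bio_add_page(bio, page1, 512, 0);
}
```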
This alone is ok, because the bio is 4K-aligned. The problem comes if Linux decides to split the bio, and splits between those two segments. Splitting is more likely when the bio becomes large (on Linux SCSI with ashift=12, that's always at 512K, easy to hit with aggregation), and the misaligned split is easier to hit if there's a lot of these "split" pages in the bio. Gang blocks make it even easier to hit, because their small size makes it even more likely that the allocator will produce non-PAGE_SIZE allocations.

I'm fairly sure my patches work around this, by keeping the bios small enough that Linux will never need to split them. The reason I didn't just fix it outright (eg by copying the FreeBSD method, which I think is good) is both because I assumed the existing method of not crossing PAGE_SIZE boundaries was necessary for some long-forgotten situation, and because I couldn't confirm by reading the Linux code exactly what the rules are.

And thus, "miles above my pay grade" - I feel like there's way too much I need to understand before I could feel confident in the "right" fix, and the customer needed something, so here we are. But that was a few months ago; I think I know my way around it a bit better now.
The case we were seeing was a 1 MB record, being written as a gang block due to fragmentation. The gang header is 512b, but the pool has an ashift of 12. The maximum number of segments per bio was limited (64 or 128 iirc), so after prefixing the ABD of 4K-aligned pages with a 512b header (and a 3584 byte footer to keep the I/O aligned), the bio would get split. Since the first item in the vec was not aligned, it would make the entire bio not a multiple of the disk's sector size, and it would get refused by the SCSI driver.
@allanjude I was trying to look at that gang block scenario earlier, but as far as I could see, the 512 bytes is only the logical block size for the gang header. The physical size should be padded by the allocation code considering the vdev ashift. And zio_vdev_io_start() should allocate a linear buffer and copy the data, so a vdev with ashift=12 should never see the 512 byte access or ABD.
Can this be the root cause for (some of) the issues reported in #10094?
Withdrawing this. I'm just finishing testing on a patch that fixes this original issue properly rather than working around it (largely copying the FreeBSD approach: if the ABD is not sensibly aligned, make a copy). I'll post a PR for that within the next couple of days. Thanks all for the feedback!
For future readers, #15588 is my "fix it properly".
Motivation and Context
In certain situations, it's possible for OpenZFS to submit bios to the kernel that are misaligned, or that can become misaligned after modification by the kernel. Some devices will reject misaligned bios, leading to IO errors being returned to OpenZFS (the "request not aligned to the logical block size" errors quoted in the first commit message).
This came out of a complex situation at a customer site. We found that reducing zfs_vdev_aggregation_limit down to ~500K was a sufficient workaround, which also gave us something to go on.

There's a lot more detail in the commit messages, but the summary is that there are two factors: ABD "pages" whose backing memory crosses an allocation boundary are added to the bio as two segments, and a large aggregated write can produce a bio with more segments than the device queue will accept, which the kernel then splits.
If the kernel splits the bio right in the middle of the “pair” of segments for a full page, then the resulting bios are themselves misaligned, and will be rejected by some devices (notably, the SCSI driver).
The real solution to this issue is to never add less than a full page to a single bio segment, but doing that appears to require significant change in how the SPL memory allocator does its work and in how ABDs are constructed (I'll describe some of my findings in the comments below).
Instead, this PR works around the problem by better controlling the way the bio is created and filled, to ensure it never needs to be split.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Description
This PR ensures that a pair of segments representing a single page are never split, by making sure a bio never has more segments than the device will accept (so the kernel never needs to split it), and by making abd_bio_map_off() never end a bio on the "left" side of a split: if both segments of a page can't be included in the same bio, stop and wait for the next bio.

The commit messages have a lot more discussion of the details.
How Has This Been Tested?
Reproducing this case is complicated, as we need to: produce ABDs backed by memory that crosses page boundaries, build IOs large enough that the bio exceeds the device's segment limit, get the kernel to split the bio between the two halves of a "split page", and have a driver that rejects misaligned requests.
The last one means the SCSI driver, unless there’s another I don’t know about.
The first three can be demonstrated by forcing all allocations to be gang blocks (note that gang blocks are not implicated; they’re just a convenient way to force a lot of small IO allocations).
Without this PR, this test should yield "request not aligned to the logical block size" errors almost immediately. With it, everything should be fine.

Earlier versions of these patches are currently in testing at the customer, with additional warnings in place to notice any misaligned bios as they are submitted to the block layer. So far they have not reported anything under representative workloads, so I'm confident the approach is right. The question is just whether or not the changes are sufficiently general to work everywhere.