
linux: avoid splitting abd pages across bios #15414

Closed
wants to merge 3 commits into from

Conversation


@robn robn commented Oct 18, 2023

Motivation and Context

In certain situations, it's possible for OpenZFS to submit bios to the kernel that are misaligned, or that can become misaligned after modification by the kernel. Some devices will reject misaligned bios, leading to IO errors being returned to OpenZFS:

    Jul 11 17:55:17 A2-U40 kernel: sd 7:0:90:0: [sdcl] tag#8750 request not aligned to the logical block size
    Jul 11 17:55:17 A2-U40 kernel: blk_update_request: I/O error, dev sdcl, sector 16416600664 op 0x1:(WRITE) flags 0x4700 phys_seg 128 prio class 0
    Jul 11 17:55:17 A2-U40 kernel: sd 7:0:90:0: [sdcl] tag#8760 request not aligned to the logical block size
    Jul 11 17:55:17 A2-U40 kernel: blk_update_request: I/O error, dev sdcl, sector 16416602503 op 0x1:(WRITE) flags 0x700 phys_seg 23 prio class 0
    Jul 11 17:55:17 A2-U40 kernel: zio pool=pool-13395907298487379724 vdev=/dev/disk/by-id/wwn-0x5000cca28408a8ac-part1 error=5 type=2 offset=8405291151360 size=1032192 flags=40080c80

This came out of a complex situation at a customer site, involving:

  • a write-heavy load, such that aggregation past 512K happened quite frequently
  • a pool approaching 90% fragmentation, such that gang blocks were being produced (this is significant only insofar as gang blocks are backed by small memory allocations, which exacerbate the problem)

We found that reducing zfs_vdev_aggregation_limit down to ~500K was a sufficient workaround, which also gave us something to go on.

There’s a lot more detail in the commit messages, but the summary is that there are two factors:

  • we frequently create bios that are too large for the underlying device, requiring the kernel to split them
  • we quite often add single pages to bios in two segments, which taken together are aligned, but individually are not

If the kernel splits the bio right in the middle of the “pair” of segments for a full page, then the resulting bios are themselves misaligned, and will be rejected by some devices (notably, the SCSI driver).

The real solution to this issue is to never add less than a full page to a single bio segment, but doing that appears to require significant changes to how the SPL memory allocator does its work and to how ABDs are constructed (I'll describe some of my findings in a comment below).

Instead, this PR works around the problem by better controlling the way the bio is created and filled, to ensure it never needs to be split.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Description

This PR ensures that a pair of segments representing a single page are never split by:

  • ensuring that bios are sized correctly for the device they’re being submitted to such that they will never need to be split by the kernel
  • ensuring that abd_bio_map_off() never ends a bio on the “left” side of a split. If both segments of a page can’t be included in the same bio, then stop and wait for the next bio

The commit messages have a lot more discussion of the details.
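
As a rough illustration of the first point, here is a minimal sketch of sizing a bio to the device's own segment limit. This is not the actual patch: bdev, abd, io_size and io_offset are assumed to be in scope, and bio_alloc()'s signature varies across kernel versions.

    /*
     * Sketch only: queue_max_segments() and bio_alloc() are standard kernel
     * interfaces; abd_nr_pages_off() is the existing OpenZFS helper that
     * counts 4K "pages" in an ABD range. Capping the request at the device
     * queue's limit means the block layer never has to split the bio.
     */
    struct request_queue *q = bdev_get_queue(bdev);
    unsigned int dev_max_segs = queue_max_segments(q);
    unsigned int abd_segs = abd_nr_pages_off(abd, io_size, io_offset);
    unsigned int nr_vecs = MIN(abd_segs, dev_max_segs);
    struct bio *bio = bio_alloc(GFP_NOIO, nr_vecs);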

How Has This Been Tested?

Reproducing this case is complicated, as we need to:

  • generate lots of large IOs;
  • composed of small allocations such that many of them are offset so they span more than one page;
  • so as to raise the possibility that (a) a bio will be split (b) in between two halves of a “split page”;
  • with a device driver or other stage below the block layer that actually cares about these misalignments and reports on them.

The last one means the SCSI driver, unless there’s another I don’t know about.

The first three can be demonstrated by forcing all allocations to be gang blocks (note that gang blocks are not implicated; they’re just a convenient way to force a lot of small IO allocations).

    echo 32768 > /sys/module/zfs/parameters/metaslab_force_ganging
    echo 100 > /sys/module/zfs/parameters/metaslab_force_ganging_pct

    zpool create -o ashift=12 -O recordsize=1M -O compression=off tank raidz1 sda sdb ...

    fio --name=gang --directory=/tank --ioengine=pvsync2 --rw=randrw --nrfiles=32 --filesize=1-8M --bsrange=1k-4k --runtime=60 --time_based

Without this PR, this test should yield “request not aligned to the logical block size” errors almost immediately. With it, everything should be fine.

Earlier versions of these patches are currently in testing at the customer with additional warnings in place to notice any misaligned bios as they are submitted to the block layer. So far they have not reported anything under representative workloads, so I’m confident the approach is right. The question is just whether or not they’re sufficiently general to work everywhere.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn added 3 commits October 18, 2023 19:27
When we allocate a bio to submit to the Linux block layer, we would
divide the total ABD size into 4K "pages" and request the bio be large
enough to add that many segments, up to the maximum number of segments
permitted by the kernel block layer.

However, a single "page" in an ABD does not necessarily correspond to
one segment in a bio, because of how OpenZFS breaks ABD pages up when
adding them to a bio based on the structure of their underlying memory
allocation. These "split pages" are added separately, consuming two
segments in the bio.

In the best case, the bio will simply have too few segments in it,
requiring additional bios to be allocated, which adds a little interrupt
and wait time overhead, and possibly limits further optimisation by the
kernel, but is otherwise fine.

The worst case is much worse. The max segments per bio permitted by the
block layer is usually 256; however, individual device queues are usually
significantly less than this (eg, most SCSI 4K devices present max 128).
If the incoming ABD is extremely large (eg a write aggregation) it can
result in a bio that the block layer will accept but is too big for a
device queue.

When this happens, the block layer will split the bio. It does this by
bisecting the bio's segment array and producing two bios, one for each
side. If we're unlucky, the split point is in the middle of two
segments that together formed a 4K chunk, which results in two bios that
are not properly 4K-aligned. These will be rejected by the SCSI driver:

    Jul 11 17:55:17 A2-U40 kernel: sd 7:0:90:0: [sdcl] tag#8750 request not aligned to the logical block size
    Jul 11 17:55:17 A2-U40 kernel: blk_update_request: I/O error, dev sdcl, sector 16416600664 op 0x1:(WRITE) flags 0x4700 phys_seg 128 prio class 0
    Jul 11 17:55:17 A2-U40 kernel: sd 7:0:90:0: [sdcl] tag#8760 request not aligned to the logical block size
    Jul 11 17:55:17 A2-U40 kernel: blk_update_request: I/O error, dev sdcl, sector 16416602503 op 0x1:(WRITE) flags 0x700 phys_seg 23 prio class 0
    Jul 11 17:55:17 A2-U40 kernel: zio pool=pool-13395907298487379724 vdev=/dev/disk/by-id/wwn-0x5000cca28408a8ac-part1 error=5 type=2 offset=8405291151360 size=1032192 flags=40080c80

The right solution would be to never add pages to a bio that are not
complete and properly aligned, however doing this requires some serious
work in the SPL memory allocator and a rethink of how ABDs are
constructed and iterated over.

Instead, this commit works around the problem by ensuring the bio is
never big enough to need to be split. If we never allocate more segments
than the device maximum, then the bio won't need to be split, and so the
"split pages" will always remain together and the overall bio properly
aligned.

This also adds a tuneable, vdev_disk_max_segs, to allow this value to be
set by the operator. This is very useful for debugging.
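
For example, assuming the tuneable ends up exposed as a zfs module
parameter (the exact parameter name and any zfs_ prefix may differ from
the name used here), it can be lowered while reproducing the issue:

    echo 64 > /sys/module/zfs/parameters/vdev_disk_max_segs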

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

The previous commit prevents the kernel from inadvertently splitting a
bio in the middle of a "split page". However, there are situations where
OpenZFS itself can create the same problem.

If we encounter a split page in the ABD, we need to add two chunks to
the bio, one for each "half" of the split. However, that requires the
bio to have two segments remaining: if there's only one, we must not add
any more to this bio; the caller (`__vdev_disk_physio()`) will then
create a new bio and we can continue to add chunks to it.
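
A minimal sketch of that check (illustrative only; bi_vcnt and
bi_max_vecs are real struct bio fields, while chunk_spans_page is an
assumed local flag meaning "this chunk is the left half of a split page"):

    if (chunk_spans_page && (bio->bi_max_vecs - bio->bi_vcnt) < 2)
        break;  /* no room for both halves; caller starts a new bio */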

A linear ABD actually has multiple pages in a single long run. A "split"
is seen as a short "left" fragment, then a run of proper 4K pages, then
a "right" part. These are all "split", but we don't want to submit two
segments for every 4K chunk. So instead, we track when we see the "left"
part, and its size, and if we get to the end of the bio and we haven't
yet added a matching "right" part of the correct size to fix the
alignment, we compute the right amount to add to close out the bio
properly aligned.

We also make this change for scatter ABDs. Scatter chunks can be split
in the same way, though they are always added one 4K chunk at a time,
never in long runs, so the "left" and "right" parts will always be next
to one another. However, it's theoretically possible for a gang ABD composed of a
linear and scatter ABD to finish the linear part at an "odd" segment
number, and then the scatter part will be adding pages in pairs, and end
up splitting the last page across two bios.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

With the max segments for the bio now generally much smaller, and
varying depending on the device/driver characteristics, the initial
array size of 16 is totally inadequate.

The main reason for getting this right is so we don't have to throw away
all the bios built so far and restart. However there's no particular
reason to do it this way; the bios constructed so far are perfectly
fine; we just ran out of room to track them.

So this commit changes it so if we run out of room in the bio array,
instead of throwing away the contents entirely, we just create a larger
one and copy the existing bios into it, and keep going.

Finally, to decide the right initial size for the bio array, we take a
guess at the likely size by dividing the IO size by the page size, and
then by the max device segments. This is likely right most of the time,
and if it turns
out to not be enough, we simply expand it and keep going.
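
A minimal sketch of the grow-and-copy step (names like dr_bio and
dr_bio_count are illustrative, not the actual vdev_disk structures):

    /*
     * Illustrative: rather than discarding the bios built so far, allocate
     * a larger pointer array, copy the existing entries across, and keep
     * filling from where we left off.
     */
    int new_count = dr_bio_count * 2;
    struct bio **new_bios = kmem_zalloc(new_count * sizeof (struct bio *), KM_SLEEP);
    memcpy(new_bios, dr_bio, dr_bio_count * sizeof (struct bio *));
    kmem_free(dr_bio, dr_bio_count * sizeof (struct bio *));
    dr_bio = new_bios;
    dr_bio_count = new_count;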

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

robn commented Oct 18, 2023

Notes on memory allocation and ABD construction

At its heart, the problem stems from the fact that memory allocations backing ABDs are not always from aligned, contiguous memory ranges, which abd_bio_map_off() then attempts to compensate for by never submitting an iovec that crosses a 4K page boundary.

It appears that most of the time it would actually be safe to submit regions that cross page boundaries, provided they are contiguous. bio_add_page() appears to mean “contiguous region” when it says “page”, and as long as the entire allocation is from one contiguous region, there shouldn’t be a problem.

However, I am not sure about this. The kernel appears to have a lot of different ways to think about a “page”, and it's not always clear to me where the rules lie. It gets harder in later kernels with “page folios” and such. So I'm loath to tackle this without a much clearer understanding.

It's smaller allocations that seem to be the real problem. If I'm understanding it correctly, the SPL allocator may service “small” allocations by carving up larger regions. For example, a 512b allocation and a 4K allocation might be served from a contiguous run, the first allocation at off=0 len=512, the second at off=512 len=4096. It's this second kind of allocation that is the problem here, as abd_bio_map_off() will submit that 4K region as two segments, one from the first page at off=512 len=3584, the other from the second page at off=0 len=512 (this is the type that might be possible to submit as a single “page”).
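
Worked out for that example (a 4096-byte allocation starting at offset 512 within a 4K-aligned region; illustrative arithmetic only):

    first_seg  = PAGE_SIZE - 512;   /* 3584 bytes, up to the page boundary  */
    second_seg = 4096 - first_seg;  /*  512 bytes, on the following page    */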

Linear ABDs appear to always be fine, because they're rarely small enough to use shared slabs like this, instead getting their own large and isolated allocation (I did think for a while that it might be possible in very rare situations for a linear allocation to be stitched together from bits of multiple different slabs, but I can't remember why I thought this. If it is possible, it must be exceedingly rare).

Scatter ABDs appear to always be backed by 4K pages, except in the LINEAR_PAGE case, where a true allocation is pressed into service as backing for a single-element scatter ABD; these can be split as with any other small allocation.

Gang ABDs are simply backed by other ABDs, which are naturally non-contiguous. If the bio is not aligned after the last chunk of a child ABD is consumed, then the next one will begin by being artificially split to try to balance out the entire page, even if the child ABD itself is properly aligned.

  • This is why large aggregated writes (which use gang ABDs) with many small IO allocations (produced by forcing gang blocks, which are unrelated even though they share a name) make it so easy to reproduce the issue.
  • This is why scatter ABDs as gang children can still be affected, because the alignment of the starting position impacts the entire ABD.
  • This is why the problem cannot be reproduced before OpenZFS 2.0.0; gang ABDs did not exist there, and all ABDs were linearised before being taken apart for submission to the bio. This new linear ABD is large enough to always be properly aligned.

This is why, even if we switch to not splitting contiguous memory ranges, we still have a problem if a gang ABD carries an unaligned ending position on one child over as an unaligned offset on the next child.

I looked into all of this and decided that to create a complete fix that I could feel confident in would require a detailed study into Linux memory allocation and page construction in all its versions up to the very latest, then gaining a solid understanding of the SPL allocators, and a rewrite or significant restructure of ABDs, possibly even replacing it with OS-specific constructions (eg scatterlists). That’s miles above my pay grade at the moment, hence this far more modest change to try to sidestep the conditions that lead to the actual problem.


robn commented Oct 18, 2023

Notes on FreeBSD

I did look into what FreeBSD does here to see if it might be affected by the same issue. I didn't look very closely, but at minimum, it scans the ABD up front to see if it looks like a virtually-contiguous region. If it does, it appears to map page pointers directly into the GEOM bio; if not, it linearises the ABD and goes from there.

If there are similar concepts inside GEOM around device segment limits and bio splitting, then FreeBSD might have a similar issue; I did not look that deeply. If there is, however, I suspect any fix is going to be a separate unit of work, so I'm not going to investigate further for this PR.


amotin commented Oct 19, 2023

@robn The FreeBSD block layer has no limits on the number of segments. It has limits only on total I/O size, and each disk driver must be able to handle the worst case of segmentation for a misaligned virtually-contiguous buffer, that is: ((size + PAGE_SIZE - 1) / PAGE_SIZE) + 1 segments. So while the code in vdev_geom is quite aggressive in linearizing the ABDs, it should be completely safe after that. Since the beginning and the end of the virtually-contiguous buffer do not require any alignment, the block layer may split large I/Os on any sector boundary it likes without any problems or complications.
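
Worked through for a misaligned 128K virtually-contiguous buffer with 4K pages (illustrative arithmetic only):

    ((131072 + 4096 - 1) / 4096) + 1 = 32 + 1 = 33 segments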


amotin commented Oct 19, 2023

I haven't looked at your code yet, so sorry if I am wrong, but your explanation blaming the use of small buffers and memory allocation sounds wrong to me. The problem I see is the attempt to use non-virtually-contiguous buffers with segments that are not a multiple of the sector size (ashift). Those segments may appear either as a result of aggregating I/Os that are not a multiple of PAGE_SIZE (even if they are a multiple of the smaller sector size), or because a single I/O itself includes several ABDs that are not a multiple of PAGE_SIZE (which I understand is the case for gang blocks, but I may be wrong and need to look at it closer).

There are still plenty of storage devices that physically cannot handle segments not aligned to PAGE_SIZE, aside from the beginning and the end. Sure, there are some, like very new NVMe and other controllers, that can do arbitrary byte-level scatter/gather, but until those are 100%, we cannot rely on that. I also suppose an IOMMU could handle it, but we cannot rely on that either. I don't know whether Linux does anything clever here to let ZFS understand the segmentation requirements of a specific storage device, but considering this problem is being reported, either it doesn't, or it is not being used properly.

For example, if ZFS tries to write a bunch of random 512-byte blocks, aggregated by the vdev code into one 128KB I/O, while the underlying device can only do page-aligned segments, then something must do the linearization, or there will be an error or data corruption. In the case of FreeBSD I've done that explicitly in vdev_geom -- it is primitive and expensive, but it works correctly. Whether Linux can do better, either by supplying ZFS with more data about the device's segmentation requirements, or by doing the linearization in some clever way like an IOMMU with automatic software fallback, is the main question for me here.


amotin commented Oct 19, 2023

rewrite or significant restructure of ABDs, possibly even replacing it with OS-specific constructions (eg scatterlists)

I don't think much can be done in general from the ABD or allocation side. If the pool ashift is smaller than PAGE_SIZE, then by doing I/O aggregation we by definition get segments not aligned to PAGE_SIZE, whether that is caused by gang blocks, compression, or just small objects/blocks. I took a quick look at the gang code, and the only "crime" I can guess at so far is that it splits a large I/O into ashift-aligned chunks, which would be just fine if not for the aggregation. Am I missing something?

The FreeBSD block layer does not allow non-PAGE_SIZE-aligned segments aside from the first and the last; that partially grows from hardware limitations. So the choice is only to either linearize the buffer, or not aggregate I/Os that are not a multiple of PAGE_SIZE and hope ZFS never otherwise produces gang ABDs with chunks not aligned to PAGE_SIZE except at the beginning and the end. Short sub-page I/Os by themselves are not a problem; it is the aggregation of them that is. But disks may not be very happy to receive a stream of unaggregated 512-byte writes, which is why I preferred linearization. But maybe we could do both, trading between extra disk I/Os and memory copying.

I am really curious whether modern Linux allows doing something more than that.


robn commented Oct 19, 2023

I may have gone in too hard on blaming the memory allocator as such, but it's definitely true that the initial problem comes from abd_bio_map_off() trying not to cross PAGE_SIZE boundaries.

I'm sorry I'm not explaining it very well; it's a complicated (for me) thing that was hard to track down, it's hard to describe in text, and I don't fully understand all of it anyway.

Consider a slab of PAGE_SIZE=4K aligned memory:

    |        |        |        |        |        |

This can and does happen:

    A = kmem_cache_alloc(cache, 3072);
    B = kmem_cache_alloc(cache, 4096);

    |        |        |        |        |        |
    :  A  ::   B   :

If B is backing an ABD (LINEAR or LINEAR_PAGE scatter), bio_map() (via abd_bio_map_off()) will load it into the bio in two segments, split on the PAGE_SIZE boundary:

    [1024][3072]

This alone is ok, because the bio is 4K-aligned. The problem comes if Linux decides to split the bio, and splits between those two segments. Splitting is more likely when the bio becomes large (on Linux SCSI with ashift=12, that's always at 512K, easy to hit with aggregation), and the misaligned split is easier to hit if there are a lot of these "split" pages in the bio. Gang blocks make it even easier to hit, because their small size makes it even more likely that the allocator will produce non-PAGE_SIZE allocations.
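
To make that concrete, a hedged sketch of the boundary-splitting behaviour described above (illustrative, not the actual bio_map()/abd_bio_map_off() code; buf is an assumed char pointer, bufsize and bio assumed locals):

    size_t done = 0;
    while (done < bufsize) {
        char *addr = buf + done;
        /* never let one segment cross a PAGE_SIZE boundary */
        size_t chunk = MIN(bufsize - done, PAGE_SIZE - offset_in_page(addr));
        /* for B above this adds two segments: 1024 bytes, then 3072 bytes */
        if (bio_add_page(bio, virt_to_page(addr), chunk,
            offset_in_page(addr)) != chunk)
            break;  /* bio is full; the caller starts a new bio */
        done += chunk;
    }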

I'm fairly sure my patches work around this, by keeping the bios small enough that Linux will never need to split them.

The reason I didn't just fix it outright (eg by copying the FreeBSD method, which I think is good) is both because I assumed the existing method of not crossing PAGE_SIZE boundaries was necessary for some long-forgotten situation, and because I couldn't confirm by reading Linux code exactly what bio_add_page() means by "page". Linux has so much stuff inside struct page, and there's all kinds of things - contiguous and non-contiguous pages, compound pages, hugepages, and then the recent work in the 6.x series to try and disentangle all of that (including removing PAGE_SIZE as a concept, yow). I also thought that I had some evidence to suggest that kmem_cache_alloc() could return even a large allocation split across "pages", at least in a way that would cause bio_map() to split it.

And thus, "miles above my pay grade" - I feel like there's way too much I need to understand before I could feel confident in the "right" fix, and customer needed something, so here we are.

But that was a few months ago. I think I know my way around vdev_disk well enough these days that I could probably just copy your technique from vdev_geom and see how it goes. As best I can tell, most of the time IO memory is allocated as single contiguous runs, and I think the interface of bio_add_page(bio, page, size, off) suggests it is expecting more than a single page, page-aligned, and it seems clear that it's not expecting pages smaller than the device block size, otherwise it wouldn't be so confident about splitting.

@behlendorf behlendorf self-requested a review October 20, 2023 18:31
@behlendorf behlendorf added the Status: Code Review Needed label Oct 20, 2023
allanjude (Contributor) commented

The case we were seeing was a 1 MB record, being written as a gang block due to fragmentation. The gang header is 512b, but the pool has an ashift of 12.

The max number of segments per bio was limited (64 or 128, IIRC), so after prefixing the abd of 4K-aligned pages with a 512b header (and a 3584-byte footer to keep the I/O aligned), the bio would get split. Since the first item in the vec was not aligned, it would make the entire bio not a multiple of the disk's sector size, and it would get refused by the SCSI driver.


amotin commented Oct 26, 2023

@allanjude I was trying to look at that gang block scenario earlier, but as far as I could see, 512 bytes is only the logical block size for the gang header. The physical size should be padded by the allocation code considering the vdev ashift. And zio_vdev_io_start() should allocate a linear buffer and copy the data, so a vdev with ashift=12 should never see a 512-byte access or ABD.

shodanshok (Contributor) commented

Can this be the root cause for (some of) the issues reported in #10094?


robn commented Nov 13, 2023

Withdrawing this. I'm just finishing testing on a patch that fixes this original issue properly rather than working around it (largely copying the FreeBSD approach: if the ABD is not sensibly aligned, make a copy). I'll post a PR for that within the next couple of days. Thanks all for the feedback!


robn commented Nov 27, 2023

For future readers, #15588 is my "fix it properly".
