vdev_disk: rewrite BIO filling machinery to avoid split pages #15588

robn · 2023-11-27T11:35:31Z

Motivation and Context

This PR tackles a number of issues in the way BIOs (struct bio) are constructed for submission to the Linux block layer.

BIO segment limits are set incorrectly

The kernel has a hard upper limit on the number of pages/segments that can be added to a BIO, as well as a separate limit for each device (related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD (abd_nr_pages_off()), and then uses that as the number of segments to put into the BIO, up to the hard upper limit. If it requires more than the limit, it will create multiple BIOs.

Leaving aside the fact that the page count method is wrong (see below), not limiting to the device segment max means that the device driver will need to split the BIO in half. This alone is not necessarily a problem, but it interacts with another issue to cause a much larger problem.

BIOs are filled inefficiently

The kernel function to add a segment to a BIO (bio_add_page()) takes a struct page pointer, and offset+len within it. struct page can represent a run of contiguous memory pages (known as a "compound page"). It can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO (abd_nr_pages_off(), bio_map() and abd_bio_map_off()) will never consider a page to be more than PAGE_SIZE (4K), even if the struct page is for multiple pages. In this case, it will load the same struct page into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being entirely filled much earlier than it could have been. This also further contributes to the problem caused by the incorrect segment limit calculation, as it's much easier to go past the device limit, and so require a split.
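To put rough numbers on this, here is a minimal userspace sketch (not ZFS code) comparing how many BIO segments a physically contiguous run consumes when it is chopped at every page boundary versus added once as a whole. The names `segs_split_per_page`, `segs_whole_run` and `SKETCH_PAGE_SIZE` are hypothetical, a 4K page size is assumed, and any per-segment size limit the device may impose is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch; real systems may use other sizes. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Segments consumed when a contiguous run of `len` bytes, starting `off`
 * bytes into its first page, is added one page-bounded piece at a time.
 */
static size_t
segs_split_per_page(size_t off, size_t len)
{
	assert(len > 0);
	size_t first = off % SKETCH_PAGE_SIZE;
	/* Round the covered byte range up to whole pages. */
	return ((first + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Segments consumed when the same run is added as a single segment. */
static size_t
segs_whole_run(size_t off, size_t len)
{
	(void) off;
	assert(len > 0);
	return (1);
}
```

Under these assumptions a page-aligned 1M run costs 256 segments when split per page and one when added whole, which is why the per-page approach reaches the segment limit so much sooner.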

Again, this is not a problem on its own.

Incomplete pages are submitted to BIOs

The logic for "never submit more than PAGE_SIZE" is actually a little more subtle: it will never submit a buffer that crosses a 4K page boundary.

In practice, this is fine, as most ABDs are scattered, that is, a list of complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they are frequently not aligned to page boundaries. For example, a 12K allocation can span four pages, eg:

 -- 4K -- -- 4K -- -- 4K -- -- 4K --
|        |        |        |        |
      :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

[1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were filled and needed to be split, each half would still have either a start or end aligned to the logical block size of the device (assuming 4K at least).
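The [1K, 4K, 4K, 3K] layout above falls out of cutting the buffer at every page boundary. A minimal userspace sketch of that rule follows; `split_at_page_boundaries` and `SKETCH_PAGE_SIZE` are hypothetical names, and the 4K page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Split [off, off + len) at every page boundary, writing each piece's
 * length into out[]. Returns the number of pieces, or 0 if out[] is too
 * small to hold them all.
 */
static size_t
split_at_page_boundaries(size_t off, size_t len, size_t *out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (len > 0) {
		/* Bytes left before the next page boundary. */
		size_t room = SKETCH_PAGE_SIZE - (off % SKETCH_PAGE_SIZE);
		size_t take = (len < room) ? len : room;

		if (n == max)
			return (0);
		out[n++] = take;
		off += take;
		len -= take;
	}
	return (n);
}
```

Calling it with an offset of 3K and a length of 12K yields four pieces of 1K, 4K, 4K and 3K, matching the diagram.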

In ideal circumstances, these shortcomings don't cause any particular problems. It's when they start to interact with other ZFS features that things get interesting.

Aggregation

Aggregation will create a "gang" ABD, which is simply a list of other ABDs. Iterating over a gang ABD is just iterating over each ABD within it in turn.

Because the segments are simply loaded in order, we can end up with uneven segments either side of the "gap" between the two ABDs. For example, two 12K ABDs might be aggregated and then loaded as:

[1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up having either a start or end offset that is not aligned to the logical block size, which some drivers (eg SCSI) will reject. However, this tends not to happen because the default aggregation limit usually keeps the BIO small enough to not require more than one split, and most pages are actually full 4K pages, so hitting an uneven gap is very rare anyway.
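As a rough illustration of how exposed the aggregated layout is, the sketch below counts the segment boundaries at which a cut would leave the first part with a length that is not a multiple of the logical block size. This is a simplification: the kernel chooses split points by its own rules and need not cut exactly on a segment boundary. `misaligned_split_points` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the interior segment boundaries at which cutting the BIO would
 * leave the first part's length not a multiple of the logical block
 * size `lbs`.
 */
static size_t
misaligned_split_points(const size_t *segs, size_t nsegs, size_t lbs)
{
	size_t bad = 0, total = 0;

	assert(segs != NULL && lbs > 0);
	for (size_t i = 0; i + 1 < nsegs; i++) {
		total += segs[i];
		if (total % lbs != 0)
			bad++;
	}
	return (bad);
}
```

For the eight segments above and a 4K logical block size, six of the seven interior boundaries are misaligned; with a 512-byte logical block size none are.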

Gang blocks

If the pool is under particular space pressure, then an IO can be broken down into a "gang block", a 512-byte block composed of a header and up to three block pointers. Each points to a fragment of the original write, or in turn, another gang block, breaking the original data up over and over until space can be found in the pool for each of them.

Each gang header is a separate 512-byte memory allocation from a slab, that needs to be written down to disk. When the gang header is added to the BIO, it's a single 512-byte segment.

Aggregation with gang blocks

Pulling all this together, consider a large aggregated write of gang blocks. This results in a BIO containing lots of 512-byte segments. Given our tendency to overfill the BIO, a split is likely, and most possible split points will yield a pair of BIOs that are misaligned. Drivers that care, like the SCSI driver, will reject them.

Description

This commit is a substantial refactor and rewrite of much of vdev_disk to sort all this out.

Configure maximum segment size for device

vdev_bio_max_segs() now returns the ideal maximum size for the device, if available. There's also a tuneable zfs_vdev_disk_max_segs to override this, to assist with testing.
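One way to read that selection is sketched below. This is an assumption about the intent, not the code in this PR: the function name and parameters are hypothetical, a value of zero stands for "not available" or "not set", and the result is always clamped to the kernel's hard per-BIO cap.

```c
#include <assert.h>

/*
 * Pick the segment budget for one BIO: the device's own limit when it is
 * known (nonzero), replaced by the tuneable when that is set (nonzero),
 * and never more than the hard per-BIO cap.
 */
static unsigned int
sketch_max_segs(unsigned int hard_cap, unsigned int dev_max,
    unsigned int tuneable)
{
	assert(hard_cap > 0);
	unsigned int want = (tuneable != 0) ? tuneable : dev_max;

	if (want == 0 || want > hard_cap)
		want = hard_cap;
	return (want);
}
```

Treating the tuneable as a plain override keeps testing simple: a small value forces many BIOs per IO without needing a device that has a small limit.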

ABDs checked up front for page count and alignment

We scan the ABD up front to count the number of pages within it, and to confirm that if we submitted all those pages to one or more BIOs, it could be split at any point without creating a misaligned BIO.

If that wouldn't be possible (as in any of the above situations), the ABD is linearised, and then checked again. This is the same technique used in vdev_geom on FreeBSD, adjusted for Linux's variable page size and allocator quirks.
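The check-then-linearise flow can be sketched in userspace roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `sketch_seg_t`, `sketch_segs_aligned` and `sketch_linearise` are hypothetical names, virtual addresses stand in for page offsets, and `lbs` is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One piece of a scattered buffer (hypothetical type for the sketch). */
typedef struct {
	const void *buf;
	size_t len;
} sketch_seg_t;

/*
 * Nonzero if the pieces can be cut at any piece boundary without leaving
 * a part that starts or ends off a block boundary: every piece except the
 * first must start on one, and every piece except the last must end on one.
 */
static int
sketch_segs_aligned(const sketch_seg_t *segs, size_t n, size_t lbs)
{
	assert(segs != NULL && lbs > 0);
	for (size_t i = 0; i < n; i++) {
		uintptr_t start = (uintptr_t)segs[i].buf;

		if (i > 0 && start % lbs != 0)
			return (0);
		if (i + 1 < n && (start + segs[i].len) % lbs != 0)
			return (0);
	}
	return (1);
}

/*
 * Copy all pieces into one freshly allocated buffer aligned to lbs.
 * The caller frees the result; returns NULL on allocation failure.
 */
static void *
sketch_linearise(const sketch_seg_t *segs, size_t n, size_t lbs,
    size_t *total)
{
	size_t sum = 0, pos = 0;

	assert(segs != NULL && n > 0 && lbs > 0 && total != NULL);
	for (size_t i = 0; i < n; i++)
		sum += segs[i].len;
	/* aligned_alloc wants a size that is a multiple of the alignment. */
	size_t padded = (sum + lbs - 1) / lbs * lbs;
	char *out = aligned_alloc(lbs, padded);

	if (out == NULL)
		return (NULL);
	for (size_t i = 0; i < n; i++) {
		memcpy(out + pos, segs[i].buf, segs[i].len);
		pos += segs[i].len;
	}
	*total = sum;
	return (out);
}
```

A caller would run `sketch_segs_aligned()` first, and only pay for the copy in `sketch_linearise()` when the check fails; the single linear piece that comes back passes the same check by construction.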

In the end, a count of segments is produced, and this is combined with the max seg count to determine how many BIOs will be needed.
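The arithmetic here is a rounded-up division. A minimal sketch, with the hypothetical name `sketch_bio_count`:

```c
#include <assert.h>
#include <stddef.h>

/* BIOs needed to carry nsegs segments at up to max_segs segments each. */
static size_t
sketch_bio_count(size_t nsegs, size_t max_segs)
{
	assert(max_segs > 0);
	return ((nsegs + max_segs - 1) / max_segs);
}
```

For example, 300 segments against a limit of 128 per BIO need three BIOs. Because this number is known before any BIO is allocated, nothing has to be resized or restarted later.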

Virtual block IO object

vbio_t is a cleanup and enhancement of the old dio_request_t. The idea is simply that it can hold all the state needed to create, submit and return multiple BIOs, including all the refcounts, the ABD copy if it was needed, and so on. Apart from what I hope is a clearer interface, the major difference is that because we know how many BIOs we'll need up front, we don't need the old overflow logic that would grow the BIO array, throw away all the old work and restart. We can get it right from the start.

Other cleanups

Lots of cleanup, particularly through vdev_disk_io_start to make it feel a bit more like current ZFS code. Loads of comments too.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Full test suite run passed.

Various customer workloads have been tried (in QA/test environments), including a heavy write workload on 7x raidz3-12 pools, and are not showing any signs of problems, either from within ZFS or from block devices or elsewhere.

(Surprisingly, heavy async write loads are getting ~2.7x throughput, though sync write loads are a more modest ~1.1x. I had expected some improvement, as we're not spending so much time breaking down the ABDs, but I didn't imagine the overheads could be that high).

On stock ZFS the splitting issue is complicated to reproduce, as we need to:

generate lots of large IOs;
composed of small allocations;
so as to raise the possibility that (a) a bio will be split by the kernel (b) at a point that will cause misalignment
with a device driver or other stage below the block layer that actually cares about these misalignments and reports on them.

The last one means the SCSI driver, unless there’s another I don’t know about.

The first three can be demonstrated by forcing all allocations to be gang blocks.

echo 32768 > /sys/module/zfs/parameters/metaslab_force_ganging
echo 100 > /sys/module/zfs/parameters/metaslab_force_ganging_pct

zpool create -o ashift=12 -O recordsize=1M -O compression=off tank raidz1 sda sdb ...

fio --name=gang --directory=/tank --ioengine=pvsync2 --rw=randrw --nrfiles=32 --filesize=1-8M --bsrange=1k-4k --runtime=60 --time_based

Without this PR, this test should yield "request not aligned to the logical block size" errors almost immediately:

Jul 19 00:26:26 ER6-U35 kernel: sd 0:0:10:0: [sdj] tag#3807 request not aligned to the logical block size
Jul 19 00:26:26 ER6-U35 kernel: blk_update_request: I/O error, dev sdj, sector 973441336 op 0x1:(WRITE) flags 0x4700 phys_seg 128 prio class 0
Jul 19 00:26:26 ER6-U35 kernel: sd 0:0:10:0: [sdj] tag#3833 request not aligned to the logical block size
Jul 19 00:26:26 ER6-U35 kernel: blk_update_request: I/O error, dev sdj, sector 973442332 op 0x1:(WRITE) flags 0x700 phys_seg 13 prio class 0
Jul 19 00:26:26 ER6-U35 kernel: zio pool=pool-655033360322456518 vdev=/dev/disk/by-id/wwn-0x5000cca26491428c-part1 error=5type=2 offset=498393575424 size=561152 flags=40080c80

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

robn · 2023-11-29T12:13:51Z

Last push reworks a few things in response to review comments, and other things discovered while trying to answer them:

always load compound pages from the head in one go
check alignment against device logical block size, not page size
only allocate as many segments in a BIO as we intend to fill
use our own BIO count to free them, so we don't have to check if they exist
use ZIO size instead of ABD size, just to be sure we're not doing too much
make int types a bit more uniform
minor cleanups

amotin · 2023-11-29T18:21:58Z

module/os/linux/zfs/vdev_disk.c

+	zio_delay_interrupt(zio);
+
+	/* Finish cleanup */
+	vbio_free(vbio);


If we called it before zio_delay_interrupt(), we would not need an explicit vbio_return_abd() call.

Yeah, my thought was to get the zio into the pipeline at the earliest possible opportunity, but now that I say it out loud it's a bit silly. I'll fix that up.

Tested doing the vbio cleanup before submitting the zio, and there's a small but measurable throughput drop on a couple of test workloads - about 1-2%. They're still ~2.6x over master so I won't be sad about losing them, but still.

I've rewritten a comment to try to make it clearer why it's this way, but if you'd rather the simpler version, I'd be ok with that. It's robn@270b44c.

robn · 2023-12-11T05:32:42Z

I believe this is good to go, pending outcome from test results.

@behlendorf If/when this lands, I'd like to also take it to 2.2, but I'm wary of introducing a change so fundamental on a stable series. I'm thinking about a module load parameter that toggles which implementation is used, defaulting to the old one. Then we could recommend it to folk struggling with LUKS (#15533), or seeing rejected SCSI IO (original case), or just wanting to try it out and maybe test the performance change, without risk. Assuming the implementation is sane, would you be interested in that?

(Incidentally, I am dogfooding this on my daily driver now (on top of 2.2.2). It's a pretty boring laptop workload, but it's not nothing).

behlendorf · 2023-12-11T21:14:09Z

Assuming the implementation is sane, would you be interested in that?

I was thinking along the same lines. Assuming we could do this in a reasonable way, I think it'd even make sense to keep both versions in the master branch for a while: enabled by default in the master branch and disabled by default in 2.2 until we have enough testing with it.

robn · 2023-12-13T04:19:11Z

Alright, see what you think of that.

Changes to previous:

no longer removing abd_bio_map etc
re-lifed the old dio and physio stuff, pulled it into the same part of the file, renamed vdev_classic_* but no implementation changes
- not even in the "compute BIO segments" function, that is, zfs_vdev_disk_max_segs is ignored for classic
added zfs_vdev_disk_classic tuneable; 0 for "new", 1 for "classic", default 0
- changeable at module load time only (runtime is quite easy, but seems bonkers)
- if you set it explicitly, it'll put something on the kernel log as a debugging assist
- doc advises user to send a bug report if they use it

I've done as much testing as I can on both modes. I managed to induce a split (and SCSI rejection) with "classic", and none on "new", as expected. Performance on my sanity check workload remains 2.7x on the new.

sempervictus · 2023-12-15T13:41:15Z

@robn - how safe is this to test on systems where i'd need to restore data if i blow things up? 😄
Far as the page alignment comment in vdev_disk.c - looks like we stumbled on something in the UIO code which tries to map more than one page at a time by miscalculating the size passed to kmap; so think that guarding/testing for mm problems on the target OS/arch is a great idea.

robn · 2023-12-17T04:14:35Z

Top commit adds abd_iterate_page_func, which yields struct page instead of data buffers, handling compound pages as it goes. Since the BIO gets filled with struct pages directly, this obviates the need to map every page along the way.

@behlendorf I'll be interested to know if that changes anything in your test case.

robn · 2023-12-17T04:16:30Z

@sempervictus I think it just works or doesn't - I can't think of any reason why a bug would cause any on-disk corruption. So worst case, you should be able to switch it over to zfs_vdev_disk_classic=1 or just roll back to an earlier version.

amotin

I hope we drop zfs_vdev_disk_classic and the duplication pretty soon. I would not call old code "very stable", considering we know it has issues with #15452.

include/sys/abd_impl.h

robn · 2023-12-18T22:55:39Z

I hope we drop zfs_vdev_disk_classic and the duplication pretty soon. I would not call old code "very stable", considering we know it has issues with #15452.

Maybe "very stable" is too far, but at least it's a known quantity that has served well for a long time. I would hope that the new code can get enough testing on 2.2 that we could remove classic for 2.3.

include/sys/abd_impl.h


This is just setting up for the next couple of commits, which will add a new IO function and a parameter to select it.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit c4a13ba)

This commit tackles a number of issues in the way BIOs (`struct bio`) are constructed for submission to the Linux block layer.

The kernel has a hard upper limit on the number of pages/segments that can be added to a BIO, as well as a separate limit for each device (related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD (`abd_nr_pages_off()`), and then uses that as the number of segments to put into the BIO, up to the hard upper limit. If it requires more than the limit, it will create multiple BIOs.

Leaving aside the fact that the page count method is wrong (see below), not limiting to the device segment max means that the device driver will need to split the BIO in half. This alone is not necessarily a problem, but it interacts with another issue to cause a much larger problem.

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a `struct page` pointer, and offset+len within it. `struct page` can represent a run of contiguous memory pages (known as a "compound page"). It can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO (`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never consider a page to be more than `PAGE_SIZE` (4K), even if the `struct page` is for multiple pages. In this case, it will load the same `struct page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being entirely filled much earlier than it could have been. This also further contributes to the problem caused by the incorrect segment limit calculation, as it's much easier to go past the device limit, and so require a split.

Again, this is not a problem on its own.

The logic for "never submit more than `PAGE_SIZE`" is actually a little more subtle: it will never submit a buffer that crosses a 4K page boundary.

In practice, this is fine, as most ABDs are scattered, that is, a list of complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they are frequently not aligned to page boundaries. For example, a 12K allocation can span four pages, eg:

 -- 4K -- -- 4K -- -- 4K -- -- 4K --
|        |        |        |        |
      :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

[1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were filled and needed to be split, each half would still have either a start or end aligned to the logical block size of the device (assuming 4K at least).

---

In ideal circumstances, these shortcomings don't cause any particular problems. It's when they start to interact with other ZFS features that things get interesting.

Aggregation will create a "gang" ABD, which is simply a list of other ABDs. Iterating over a gang ABD is just iterating over each ABD within it in turn.

Because the segments are simply loaded in order, we can end up with uneven segments either side of the "gap" between the two ABDs. For example, two 12K ABDs might be aggregated and then loaded as:

[1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up having either a start or end offset that is not aligned to the logical block size, which some drivers (eg SCSI) will reject. However, this tends not to happen because the default aggregation limit usually keeps the BIO small enough to not require more than one split, and most pages are actually full 4K pages, so hitting an uneven gap is very rare anyway.

If the pool is under particular space pressure, then an IO can be broken down into a "gang block", a 512-byte block composed of a header and up to three block pointers. Each points to a fragment of the original write, or in turn, another gang block, breaking the original data up over and over until space can be found in the pool for each of them.

Each gang header is a separate 512-byte memory allocation from a slab, that needs to be written down to disk. When the gang header is added to the BIO, it's a single 512-byte segment.

Pulling all this together, consider a large aggregated write of gang blocks. This results in a BIO containing lots of 512-byte segments. Given our tendency to overfill the BIO, a split is likely, and most possible split points will yield a pair of BIOs that are misaligned. Drivers that care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk` to sort all this out.

`vdev_bio_max_segs()` now returns the ideal maximum size for the device, if available. There's also a tuneable `zfs_vdev_disk_max_segs` to override this, to assist with testing.

We scan the ABD up front to count the number of pages within it, and to confirm that if we submitted all those pages to one or more BIOs, it could be split at any point without creating a misaligned BIO. If the pages in the BIO are not usable (as in any of the above situations), the ABD is linearised, and then checked again. This is the same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size and allocator quirks.

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The idea is simply that it can hold all the state needed to create, submit and return multiple BIOs, including all the refcounts, the ABD copy if it was needed, and so on. Apart from what I hope is a clearer interface, the major difference is that because we know how many BIOs we'll need up front, we don't need the old overflow logic that would grow the BIO array, throw away all the old work and restart. We can get it right from the start.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit 06a1960)

This makes the submission method selectable at module load time via the `zfs_vdev_disk_classic` parameter, allowing this change to be backported to 2.2 safely, and disabled in favour of the "classic" submission method if new problems come up.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit df2169d)

Simplifies our code a lot, so we don't have to wait for each and reassemble them.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit 72fd834)

Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages in a compound page were refcounted separately. This means that using the head page without taking a reference to it could see it cleaned up later before we're finished with it. Specifically, bio_add_page() would take a reference, and drop its reference after the bio completion callback returns.

If the zio is executed immediately from the completion callback, this is usually ok, as any data is referenced through the tail page referenced by the ABD, and so becomes "live" that way. If there's a delay in zio execution (high load, error injection), then the head page can be freed, along with any dirty flags or other indicators that the underlying memory is used. Later, when the zio completes and that memory is accessed, it's either unmapped and an unhandled fault takes down the entire system, or it is mapped and we end up messing around in someone else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head page when we use it, and release it when we're done. There's not really a sensible way under our current structure to do this; the "best" would be to keep a list of head page references in the ABD, and release them when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where head and tail pages share refcounts, I've opted to simply not use the compound head in ABD page iteration on those older kernels. This is theoretically less efficient (though cleaning up head page references would add overhead), but it's safe, and we still get the other benefits of not mapping pages before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we can match on to discover this behaviour in configure (and the mm/page APIs have changed a lot since then anyway), so I've gone with a simple version check.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit c6be6ce)

Before 5.4 we have to do a little math.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit df04efe)

The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds abd_iterate_page_func, which yields unmapped struct page instead.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 390b448)

This is just renaming the existing functions we're about to replace and grouping them together to make the next commits easier to follow.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit f3b85d7)

Light reshuffle to make it a bit more linear to read and get rid of a bunch of args that aren't needed in all cases.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 867178a)

This is just setting up for the next couple of commits, which will add a new IO function and a parameter to select it. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit c4a13ba)

This commit tackles a number of issues in the way BIOs (`struct bio`) are constructed for submission to the Linux block layer. The kernel has a hard upper limit on the number of pages/segments that can be added to a BIO, as well as a separate limit for each device (related to its queue depth and other scheduling characteristics). ZFS counts the number of memory pages in the request ABD (`abd_nr_pages_off()`, and then uses that as the number of segments to put into the BIO, up to the hard upper limit. If it requires more than the limit, it will create multiple BIOs. Leaving aside the fact that page count method is wrong (see below), not limiting to the device segment max means that the device driver will need to split the BIO in half. This is alone is not necessarily a problem, but it interacts with another issue to cause a much larger problem. The kernel function to add a segment to a BIO (`bio_add_page()`) takes a `struct page` pointer, and offset+len within it. `struct page` can represent a run of contiguous memory pages (known as a "compound page"). In can be of arbitrary length. The ZFS functions that count ABD pages and load them into the BIO (`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never consider a page to be more than `PAGE_SIZE` (4K), even if the `struct page` is for multiple pages. In this case, it will load the same `struct page` into the BIO multiple times, with the offset adjusted each time. With a sufficiently large ABD, this can easily lead to the BIO being entirely filled much earlier than it could have been. This is also further contributes to the problem caused by the incorrect segment limit calculation, as its much easier to go past the device limit, and so require a split. Again, this is not a problem on its own. The logic for "never submit more than `PAGE_SIZE`" is actually a little more subtle. It will actually never submit a buffer that crosses a 4K page boundary. 
In practice, this is fine, as most ABDs are scattered, that is a list of complete 4K pages, and so are loaded in as such. Linear ABDs are typically allocated from slabs, and for small sizes they are frequently not aligned to page boundaries. For example, a 12K allocation can span four pages, eg: -- 4K -- -- 4K -- -- 4K -- -- 4K -- | | | | | :## ######## ######## ######: [1K, 4K, 4K, 3K] Such an allocation would be loaded into a BIO as you see: [1K, 4K, 4K, 3K] This tends not to be a problem in practice, because even if the BIO were filled and needed to be split, each half would still have either a start or end aligned to the logical block size of the device (assuming 4K at least). --- In ideal circumstances, these shortcomings don't cause any particular problems. Its when they start to interact with other ZFS features that things get interesting. Aggregation will create a "gang" ABD, which is simply a list of other ABDs. Iterating over a gang ABD is just iterating over each ABD within it in turn. Because the segments are simply loaded in order, we can end up with uneven segments either side of the "gap" between the two ABDs. For example, two 12K ABDs might be aggregated and then loaded as: [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K] Should a split occur, each individual BIO can end up either having an start or end offset that is not aligned to the logical block size, which some drivers (eg SCSI) will reject. However, this tends not to happen because the default aggregation limit usually keeps the BIO small enough to not require more than one split, and most pages are actually full 4K pages, so hitting an uneven gap is very rare anyway. If the pool is under particular memory pressure, then an IO can be broken down into a "gang block", a 512-byte block composed of a header and up to three block pointers. 
Each points to a fragment of the original write, or in turn, another gang block, breaking the original data up over and over until space can be found in the pool for each of them. Each gang header is a separate 512-byte memory allocation from a slab, that needs to be written down to disk. When the gang header is added to the BIO, its a single 512-byte segment. Pulling all this together, consider a large aggregated write of gang blocks. This results a BIO containing lots of 512-byte segments. Given our tendency to overfill the BIO, a split is likely, and most possible split points will yield a pair of BIOs that are misaligned. Drivers that care, like the SCSI driver, will reject them. --- This commit is a substantial refactor and rewrite of much of `vdev_disk` to sort all this out. `vdev_bio_max_segs()` now returns the ideal maximum size for the device, if available. There's also a tuneable `zfs_vdev_disk_max_segs` to override this, to assist with testing. We scan the ABD up front to count the number of pages within it, and to confirm that if we submitted all those pages to one or more BIOs, it could be split at any point with creating a misaligned BIO. If the pages in the BIO are not usable (as in any of the above situations), the ABD is linearised, and then checked again. This is the same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size and allocator quirks. `vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The idea is simply that it can hold all the state needed to create, submit and return multiple BIOs, including all the refcounts, the ABD copy if it was needed, and so on. Apart from what I hope is a clearer interface, the major difference is that because we know how many BIOs we'll need up front, we don't need the old overflow logic that would grow the BIO array, throw away all the old work and restart. We can get it right from the start. 
Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit 06a1960)

This makes the submission method selectable at module load time via the `zfs_vdev_disk_classic` parameter, allowing this change to be backported to 2.2 safely, and disabled in favour of the "classic" submission method if new problems come up.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit df2169d)

Simplifies our code a lot, so we don't have to wait for each and reassemble them.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 72fd834)

Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages in a compound page were refcounted separately. This means that using the head page without taking a reference to it could see it cleaned up later before we're finished with it. Specifically, `bio_add_page()` would take a reference, and drop its reference after the bio completion callback returns.

If the zio is executed immediately from the completion callback, this is usually ok, as any data is referenced through the tail page referenced by the ABD, and so becomes "live" that way. If there's a delay in zio execution (high load, error injection), then the head page can be freed, along with any dirty flags or other indicators that the underlying memory is used. Later, when the zio completes and that memory is accessed, it's either unmapped and an unhandled fault takes down the entire system, or it is mapped and we end up messing around in someone else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head page when we use it, and release it when we're done. There's not really a sensible way under our current structure to do this; the "best" would be to keep a list of head page references in the ABD, and release them when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where head and tail pages share refcounts, I've opted to simply not use the compound head in ABD page iteration on those older kernels. This is theoretically less efficient (though cleaning up head page references would add overhead), but it's safe, and we still get the other benefits of not mapping pages before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we can match on to discover this behaviour in configure (and the mm/page APIs have changed a lot since then anyway), so I've gone with a simple version check.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit c6be6ce)
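As an illustration of what "a simple version check" can look like, here is a userspace sketch under stated assumptions; it is not the check from the commit. `SKETCH_KERNEL_VERSION()` is a hypothetical stand-in that packs a version the way the kernel's `KERNEL_VERSION()` macro has traditionally done, and `sketch_use_compound_head()` is an invented name. In kernel code the decision would more plausibly be made at compile time with the preprocessor.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the kernel's KERNEL_VERSION() packing,
 * redefined here so the sketch builds in userspace. Assumes each
 * component fits in 8 bits (sublevel below 256).
 */
#define	SKETCH_KERNEL_VERSION(a, b, c)	(((a) << 16) + ((b) << 8) + (c))

/*
 * Hypothetical helper: use the compound head page only where head and
 * tail pages share a refcount, which the text above places at 4.5+.
 */
static bool
sketch_use_compound_head(unsigned int kver)
{
	return (kver >= SKETCH_KERNEL_VERSION(4, 5, 0));
}
```

On anything older than 4.5 the sketch answers "no", matching the trade-off described above: slightly less efficient iteration in exchange for never touching a head page that might already have been freed.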

Before 5.4 we have to do a little math.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588

The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead.

This adds `abd_iterate_page_func`, which yields unmapped `struct page` instead.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
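The shape of a page-yielding iterator can be sketched in userspace as follows. This is a simplified model, not the OpenZFS implementation: `struct sketch_page`, `struct sketch_chunk`, `sketch_iterate_pages()` and `sketch_count_cb()` are all hypothetical names, and the ABD is reduced to a flat array of (page, offset, length) chunks. The point is that the callback receives a page pointer plus offset and length, so a caller that only counts chunks never needs the data mapped.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct sketch_page {
	int unused;
};

/* One chunk of a buffer: which page, and which bytes within it. */
struct sketch_chunk {
	struct sketch_page *page;
	size_t off;
	size_t len;
};

/* Callback type: return nonzero to stop the iteration early. */
typedef int (*sketch_page_func_t)(struct sketch_page *page, size_t off,
    size_t len, void *priv);

/*
 * Walk the chunks in order, handing each unmapped page to the
 * callback. Returns the first nonzero callback result, or 0.
 */
static int
sketch_iterate_pages(struct sketch_chunk *chunks, size_t nchunks,
    sketch_page_func_t func, void *priv)
{
	for (size_t i = 0; i < nchunks; i++) {
		int ret = func(chunks[i].page, chunks[i].off,
		    chunks[i].len, priv);
		if (ret != 0)
			return (ret);
	}
	return (0);
}

/* Example callback: count chunks without touching the page data. */
static int
sketch_count_cb(struct sketch_page *page, size_t off, size_t len,
    void *priv)
{
	(void) page;
	(void) off;
	(void) len;
	(*(size_t *)priv)++;
	return (0);
}
```

A caller would pass `sketch_count_cb` with a pointer to a `size_t` counter; a callback that adds pages to a BIO would have the same signature and use the page pointer directly.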

This is just renaming the existing functions we're about to replace and grouping them together to make the next commits easier to follow.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588

Light reshuffle to make it a bit more linear to read and get rid of a bunch of args that aren't needed in all cases.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588

This is just setting up for the next couple of commits, which will add a new IO function and a parameter to select it.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588

robn mentioned this pull request Nov 27, 2023

linux: avoid splitting abd pages across bios #15414

Closed

13 tasks

amotin mentioned this pull request Nov 27, 2023

CKSUM and WRITE errors with 2.2.1 stable, when vdevs are atop LUKS #15533

Closed

behlendorf self-requested a review November 27, 2023 19:59

amotin reviewed Nov 27, 2023

behlendorf added the "Status: Code Review Needed" label Nov 28, 2023

robn force-pushed the vdev-disk-refactor branch from fbd8f55 to 12c58b4 Compare November 29, 2023 12:02

amotin reviewed Nov 29, 2023

robn force-pushed the vdev-disk-refactor branch 2 times, most recently from a58a95c to e4156eb Compare November 30, 2023 01:05

behlendorf reviewed Nov 30, 2023

module/os/linux/zfs/vdev_disk.c Outdated

module/os/linux/zfs/vdev_disk.c Outdated

robn force-pushed the vdev-disk-refactor branch from e4156eb to f27dc07 Compare November 30, 2023 01:15

codyps mentioned this pull request Dec 5, 2023

pool with ashift 12 on luks2 devices with sector size 4k causes repeated io errors #14533

Open

robn mentioned this pull request Dec 8, 2023

CKSUM and WRITE errors when receiving snapshots or scrubbing (2.2.4, LUKS) #15646

Open

rincebrain mentioned this pull request Dec 9, 2023

Data corruption after TRIM #14513

Open

robn force-pushed the vdev-disk-refactor branch 2 times, most recently from a2a44ad to b5e6138 Compare December 11, 2023 01:44

robn force-pushed the vdev-disk-refactor branch from b5e6138 to c22ae5b Compare December 13, 2023 04:08

amotin reviewed Dec 18, 2023

include/sys/abd_impl.h Outdated

robn force-pushed the vdev-disk-refactor branch from ac214be to d298fc4 Compare December 18, 2023 22:50

amotin reviewed Dec 19, 2023

robn force-pushed the vdev-disk-refactor branch 2 times, most recently from ecd9b3b to f058215 Compare December 20, 2023 01:31

robn mentioned this pull request Mar 27, 2024

[2.2] vdev_disk: rewrite BIO filling machinery to avoid split pages #16032

Merged

13 tasks

This was referenced Aug 12, 2024

OpenZFS for Linux interaction problem with NCQ - potential data loss #15270

Closed

the return of "Unaligned write command" errors #10094

Open

behlendorf mentioned this pull request Oct 15, 2024

Today's update to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a (almost identical to current 2.3.0 RC) caused permanent pool corruption #16631

Open

robn mentioned this pull request Oct 25, 2024

vdev_disk: try harder to ensure IO alignment rules #16687

Closed

13 tasks
