Improve ZVOL sync write performance by using a taskq #10163
Conversation
== Summary ==

Prior to this change, sync writes to a zvol are processed serially. This commit makes zvols process concurrently-outstanding sync writes in parallel, similar to how reads and async writes are already handled. The result is that the throughput of sync writes is tripled.

== Background ==

When a write comes in for a zvol (e.g. over iSCSI), it is processed by calling `zvol_request()` to initiate the operation. ZFS is expected to later call `BIO_END_IO()` when the operation completes (possibly from a different thread). There are a limited number of threads that are available to call `zvol_request()` - one per iSCSI client (unless using MC/S). Therefore, to ensure good performance, the latency of `zvol_request()` is important, so that many i/o operations to the zvol can be processed concurrently. In other words, if the client has multiple outstanding requests to the zvol, the zvol should have multiple outstanding requests to the storage hardware (i.e. issue multiple concurrent `zio_t`'s).

For reads and async writes (i.e. writes which can be acknowledged before the data reaches stable storage), `zvol_request()` achieves low latency by dispatching the bulk of the work (including waiting for i/o to disk) to a taskq. The taskq callback (`zvol_read()` or `zvol_write()`) blocks while waiting for the i/o to disk to complete. The `zvol_taskq` has 32 threads (by default), so we can have up to 32 concurrent i/os to disk in service of requests to zvols.

However, for sync writes (i.e. writes which must be persisted to stable storage before they can be acknowledged, by calling `zil_commit()`), `zvol_request()` does not use `zvol_taskq`. Instead it blocks while waiting for the ZIL write to disk to complete. This has the effect of serializing sync writes to each zvol. In other words, each zvol will only process one sync write at a time, waiting for it to be written to the ZIL before accepting the next request. The same issue applies to FLUSH operations, for which `zvol_request()` calls `zil_commit()` directly.

== Description of change ==

This commit changes `zvol_request()` to use `taskq_dispatch_ent(zvol_taskq)` for sync writes and FLUSH operations. Therefore we can have up to 32 threads (the taskq threads) simultaneously calling `zil_commit()`, for a theoretical performance improvement of up to 32x.

To avoid the locking issue described in the comment (which this commit removes), we acquire the rangelock from the taskq callback (e.g. `zvol_write()`) rather than from `zvol_request()`. This applies to all writes (sync and async), reads, and discard operations. This means that multiple simultaneously-outstanding i/o's which access the same block can complete in any order. This was previously thought to be incorrect, but a review of the block device interface requirements revealed that this is fine - the order is inherently not defined. The shorter hold time of the rangelock should also have a slight performance improvement.

For an additional slight performance improvement, we use `taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a `kmem_alloc()` and eliminates a failure mode. This applies to all writes (sync and async), reads, and discard operations.

== Performance results ==

We used a zvol as an iSCSI target (server) for a Windows initiator (client), with a single connection (the default - i.e. not MC/S). We used `diskspd` to generate a workload with 4 threads, doing 1MB writes to random offsets in the zvol. Without this change we get 231MB/s, and with the change we get 728MB/s, which is 3.15x the original performance. We ran a real-world workload, restoring a MSSQL database, and saw throughput 2.5x the original. We saw more modest performance wins (typically 1.5x-2x) when using MC/S with 4 connections, and with different numbers of client threads (1, 8, 32).

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Looks great. Thanks for fixing this, Matt.
Codecov Report
@@ Coverage Diff @@
## master #10163 +/- ##
==========================================
- Coverage 79.33% 79.16% -0.18%
==========================================
Files 385 385
Lines 122425 122429 +4
==========================================
- Hits 97127 96920 -207
- Misses 25298 25509 +211
Continue to review full report at Codecov.
@filip-paczynski #3720, #5824 and #6477 are all in 0.8.3 (the most recent of these was from 2017). So you could backport just this commit to 0.8.3.
@ahrens did you test the zvol > iscsi on a separate host, or did you run ZFS inside Hyper-V?
Sorry if I missed it, but has this PR been included in any release plan? I hope it has, since this is a long-awaited improvement.
This PR is in OpenZFS 2.0 (all RCs). It isn't part of 0.8.x.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10163
Summary
Prior to this change, sync writes to a zvol are processed serially.
This commit makes zvols process concurrently-outstanding sync writes in
parallel, similar to how reads and async writes are already handled.
The result is that the throughput of sync writes is tripled.
Motivation and Context
When a write comes in for a zvol (e.g. over iSCSI), it is processed by
calling `zvol_request()` to initiate the operation. ZFS is expected to
later call `BIO_END_IO()` when the operation completes (possibly from a
different thread). There are a limited number of threads that are
available to call `zvol_request()` - one per iSCSI client (unless using
MC/S). Therefore, to ensure good performance, the latency of
`zvol_request()` is important, so that many i/o operations to the zvol
can be processed concurrently. In other words, if the client has
multiple outstanding requests to the zvol, the zvol should have multiple
outstanding requests to the storage hardware (i.e. issue multiple
concurrent `zio_t`'s).
For reads and async writes (i.e. writes which can be acknowledged
before the data reaches stable storage), `zvol_request()` achieves low
latency by dispatching the bulk of the work (including waiting for i/o
to disk) to a taskq. The taskq callback (`zvol_read()` or
`zvol_write()`) blocks while waiting for the i/o to disk to complete.
The `zvol_taskq` has 32 threads (by default), so we can have up to 32
concurrent i/os to disk in service of requests to zvols.
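As a rough sketch of that existing dispatch pattern (simplified and not the literal code in `zvol_os.c`; the `zv_request_t` layout and helper names here are illustrative):

```c
/*
 * Sketch of the pre-existing pattern for reads and async writes: the
 * block-layer entry point packages up the bio and hands it to zvol_taskq;
 * the taskq callback does the (possibly blocking) work and then completes
 * the bio with BIO_END_IO().
 */
typedef struct zv_request {
    zvol_state_t *zv;   /* the zvol being accessed */
    struct bio *bio;    /* the block-layer request */
} zv_request_t;

static void
zvol_read_task(void *arg)
{
    zv_request_t *zvr = arg;

    /* ... read from the DMU, blocking while the disk i/o completes ... */

    BIO_END_IO(zvr->bio, 0);    /* acknowledge from the taskq thread */
    kmem_free(zvr, sizeof (*zvr));
}

static void
zvol_dispatch_read(zvol_state_t *zv, struct bio *bio)
{
    zv_request_t *zvr = kmem_alloc(sizeof (*zvr), KM_SLEEP);

    zvr->zv = zv;
    zvr->bio = bio;
    if (taskq_dispatch(zvol_taskq, zvol_read_task, zvr, TQ_SLEEP) ==
        TASKQID_INVALID) {
        /* taskq_dispatch() allocates internally, so it can fail. */
        BIO_END_IO(bio, -ENOMEM);
        kmem_free(zvr, sizeof (*zvr));
    }
}
```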
However, for sync writes (i.e. writes which must be persisted to stable
storage before they can be acknowledged, by calling `zil_commit()`),
`zvol_request()` does not use `zvol_taskq`. Instead it blocks while
waiting for the ZIL write to disk to complete. This has the effect of
serializing sync writes to each zvol. In other words, each zvol will
only process one sync write at a time, waiting for it to be written to
the ZIL before accepting the next request.
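For contrast, a heavily simplified sketch of that old sync-write path (not the literal code; the function name is illustrative):

```c
/*
 * Old behavior (sketch): zvol_request() logged the write and then blocked
 * in zil_commit() itself, so the single submitting thread could only have
 * one sync write in flight per zvol at a time.
 */
static void
zvol_request_sync_write_old(zvol_state_t *zv, struct bio *bio)
{
    /* ... copy the data into the DMU and log it in the ZIL ... */

    /* Blocks the submitting thread until the ZIL write is on disk. */
    zil_commit(zv->zv_zilog, ZVOL_OBJ);

    BIO_END_IO(bio, 0);
}
```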
The same issue applies to FLUSH operations, for which `zvol_request()`
calls `zil_commit()` directly.
Description
This commit changes `zvol_request()` to use
`taskq_dispatch_ent(zvol_taskq)` for sync writes and FLUSH operations.
Therefore we can have up to 32 threads (the taskq threads)
simultaneously calling `zil_commit()`, for a theoretical performance
improvement of up to 32x.
To avoid the locking issue described in the comment (which this commit
removes), we acquire the rangelock from the taskq callback (e.g.
`zvol_write()`) rather than from `zvol_request()`. This applies to all
writes (sync and async), reads, and discard operations. This means that
multiple simultaneously-outstanding i/o's which access the same block
can complete in any order. This was previously thought to be incorrect,
but a review of the block device interface requirements revealed that
this is fine - the order is inherently not defined. The shorter hold
time of the rangelock should also have a slight performance improvement.
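Sketched out, the rangelock is now held only inside the taskq callback, around the actual update (using the `zfs_rangelock_*` API; the `zv_rangelock` field name and the helper shown here are illustrative and may differ from the real code):

```c
/*
 * Sketch: acquire the range lock in the taskq callback, just around the
 * DMU update, instead of in zvol_request() around the whole dispatch.
 */
static void
zvol_write_locked_range(zvol_state_t *zv, uint64_t off, uint64_t len)
{
    zfs_locked_range_t *lr;

    lr = zfs_rangelock_enter(&zv->zv_rangelock, off, len, RL_WRITER);

    /* ... copy the data for [off, off + len) into the DMU ... */

    zfs_rangelock_exit(lr);
}
```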
For an additional slight performance improvement, we use
`taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a
`kmem_alloc()` and eliminates a failure mode. This applies to all
writes (sync and async), reads, and discard operations.
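The difference between the two dispatch calls, roughly (again with an illustrative request structure; assume a `taskq_ent_t ent;` field has been added to the `zv_request_t` from the earlier sketch):

```c
/*
 * taskq_dispatch() allocates a taskq entry internally and can therefore
 * fail; taskq_dispatch_ent() uses an entry the caller provides.  Here the
 * entry is embedded in the zv_request_t we were allocating anyway.
 */
static void
zvol_dispatch_write(zvol_state_t *zv, struct bio *bio)
{
    zv_request_t *zvr = kmem_alloc(sizeof (*zvr), KM_SLEEP);

    zvr->zv = zv;
    zvr->bio = bio;
    taskq_init_ent(&zvr->ent);
    /* No internal allocation, so no failure path to handle. */
    taskq_dispatch_ent(zvol_taskq, zvol_write_task, zvr, 0, &zvr->ent);
}
```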
How Has This Been Tested?
Performance results
We used a zvol as an iSCSI target (server) for a Windows initiator
(client), with a single connection (the default - i.e. not MC/S).
We used `diskspd` to generate a workload with 4 threads, doing 1MB
writes to random offsets in the zvol. Without this change we get
231MB/s, and with the change we get 728MB/s, which is >3x the original
performance.
With 32 threads doing 32K writes, we go from 113MB/s to 610MB/s, >5x!
With 64 threads doing 8K writes, we go from 44MB/s to 411MB/s, >9x!
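For reference, a `diskspd` invocation along these lines would generate the 4-thread, 1MB random-write workload described above (the exact flags and target were not recorded in this PR, so treat this as a hypothetical reconstruction):

```
diskspd.exe -b1M -t4 -o1 -r -w100 -d60 -Sh -L #1
```

Here `-b1M` sets the block size, `-t4` the thread count, `-o1` one outstanding i/o per thread, `-r` random offsets, `-w100` 100% writes, `-d60` a 60-second run, `-Sh` disables software and hardware buffering, `-L` collects latency statistics, and `#1` targets physical drive 1 (the iSCSI-attached zvol).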
We ran a real-world workload, restoring a MSSQL database, and saw
throughput 2.5x the original.
We saw more modest performance wins (typically 1.5x-2x) when using MC/S
with 4 connections, and with different numbers of client threads (1, 8,
32).