Improve ZVOL sync write performance by using a taskq #10163
Conversation
== Summary ==

Prior to this change, sync writes to a zvol are processed serially. This commit makes zvols process concurrently-outstanding sync writes in parallel, similar to how reads and async writes are already handled. The result is that the throughput of sync writes is tripled.

== Background ==

When a write comes in for a zvol (e.g. over iSCSI), it is processed by calling `zvol_request()` to initiate the operation. ZFS is expected to later call `BIO_END_IO()` when the operation completes (possibly from a different thread). There are a limited number of threads that are available to call `zvol_request()` - one per iSCSI client (unless using MC/S). Therefore, to ensure good performance, the latency of `zvol_request()` is important, so that many i/o operations to the zvol can be processed concurrently. In other words, if the client has multiple outstanding requests to the zvol, the zvol should have multiple outstanding requests to the storage hardware (i.e. issue multiple concurrent `zio_t`'s).

For reads and async writes (i.e. writes which can be acknowledged before the data reaches stable storage), `zvol_request()` achieves low latency by dispatching the bulk of the work (including waiting for i/o to disk) to a taskq. The taskq callback (`zvol_read()` or `zvol_write()`) blocks while waiting for the i/o to disk to complete. The `zvol_taskq` has 32 threads (by default), so we can have up to 32 concurrent i/os to disk in service of requests to zvols.

However, for sync writes (i.e. writes which must be persisted to stable storage before they can be acknowledged, by calling `zil_commit()`), `zvol_request()` does not use `zvol_taskq`. Instead it blocks while waiting for the ZIL write to disk to complete. This has the effect of serializing sync writes to each zvol. In other words, each zvol will only process one sync write at a time, waiting for it to be written to the ZIL before accepting the next request. The same issue applies to FLUSH operations, for which `zvol_request()` calls `zil_commit()` directly.

== Description of change ==

This commit changes `zvol_request()` to use `taskq_dispatch_ent(zvol_taskq)` for sync writes and FLUSH operations. Therefore we can have up to 32 threads (the taskq threads) simultaneously calling `zil_commit()`, for a theoretical performance improvement of up to 32x.

To avoid the locking issue described in the comment (which this commit removes), we acquire the rangelock from the taskq callback (e.g. `zvol_write()`) rather than from `zvol_request()`. This applies to all writes (sync and async), reads, and discard operations. This means that multiple simultaneously-outstanding i/o's which access the same block can complete in any order. This was previously thought to be incorrect, but a review of the block device interface requirements revealed that this is fine - the order is inherently not defined. The shorter hold time of the rangelock should also have a slight performance improvement.

For an additional slight performance improvement, we use `taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a `kmem_alloc()` and eliminates a failure mode. This applies to all writes (sync and async), reads, and discard operations.

== Performance results ==

We used a zvol as an iSCSI target (server) for a Windows initiator (client), with a single connection (the default - i.e. not MC/S). We used `diskspd` to generate a workload with 4 threads, doing 1MB writes to random offsets in the zvol. Without this change we get 231MB/s, and with the change we get 728MB/s, which is 3.15x the original performance. We ran a real-world workload, restoring a MSSQL database, and saw throughput 2.5x the original. We saw more modest performance wins (typically 1.5x-2x) when using MC/S with 4 connections, and with different numbers of client threads (1, 8, 32).

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Looks great. Thanks for fixing this, Matt.
Codecov Report
@@ Coverage Diff @@
## master #10163 +/- ##
==========================================
- Coverage 79.33% 79.16% -0.18%
==========================================
Files 385 385
Lines 122425 122429 +4
==========================================
- Hits 97127 96920 -207
- Misses 25298 25509 +211
Continue to review full report at Codecov.
@filip-paczynski #3720, #5824 and #6477 are all in 0.8.3 (the most recent of these was from 2017). So you could backport just this commit to 0.8.3.
@ahrens did you test the zvol > iscsi on a separate host, or did you run ZFS inside Hyper-V?
Sorry if I missed it, but has this PR been included in any release plan? I hope it has, since this is a long-awaited improvement.
This PR is in OpenZFS 2.0 (all RCs). It isn't part of 0.8.x.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10163
Summary
Prior to this change, sync writes to a zvol are processed serially.
This commit makes zvols process concurrently-outstanding sync writes in
parallel, similar to how reads and async writes are already handled.
The result is that the throughput of sync writes is tripled.
Motivation and Context
When a write comes in for a zvol (e.g. over iSCSI), it is processed by
calling `zvol_request()` to initiate the operation. ZFS is expected to
later call `BIO_END_IO()` when the operation completes (possibly from a
different thread). There are a limited number of threads that are
available to call `zvol_request()` - one per iSCSI client (unless using
MC/S). Therefore, to ensure good performance, the latency of
`zvol_request()` is important, so that many i/o operations to the zvol
can be processed concurrently. In other words, if the client has
multiple outstanding requests to the zvol, the zvol should have multiple
outstanding requests to the storage hardware (i.e. issue multiple
concurrent `zio_t`'s).
For reads and async writes (i.e. writes which can be acknowledged
before the data reaches stable storage), `zvol_request()` achieves low
latency by dispatching the bulk of the work (including waiting for i/o
to disk) to a taskq. The taskq callback (`zvol_read()` or
`zvol_write()`) blocks while waiting for the i/o to disk to complete.
The `zvol_taskq` has 32 threads (by default), so we can have up to 32
concurrent i/os to disk in service of requests to zvols.
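As a rough sketch of that existing dispatch pattern (simplified and not the literal code in `zvol_os.c`; the `zv_request_t` layout and helper names here are illustrative):

```c
/*
 * Sketch of the pre-existing pattern for reads and async writes: the
 * block-layer entry point packages up the bio and hands it to zvol_taskq;
 * the taskq callback does the (possibly blocking) work and then completes
 * the bio with BIO_END_IO().
 */
typedef struct zv_request {
    zvol_state_t *zv;   /* the zvol being accessed */
    struct bio *bio;    /* the block-layer request */
} zv_request_t;

static void
zvol_read_task(void *arg)
{
    zv_request_t *zvr = arg;

    /* ... read from the DMU, blocking while the disk i/o completes ... */

    BIO_END_IO(zvr->bio, 0);    /* acknowledge from the taskq thread */
    kmem_free(zvr, sizeof (*zvr));
}

static void
zvol_dispatch_read(zvol_state_t *zv, struct bio *bio)
{
    zv_request_t *zvr = kmem_alloc(sizeof (*zvr), KM_SLEEP);

    zvr->zv = zv;
    zvr->bio = bio;
    if (taskq_dispatch(zvol_taskq, zvol_read_task, zvr, TQ_SLEEP) ==
        TASKQID_INVALID) {
        /* taskq_dispatch() allocates internally, so it can fail. */
        BIO_END_IO(bio, -ENOMEM);
        kmem_free(zvr, sizeof (*zvr));
    }
}
```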
However, for sync writes (i.e. writes which must be persisted to stable
storage before they can be acknowledged, by calling `zil_commit()`),
`zvol_request()` does not use `zvol_taskq`. Instead it blocks while
waiting for the ZIL write to disk to complete. This has the effect of
serializing sync writes to each zvol. In other words, each zvol will
only process one sync write at a time, waiting for it to be written to
the ZIL before accepting the next request.
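For contrast, a heavily simplified sketch of that old sync-write path (not the literal code; the function name is illustrative):

```c
/*
 * Old behavior (sketch): zvol_request() logged the write and then blocked
 * in zil_commit() itself, so the single submitting thread could only have
 * one sync write in flight per zvol at a time.
 */
static void
zvol_request_sync_write_old(zvol_state_t *zv, struct bio *bio)
{
    /* ... copy the data into the DMU and log it in the ZIL ... */

    /* Blocks the submitting thread until the ZIL write is on disk. */
    zil_commit(zv->zv_zilog, ZVOL_OBJ);

    BIO_END_IO(bio, 0);
}
```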
The same issue applies to FLUSH operations, for which `zvol_request()`
calls `zil_commit()` directly.
Description
This commit changes `zvol_request()` to use
`taskq_dispatch_ent(zvol_taskq)` for sync writes and FLUSH operations.
Therefore we can have up to 32 threads (the taskq threads)
simultaneously calling `zil_commit()`, for a theoretical performance
improvement of up to 32x.
To avoid the locking issue described in the comment (which this commit
removes), we acquire the rangelock from the taskq callback (e.g.
`zvol_write()`) rather than from `zvol_request()`. This applies to all
writes (sync and async), reads, and discard operations. This means that
multiple simultaneously-outstanding i/o's which access the same block
can complete in any order. This was previously thought to be incorrect,
but a review of the block device interface requirements revealed that
this is fine - the order is inherently not defined. The shorter hold
time of the rangelock should also have a slight performance improvement.
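Sketched out, the rangelock is now held only inside the taskq callback, around the actual update (using the `zfs_rangelock_*` API; the `zv_rangelock` field name and the helper shown here are illustrative and may differ from the real code):

```c
/*
 * Sketch: acquire the range lock in the taskq callback, just around the
 * DMU update, instead of in zvol_request() around the whole dispatch.
 */
static void
zvol_write_locked_range(zvol_state_t *zv, uint64_t off, uint64_t len)
{
    zfs_locked_range_t *lr;

    lr = zfs_rangelock_enter(&zv->zv_rangelock, off, len, RL_WRITER);

    /* ... copy the data for [off, off + len) into the DMU ... */

    zfs_rangelock_exit(lr);
}
```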
For an additional slight performance improvement, we use
`taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a
`kmem_alloc()` and eliminates a failure mode. This applies to all
writes (sync and async), reads, and discard operations.
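The difference between the two dispatch calls, roughly (again with an illustrative request structure; assume a `taskq_ent_t ent;` field has been added to the `zv_request_t` from the earlier sketch):

```c
/*
 * taskq_dispatch() allocates a taskq entry internally and can therefore
 * fail; taskq_dispatch_ent() uses an entry the caller provides.  Here the
 * entry is embedded in the zv_request_t we were allocating anyway.
 */
static void
zvol_dispatch_write(zvol_state_t *zv, struct bio *bio)
{
    zv_request_t *zvr = kmem_alloc(sizeof (*zvr), KM_SLEEP);

    zvr->zv = zv;
    zvr->bio = bio;
    taskq_init_ent(&zvr->ent);
    /* No internal allocation, so no failure path to handle. */
    taskq_dispatch_ent(zvol_taskq, zvol_write_task, zvr, 0, &zvr->ent);
}
```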
How Has This Been Tested?
Performance results
We used a zvol as an iSCSI target (server) for a Windows initiator
(client), with a single connection (the default - i.e. not MC/S).
We used `diskspd` to generate a workload with 4 threads, doing 1MB
writes to random offsets in the zvol. Without this change we get
231MB/s, and with the change we get 728MB/s, which is >3x the original
performance.
With 32 threads doing 32K writes, we go from 113MB/s to 610MB/s, >5x!
With 64 threads doing 8K writes, we go from 44MB/s to 411MB/s, >9x!
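For reference, a `diskspd` invocation along these lines would generate the 4-thread, 1MB random-write workload described above (the exact flags and target were not recorded in this PR, so treat this as a hypothetical reconstruction):

```
diskspd.exe -b1M -t4 -o1 -r -w100 -d60 -Sh -L #1
```

Here `-b1M` sets the block size, `-t4` the thread count, `-o1` one outstanding i/o per thread, `-r` random offsets, `-w100` 100% writes, `-d60` a 60-second run, `-Sh` disables software and hardware buffering, `-L` collects latency statistics, and `#1` targets physical drive 1 (the iSCSI-attached zvol).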
We ran a real-world workload, restoring a MSSQL database, and saw
throughput 2.5x the original.
We saw more modest performance wins (typically 1.5x-2x) when using MC/S
with 4 connections, and with different numbers of client threads (1, 8,
32).