
zvol: use multiple taskqs #15992

Merged
merged 1 commit into openzfs:master on Apr 4, 2024

Conversation

@ixhamza (Contributor) commented Mar 13, 2024

Motivation and Context

The current implementation of zvol uses a single taskq, leading to lock contention under heavy load and consequently decreased throughput. Introducing multiple taskqs and implementing a switch based on IO offset can alleviate this lock contention, thus improving overall throughput.

Description

Currently, zvol uses a single taskq, resulting in a throughput bottleneck under heavy load due to lock contention on that single taskq. This patch addresses the bottleneck by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scales dynamically based on the available CPUs in the system, as illustrated in the table below (a sketch of a heuristic that reproduces these numbers follows the table):

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32       32
2       1       32       32
4       1       32       32
8       2       16       32
16      3       11       33
32      5       7        35
64      8       8        64
128     11      12       132
256     16      16       256
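
For reference, below is a minimal user-space sketch of a scaling heuristic that reproduces the table above; the variable names (num_taskqs, per_taskq_threads, total_threads) are illustrative and not necessarily those used in the patch.

#include <stdio.h>

/*
 * Illustrative sketch only: computes taskq/thread counts matching the
 * table above. Names are hypothetical, not taken from the patch.
 */
int
main(void)
{
        unsigned int cpus[] = { 1, 2, 4, 8, 16, 32, 64, 128, 256 };

        for (unsigned int i = 0; i < sizeof (cpus) / sizeof (cpus[0]); i++) {
                unsigned int ncpus = cpus[i];
                /* Use at least 32 worker threads in total. */
                unsigned int total_threads = ncpus > 32 ? ncpus : 32;

                /* Prefer roughly one taskq per 6 CPUs ... */
                unsigned int num_taskqs = 1 + ncpus / 6;
                /* ... but no more taskqs than threads per taskq. */
                while (num_taskqs * num_taskqs > total_threads)
                        num_taskqs--;

                /* Spread the threads across the taskqs, rounding up. */
                unsigned int per_taskq_threads =
                    (total_threads + num_taskqs - 1) / num_taskqs;

                printf("%3u cpus -> %2u taskqs x %2u threads = %3u total\n",
                    ncpus, num_taskqs, per_taskq_threads,
                    num_taskqs * per_taskq_threads);
        }
        return (0);
}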

How Has This Been Tested?

  • Local zfs-tests and buildbot tests.
  • Fio performance testing with increasing fio threads.
Testing Environment:
- System: AMD EPYC 7543 32-Core Processor, 128 GB DRAM
- Drives/Pool Configuration: 6x Kioxia NVMe SSD KCM6DVUL1T60 (4 Stripe Data, 1 SLOG, and 1 L2CACHE)
- OS: Debian GNU/Linux 12 (bookworm), kernel 6.6.16
- ARC Size: 128G
- ADS: 128G
- IO Size / Volblock Size: 128K
- Direct IO: 1

FIO tests are conducted using the above configuration, with all jobs manually configured with different offsets and sizes so that they do not overlap. The benchmarks for sequential reads are provided below for reference. In the graph, the blue line represents the performance of the current single-taskq implementation, while the other lines depict the performance after switching to multiple taskqs with different offsets. As the number of FIO jobs increases (e.g., at 32 jobs), the performance with multiple taskqs is nearly double that of the current single-taskq implementation.

seq_read_taskq_benchmark

  • Flame Graph with 32 fio jobs prior to changes:
    sread-fg-multi-tq

  • Flame Graph with 32 fio jobs with changes applied:
    sread-fg-single-tq

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@@ -532,6 +543,17 @@ zvol_request_impl(zvol_state_t *zv, struct bio *bio, struct request *rq,
}

zv_request_task_t *task;
zv_taskq_t *ztqs = &zvol_taskqs;
int blk_mq_hw_queue = 0;

Member:

Mostly cosmetic, but as I see, rq->mq_hctx->queue_num is unsigned.

@amotin (Member) left a comment:

It seems zvol_blk_mq_actual_threads is used only to set zso->tag_set.nr_hw_queues. Would it be better to report there ztqs->tqs_cnt instead and just remove zvol_blk_mq_threads and zvol_blk_mq_actual_threads? Now the value looks pretty arbitrary.

@tonyhutter (Contributor) commented:

@ixhamza this is a great optimization! Did you happen to try one taskq per CPU (and making the taskq a per-cpu variable)? If so, did you try hashing the taskq to the same CPU as the bio/request? I'm wondering if that's going to be better for the blk_mq path, but possibly worse for the non-multithreaded bio path.

> It seems zvol_blk_mq_actual_threads is used only to set zso->tag_set.nr_hw_queues. Would it be better to report there ztqs->tqs_cnt instead and just remove zvol_blk_mq_threads and zvol_blk_mq_actual_threads? Now the value looks pretty arbitrary.

That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it. You may also want to make ztqs->tqs_cnt its own separate module param, with it defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.
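
For reference, the per-CPU variant being asked about could look roughly like the sketch below. This is hypothetical and not what this PR implements; the zv_taskq_t field names are assumed for illustration.

/*
 * Hypothetical sketch: pick a taskq based on the submitting CPU rather
 * than the I/O offset. Not the approach taken by this patch.
 */
static taskq_t *
zvol_select_taskq_by_cpu(zv_taskq_t *ztqs)
{
        /* Map the current CPU onto one of the available taskqs. */
        unsigned int cpu = raw_smp_processor_id();

        return (ztqs->tqs_taskq[cpu % ztqs->tqs_cnt]);
}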

@ixhamza (Contributor, Author) commented Mar 13, 2024

@tonyhutter - thank you for your feedback.

> Did you happen to try one taskq per CPU (and making the taskq a per-cpu variable)? If so, did you try hashing the taskq to the same CPU as the bio/request? I'm wondering if that's going to be better for the blk_mq path, but possibly worse for the non-multithreaded bio path.

I haven't tried it myself yet. However, based on my understanding, the current taskq design doesn't seem to allow us to control thread-to-CPU binding. There is a spl_taskq_thread_bind variable, but that seems to work globally. Even if it were possible, I believe it could potentially worsen prefetch hits due to request reordering. Let me know if I've missed something or if you'd like me to test some other scenario.

> That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it. You may also want to make ztqs->tqs_cnt its own separate module param, with it defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.

I am benchmarking the 1:1 ratio of blk_mq workqueues to task queues and will share the results.

@tonyhutter (Contributor) commented:

> However, based on my understanding, the current taskq design doesn't seem to allow us to control thread-to-CPU binding.

Ah you're right - taskq has to be dynamically allocated via taskq_create(), and we don't really have control over what CPU it's pinned to other than spl_taskq_thread_bind, which doesn't give us what we want.
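
For context, the multiple taskqs would be allocated with the existing taskq_create() interface, roughly as in the sketch below (illustrative names and flags, not the exact patch code); thread placement stays up to the scheduler.

/*
 * Illustrative sketch: create an array of taskqs via taskq_create().
 * There is no per-CPU binding here beyond what spl_taskq_thread_bind
 * provides globally.
 */
static int
zvol_taskqs_alloc(zv_taskq_t *ztqs, uint_t num_tqs, uint_t threads_per_tq)
{
        ztqs->tqs_cnt = num_tqs;
        ztqs->tqs_taskq = kmem_zalloc(num_tqs * sizeof (taskq_t *), KM_SLEEP);

        for (uint_t i = 0; i < num_tqs; i++) {
                char name[32];

                (void) snprintf(name, sizeof (name), "zvol_tq-%u", i);
                ztqs->tqs_taskq[i] = taskq_create(name, threads_per_tq,
                    maxclsyspri, threads_per_tq, INT_MAX,
                    TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
                if (ztqs->tqs_taskq[i] == NULL)
                        return (-ENOMEM);       /* caller unwinds */
        }
        return (0);
}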

Currently, zvol uses a single taskq, resulting in throughput bottleneck
under heavy load due to lock contention on the single taskq. This patch
addresses the performance bottleneck under heavy load conditions by
utilizing multiple taskqs, thus mitigating lock contention. The number
of taskqs scales dynamically based on the available CPUs in the system,
as illustrated below:

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32       32
2       1       32       32
4       1       32       32
8       2       16       32
16      3       11       33
32      5       7        35
64      8       8        64
128     11      12       132
256     16      16       256

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
@ixhamza (Contributor, Author) commented Mar 14, 2024

> That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it.

According to my blk-mq benchmarks, a 1:1 ratio of blk_mq workqueues to task queues causes a bottleneck in sequential write performance, although reads are not affected. Therefore, I am keeping the zvol_blk_mq_threads path unchanged in this PR.

> You may also want to make ztqs->tqs_cnt its own separate module param, with it defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.

Added this in the latest commit, thanks for the suggestion.

@tonynguien (Contributor) left a comment:

Thanks for making this performance improvement. It looks great. I have just one comment.

@@ -1570,8 +1593,40 @@ zvol_init(void)
zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024);
}

/*
* Use atleast 32 zvol_threads but for many core system,
Contributor:

I think we can simplify to make it easier to reason about. One idea is:

  • Systems with fewer than 32 CPUs retain the previous behavior of 1 taskq and 32 threads (or the zvol_threads value)
  • Systems with 32 or more CPUs will scale the number of taskqs with CPU count / [6|8] (or some other #define value), and the per-taskq thread count can be 8 (perhaps making this a tunable)

It would be close to the values and behavior you have, and the code would be a little simpler.

Contributor Author:

Apologies if I misunderstood your reasoning, but it's worth noting that single taskq lock contention can be observed even on systems with fewer than 32 CPUs.

@tonynguien (Contributor) commented:

The flame graph with the code change shows 7 taskqs. The code states there should be 5 taskqs for a 32-CPU system. Am I missing something?

@ixhamza (Contributor, Author) commented Mar 15, 2024

@tonynguien your observation is correct. AMD EPYC 7543 has 64 threads, so there are 8 taskqs in total: https://www.amd.com/en/products/cpu/amd-epyc-7543.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Apr 4, 2024
@behlendorf behlendorf merged commit 99741bd into openzfs:master Apr 4, 2024
17 of 25 checks passed
@stuartthebruce commented:

@behlendorf is this a candidate for 2.2.4?

@behlendorf (Contributor) commented:

@stuartthebruce potentially. We did have a few follow-up fixes which were needed for this change, so I'd like to let it soak for a while.

@stuartthebruce commented:

> @stuartthebruce potentially. We did have a few follow-up fixes which were needed for this change, so I'd like to let it soak for a while.

Sounds good. FYI, the motivation for me is to be able to replace the following LVM raid5 with ZFS.

[root@vsmarchive ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] 
md0 : active raid5 nvme9n1[8] nvme8n1[7] nvme7n1[6] nvme6n1[5] nvme5n1[4] nvme4n1[3] nvme3n1[2] nvme2n1[1] nvme1n1[0]
      100018432512 blocks super 1.2 level 5, 32k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 1/94 pages [4KB], 65536KB chunk

However, the current zvol performance isn't quite good enough. It can handle the following average load,

[root@vsmarchive ~]# iostat -xzm
Linux 3.10.0-1160.108.1.el7.x86_64 (vsmarchive) 	04/08/2024 	_x86_64_	(64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.94    0.00    2.19    0.62    0.00   96.25

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme10n1          0.00     0.00   25.63   43.11     0.04     0.55    17.74     0.00    0.06    0.01    0.09   0.11   0.79
nvme2n1        4059.42  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.54   0.11  18.32
nvme1n1        4059.42  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.55   0.11  18.33
nvme7n1        4059.42  4997.77 1604.45  132.60    29.58    20.04    58.51     0.56    0.32    0.23    1.52   0.11  18.33
nvme5n1        4059.42  4997.77 1604.44  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.53   0.11  18.34
nvme4n1        4059.42  4997.77 1604.44  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.54   0.11  18.35
nvme6n1        4059.43  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.53   0.11  18.35
nvme3n1        4059.43  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.55   0.11  18.35
nvme0n1           0.00     0.00   25.49   43.11     0.04     0.55    17.77     0.00    0.06    0.01    0.09   0.11   0.78
nvme9n1        4059.43  4997.77 1604.46  132.61    29.58    20.04    58.51     0.48    0.27    0.23    0.85   0.11  18.31
nvme8n1        4059.42  4997.77 1604.45  132.60    29.58    20.04    58.51     0.48    0.27    0.23    0.85   0.11  18.31
sda               6.30     0.07    9.43    3.63     0.91     0.12   161.00     0.03    2.37    0.41    7.44   0.38   0.50
sdb               6.30     0.07    6.71    3.63     0.81     0.12   184.66     0.04    3.58    0.46    9.36   0.44   0.45
md127             0.00     0.00    3.52    2.27     0.14     0.12    92.69     0.00    0.00    0.00    0.00   0.00   0.00
md126             0.00     0.00    0.01    0.00     0.00     0.00    61.53     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    2.67    1.91     0.04     0.03    30.92     0.04    8.45    0.22   19.95   2.82   1.29
md0               0.00     0.00 2547.63 41027.40    77.07   160.26    11.15     0.00    0.00    0.00    0.00   0.00   0.00
zd0               0.00     0.00 1675.25  174.80    31.92     0.68    36.09     0.05    0.03    0.03    0.01   0.02   2.94
dm-1              0.00     0.00    0.85    0.26     0.10     0.09   358.75     0.25  224.87    0.80  957.64   3.03   0.33

but not the order of magnitude larger peak loads.

@mmatuska (Contributor) commented:

Possible candidate for zfs-2.2.4-staging?

@behlendorf (Contributor) commented:

@mmatuska yes, I haven't seen any reports of this causing problems.

ixhamza added a commit to ixhamza/zfs that referenced this pull request Apr 16, 2024
zvol: use multiple taskqs
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes openzfs#15992
@ixhamza mentioned this pull request Apr 16, 2024
tonyhutter pushed a commit that referenced this pull request Apr 17, 2024
zvol: use multiple taskqs
@arturpzol commented Aug 9, 2024

After upgrading to OpenZFS version 2.2.4 (kernel 5.15), I have observed worse sequential write performance on zvols under various workloads, as shown in the chart below:

ZFS223vs224

The introduction of multiple taskqs appears to negatively impact performance, especially when using a small number of threads.

Even with more threads, the performance improvement is not as significant as what is shown in the graph from ixhamza.

I'm confident that the commit introducing multiple taskqs is responsible for the regression because I reverted this commit in OpenZFS 2.2.4, and the performance reverted to what I experienced with version 2.2.3.

Experimentally, I found that setting zvol_num_taskqs=2 brought performance closer to the levels seen in 2.2.3. However, this requires manually adjusting the parameter. The default value of 0 (which scales the taskq count internally, roughly one taskq per 6 CPUs) does not seem optimal for sequential write workloads.

It is possible that this issue is hardware-dependent. I performed the tests on the following hardware:

    System 1: Intel Xeon Silver 4309Y, 128 GB RAM, 22 x HDDs
    FIO iosize 128k, zvol volblocksize 128k
    Pool Structure: All disks are configured as single vdevs 

I noticed that the author of the commit performed their tests on an AMD EPYC 7543, so it seems the test results might be hardware-dependent.

Additionally, I ran tests on different hardware and observed similar behavior:

    System 2: 2x Intel Xeon Gold 5222, 192 GB RAM, 24 x NVMe
    FIO iosize 128k, zvol volblocksize 128k
    Pool Structure: All disks are configured as single vdevs 

ZFS223vs224_nvme

What I find most interesting is the lack of growth over ZFS 2.2.3 for more threads. Unless I'm misunderstanding something...

FIO configuration:


[global]
overwrite=1
end_fsync=1
ss=iops_slope:0.2%
ss_dur=30s
ss_ramp=15s
time_based
runtime=1000s
ioengine=libaio
refill_buffers
invalidate=1
nice=-19
prio=1
fallocate=none
group_reporting
gtod_reduce=0
randrepeat=0
ramp_time=0s
direct=1
log_avg_msec=1000
log_hist_msec=1000
per_job_logs=0
random_generator=lfsr
filename=/dev/Pool-1/zvol128kb
rw=read
percentage_random=0
rwmixwrite=0
iodepth=1
bs=128K
write_iops_log=logs/iops/seq_read_T64Q1
write_lat_log=logs/lat/seq_read_T64Q1
write_hist_log=logs/hist/seq_read_T64Q1
[0]
size=3125M
offset=0M
[1]
size=3125M
offset=3125M
[2]
size=3125M
offset=6250M
[3]
size=3125M
offset=9375M
[4]
size=3125M
offset=12500M
[5]
size=3125M
offset=15625M
[6]
size=3125M
offset=18750M

@ixhamza (Contributor, Author) commented Aug 9, 2024

@arturpzol - This change helps avoid single-taskq lock contention, which can become a bottleneck on fast storage under a heavy workload. If you are not experiencing lock contention, this change may not be beneficial. You might also consider adjusting ZVOL_TASKQ_OFFSET_SHIFT to a lower offset value, which could involve more threads. I intentionally kept it at 512M to take advantage of the speculative prefetcher. Note that it is a compile-time constant, so you will need to rebuild afterwards.

/*
 * Switch taskq at multiple of 512 MB offset. This can be set to a lower value
 * to utilize more threads for small files but may affect prefetch hits.
 */
#define	ZVOL_TASKQ_OFFSET_SHIFT 29

Setting zvol_num_taskqs=1 would essentially disable this support, i.e., all the threads would be under the same taskq.
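
For what it's worth, the role of the shift can be illustrated with a small sketch: all requests whose offsets fall within the same 512 MB window map to the same taskq, so a sequential stream stays on one taskq and prefetch remains effective. The helper name below is illustrative, and the patch may also mix other inputs into the hash; this only shows the effect of ZVOL_TASKQ_OFFSET_SHIFT.

/* Illustrative only: derive a taskq index from the I/O offset. */
static uint_t
zvol_taskq_index(const zv_taskq_t *ztqs, uint64_t offset)
{
        /* Offsets within the same 1 << 29 (512 MB) window share a taskq. */
        return ((offset >> ZVOL_TASKQ_OFFSET_SHIFT) % ztqs->tqs_cnt);
}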

@arturpzol commented Aug 20, 2024

@ixhamza setting zvol_num_taskqs=1 seems to restore the previous behavior of using a single taskq, and in most cases I experience better performance with this setting. I wonder if zvol_num_taskqs should be set to 1 by default: in most environments where zvols are used there is likely to be no lock contention, so the benefits of using multiple taskqs may not be noticeable and may even lead to worse performance.

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
zvol: use multiple taskqs