
zvol: use multiple taskqs #15992

Merged
merged 1 commit into openzfs:master on Apr 4, 2024

Conversation

@ixhamza (Contributor) commented Mar 13, 2024

Motivation and Context

The current implementation of zvol uses a single taskq, leading to lock contention under heavy load and consequently decreased throughput. Introducing multiple taskqs and implementing a switch based on IO offset can alleviate this lock contention, thus improving overall throughput.

Description

Currently, zvol uses a single taskq, resulting in a throughput bottleneck under heavy load due to lock contention on that single taskq. This patch addresses the bottleneck by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scales dynamically based on the available CPUs in the system, as illustrated in the table below (a sketch of a heuristic that reproduces these numbers follows the table):

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32       32
2       1       32       32
4       1       32       32
8       2       16       32
16      3       11       33
32      5       7        35
64      8       8        64
128     11      12       132
256     16      16       256
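
For reference, below is a minimal user-space sketch of a scaling heuristic that reproduces the table above; the variable names (num_taskqs, per_taskq_threads, total_threads) are illustrative and not necessarily those used in the patch.

#include <stdio.h>

/*
 * Illustrative sketch only: computes taskq/thread counts matching the
 * table above. Names are hypothetical, not taken from the patch.
 */
int
main(void)
{
        unsigned int cpus[] = { 1, 2, 4, 8, 16, 32, 64, 128, 256 };

        for (unsigned int i = 0; i < sizeof (cpus) / sizeof (cpus[0]); i++) {
                unsigned int ncpus = cpus[i];
                /* Use at least 32 worker threads in total. */
                unsigned int total_threads = ncpus > 32 ? ncpus : 32;

                /* Prefer roughly one taskq per 6 CPUs ... */
                unsigned int num_taskqs = 1 + ncpus / 6;
                /* ... but no more taskqs than threads per taskq. */
                while (num_taskqs * num_taskqs > total_threads)
                        num_taskqs--;

                /* Spread the threads across the taskqs, rounding up. */
                unsigned int per_taskq_threads =
                    (total_threads + num_taskqs - 1) / num_taskqs;

                printf("%3u cpus -> %2u taskqs x %2u threads = %3u total\n",
                    ncpus, num_taskqs, per_taskq_threads,
                    num_taskqs * per_taskq_threads);
        }
        return (0);
}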

How Has This Been Tested?

  • Local zfs-tests and buildbot tests.
  • Fio performance testing with increasing fio threads.
Testing Environment:
- System: AMD EPYC 7543 32-Core Processor, 128 GB DRAM
- Drives/Pool Configuration: 6x Kioxia NVMe SSD KCM6DVUL1T60 (4 Stripe Data, 1 SLOG, and 1 L2CACHE)
- OS: Debian GNU/Linux 12 (bookworm), kernel 6.6.16
- ARC Size: 128G
- ADS: 128G
- IO Size / Volblock Size: 128K
- Direct IO: 1

FIO tests are conducted using the above configuration, with all jobs manually configured with different offsets and sizes so that they do not overlap. The benchmarks for sequential reads are provided below for reference. In the graph, the blue line represents the performance of the current single-taskq implementation, while the other lines depict the performance after switching to multiple taskqs with different offsets. As the number of FIO jobs increases (e.g., at 32 jobs), the performance with multiple taskqs is nearly double that of the current single-taskq implementation.

seq_read_taskq_benchmark

  • Flame Graph with 32 fio jobs prior to changes:
    sread-fg-multi-tq

  • Flame Graph with 32 fio jobs with changes applied:
    sread-fg-single-tq

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@@ -532,6 +543,17 @@ zvol_request_impl(zvol_state_t *zv, struct bio *bio, struct request *rq,
}

zv_request_task_t *task;
zv_taskq_t *ztqs = &zvol_taskqs;
int blk_mq_hw_queue = 0;

Member:

Mostly cosmetic, but as I see, rq->mq_hctx->queue_num is unsigned.

@amotin (Member) left a comment:

It seems zvol_blk_mq_actual_threads is used only to set zso->tag_set.nr_hw_queues. Would it be better to report there ztqs->tqs_cnt instead and just remove zvol_blk_mq_threads and zvol_blk_mq_actual_threads? Now the value looks pretty arbitrary.

@tonyhutter (Contributor) commented:

@ixhamza this is a great optimization! Did you happen to try one taskq per CPU (and making the taskq a per-cpu variable)? If so, did you try hashing the taskq to the same CPU as the bio/request? I'm wondering if that's going to be better for the blk_mq path, but possibly worse for the non-multithreaded bio path.

> It seems zvol_blk_mq_actual_threads is used only to set zso->tag_set.nr_hw_queues. Would it be better to report there ztqs->tqs_cnt instead and just remove zvol_blk_mq_threads and zvol_blk_mq_actual_threads? Now the value looks pretty arbitrary.

That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it. You may also want to make ztqs->tqs_cnt its own separate module param, with it defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.
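
For reference, the per-CPU variant being asked about could look roughly like the sketch below. This is hypothetical and not what this PR implements; the zv_taskq_t field names are assumed for illustration.

/*
 * Hypothetical sketch: pick a taskq based on the submitting CPU rather
 * than the I/O offset. Not the approach taken by this patch.
 */
static taskq_t *
zvol_select_taskq_by_cpu(zv_taskq_t *ztqs)
{
        /* Map the current CPU onto one of the available taskqs. */
        unsigned int cpu = raw_smp_processor_id();

        return (ztqs->tqs_taskq[cpu % ztqs->tqs_cnt]);
}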

@ixhamza (Contributor, Author) commented Mar 13, 2024

@tonyhutter - thank you for your feedback.

> Did you happen to try one taskq per CPU (and making the taskq a per-cpu variable)? If so, did you try hashing the taskq to the same CPU as the bio/request? I'm wondering if that's going to be better for the blk_mq path, but possibly worse for the non-multithreaded bio path.

I haven't tried it myself yet. However, based on my understanding, the current taskq design doesn't seem to allow us to control thread-to-CPU binding. There is a spl_taskq_thread_bind variable, but that seems to work globally. Even if it were possible, I believe it could potentially worsen prefetch hits due to request reordering. Let me know if I've missed something or if you'd like me to test some other scenario.

> That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it. You may also want to make ztqs->tqs_cnt its own separate module param, with it defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.

I am benchmarking the 1:1 ratio of blk_mq workqueues to task queues and will share the results.

@tonyhutter (Contributor) commented:

> However, based on my understanding, the current taskq design doesn't seem to allow us to control thread-to-CPU binding.

Ah you're right - taskq has to be dynamically allocated via taskq_create(), and we don't really have control over what CPU it's pinned to other than spl_taskq_thread_bind, which doesn't give us what we want.
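
For context, the multiple taskqs would be allocated with the existing taskq_create() interface, roughly as in the sketch below (illustrative names and flags, not the exact patch code); thread placement stays up to the scheduler.

/*
 * Illustrative sketch: create an array of taskqs via taskq_create().
 * There is no per-CPU binding here beyond what spl_taskq_thread_bind
 * provides globally.
 */
static int
zvol_taskqs_alloc(zv_taskq_t *ztqs, uint_t num_tqs, uint_t threads_per_tq)
{
        ztqs->tqs_cnt = num_tqs;
        ztqs->tqs_taskq = kmem_zalloc(num_tqs * sizeof (taskq_t *), KM_SLEEP);

        for (uint_t i = 0; i < num_tqs; i++) {
                char name[32];

                (void) snprintf(name, sizeof (name), "zvol_tq-%u", i);
                ztqs->tqs_taskq[i] = taskq_create(name, threads_per_tq,
                    maxclsyspri, threads_per_tq, INT_MAX,
                    TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
                if (ztqs->tqs_taskq[i] == NULL)
                        return (-ENOMEM);       /* caller unwinds */
        }
        return (0);
}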

Currently, zvol uses a single taskq, resulting in throughput bottleneck
under heavy load due to lock contention on the single taskq. This patch
addresses the performance bottleneck under heavy load conditions by
utilizing multiple taskqs, thus mitigating lock contention. The number
of taskqs scales dynamically based on the available CPUs in the system,
as illustrated below:

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32       32
2       1       32       32
4       1       32       32
8       2       16       32
16      3       11       33
32      5       7        35
64      8       8        64
128     11      12       132
256     16      16       256

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
@ixhamza (Contributor, Author) commented Mar 14, 2024

> That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it.

According to my blk-mq benchmarks, a 1:1 ratio of blk_mq workqueues to task queues causes a bottleneck in sequential write performance, although reads are not affected. Therefore, I am keeping the zvol_blk_mq_threads path unchanged in this PR.

> You may also want to make ztqs->tqs_cnt its own separate module param, with it defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.

Added this in the latest commit, thanks for the suggestion.

@tonynguien (Contributor) left a comment:

Thanks for making this performance improvement. It looks great. I have just one comment.

@@ -1570,8 +1593,40 @@ zvol_init(void)
zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024);
}

/*
* Use atleast 32 zvol_threads but for many core system,
Contributor:

I think we can simplify to make it easier to reason about. One idea is:

  • Systems with fewer than 32 CPUs retain the previous behavior of 1 taskq and 32 threads (or the zvol_threads value)
  • Systems with 32 or more CPUs will scale the number of taskqs with CPU count / [6|8] (or some other #define value), and the per-taskq thread count can be 8 (perhaps making this a tunable)

It would be close to the values and behavior you have, and the code would be a little simpler.

Contributor Author:

Apologies if I misunderstood your reasoning, but it's worth noting that single taskq lock contention can be observed even on systems with fewer than 32 CPUs.

@tonynguien (Contributor) commented:

The flame graph with the code change shows 7 taskqs. The code states there should be 5 taskqs for a 32-CPU system. Am I missing something?

@ixhamza (Contributor, Author) commented Mar 15, 2024

@tonynguien your observation is correct. AMD EPYC 7543 has 64 threads, so there are 8 taskqs in total: https://www.amd.com/en/products/cpu/amd-epyc-7543.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Apr 4, 2024
@behlendorf behlendorf merged commit 99741bd into openzfs:master Apr 4, 2024
17 of 25 checks passed
@stuartthebruce commented:

@behlendorf is this a candidate for 2.2.4?

@behlendorf (Contributor) commented:

@stuartthebruce potentially. We did have a few follow-up fixes which were needed for this change, so I'd like to let it soak for a while.

@stuartthebruce commented:

> @stuartthebruce potentially. We did have a few follow-up fixes which were needed for this change, so I'd like to let it soak for a while.

Sounds good. FYI, the motivation for me is to be able to replace the following LVM raid5 with ZFS.

[root@vsmarchive ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] 
md0 : active raid5 nvme9n1[8] nvme8n1[7] nvme7n1[6] nvme6n1[5] nvme5n1[4] nvme4n1[3] nvme3n1[2] nvme2n1[1] nvme1n1[0]
      100018432512 blocks super 1.2 level 5, 32k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 1/94 pages [4KB], 65536KB chunk

However, the current zvol performance isn't quite good enough. It can handle the following average load,

[root@vsmarchive ~]# iostat -xzm
Linux 3.10.0-1160.108.1.el7.x86_64 (vsmarchive) 	04/08/2024 	_x86_64_	(64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.94    0.00    2.19    0.62    0.00   96.25

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme10n1          0.00     0.00   25.63   43.11     0.04     0.55    17.74     0.00    0.06    0.01    0.09   0.11   0.79
nvme2n1        4059.42  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.54   0.11  18.32
nvme1n1        4059.42  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.55   0.11  18.33
nvme7n1        4059.42  4997.77 1604.45  132.60    29.58    20.04    58.51     0.56    0.32    0.23    1.52   0.11  18.33
nvme5n1        4059.42  4997.77 1604.44  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.53   0.11  18.34
nvme4n1        4059.42  4997.77 1604.44  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.54   0.11  18.35
nvme6n1        4059.43  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.53   0.11  18.35
nvme3n1        4059.43  4997.77 1604.45  132.61    29.58    20.04    58.51     0.57    0.33    0.23    1.55   0.11  18.35
nvme0n1           0.00     0.00   25.49   43.11     0.04     0.55    17.77     0.00    0.06    0.01    0.09   0.11   0.78
nvme9n1        4059.43  4997.77 1604.46  132.61    29.58    20.04    58.51     0.48    0.27    0.23    0.85   0.11  18.31
nvme8n1        4059.42  4997.77 1604.45  132.60    29.58    20.04    58.51     0.48    0.27    0.23    0.85   0.11  18.31
sda               6.30     0.07    9.43    3.63     0.91     0.12   161.00     0.03    2.37    0.41    7.44   0.38   0.50
sdb               6.30     0.07    6.71    3.63     0.81     0.12   184.66     0.04    3.58    0.46    9.36   0.44   0.45
md127             0.00     0.00    3.52    2.27     0.14     0.12    92.69     0.00    0.00    0.00    0.00   0.00   0.00
md126             0.00     0.00    0.01    0.00     0.00     0.00    61.53     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    2.67    1.91     0.04     0.03    30.92     0.04    8.45    0.22   19.95   2.82   1.29
md0               0.00     0.00 2547.63 41027.40    77.07   160.26    11.15     0.00    0.00    0.00    0.00   0.00   0.00
zd0               0.00     0.00 1675.25  174.80    31.92     0.68    36.09     0.05    0.03    0.03    0.01   0.02   2.94
dm-1              0.00     0.00    0.85    0.26     0.10     0.09   358.75     0.25  224.87    0.80  957.64   3.03   0.33

but not the order of magnitude larger peak loads.

@mmatuska (Contributor) commented:

Possible candidate for zfs-2.2.4-staging?

@behlendorf (Contributor) commented:

@mmatuska yes, I haven't seen any reports of this causing problems.

ixhamza added a commit to ixhamza/zfs that referenced this pull request Apr 16, 2024
zvol: use multiple taskqs
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes openzfs#15992
@ixhamza mentioned this pull request Apr 16, 2024
tonyhutter pushed a commit that referenced this pull request Apr 17, 2024
zvol: use multiple taskqs
@arturpzol commented Aug 9, 2024

After upgrading to OpenZFS version 2.2.4 (kernel 5.15), I have observed worse sequential write performance on zvols under various workloads, as shown in the chart below:

ZFS223vs224

The introduction of multiple taskqs appears to negatively impact performance, especially when using a small number of threads.

Even with more threads, the performance improvement is not as significant as what is shown in the graph from ixhamza.

I'm confident that the commit introducing multiple taskqs is responsible for the regression because I reverted this commit in OpenZFS 2.2.4, and the performance reverted to what I experienced with version 2.2.3.

Experimentally, I found that setting zvol_num_taskqs=2 brought performance closer to the levels seen in 2.2.3. However, this requires manually adjusting the parameter. The default value of 0 (which scales the taskq count internally, roughly one taskq per 6 CPUs) does not seem optimal for sequential write workloads.

It is possible that this issue is hardware-dependent. I performed the tests on the following hardware:

    System 1: Intel Xeon Silver 4309Y, 128 GB RAM, 22 x HDDs
    FIO iosize 128k, zvol volblocksize 128k
    Pool Structure: All disks are configured as single vdevs 

I noticed that the author of the commit performed their tests on an AMD EPYC 7543, so it seems the test results might be hardware-dependent.

Additionally, I ran tests on different hardware and observed similar behavior:

    System 2: 2x Intel Xeon Gold 5222, 192 GB RAM, 24 x NVMe
    FIO iosize 128k, zvol volblocksize 128k
    Pool Structure: All disks are configured as single vdevs 

ZFS223vs224_nvme

What I find most interesting is the lack of growth over ZFS 2.2.3 for more threads. Unless I'm misunderstanding something...

FIO configuration:


[global]
overwrite=1
end_fsync=1
ss=iops_slope:0.2%
ss_dur=30s
ss_ramp=15s
time_based
runtime=1000s
ioengine=libaio
refill_buffers
invalidate=1
nice=-19
prio=1
fallocate=none
group_reporting
gtod_reduce=0
randrepeat=0
ramp_time=0s
direct=1
log_avg_msec=1000
log_hist_msec=1000
per_job_logs=0
random_generator=lfsr
filename=/dev/Pool-1/zvol128kb
rw=read
percentage_random=0
rwmixwrite=0
iodepth=1
bs=128K
write_iops_log=logs/iops/seq_read_T64Q1
write_lat_log=logs/lat/seq_read_T64Q1
write_hist_log=logs/hist/seq_read_T64Q1
[0]
size=3125M
offset=0M
[1]
size=3125M
offset=3125M
[2]
size=3125M
offset=6250M
[3]
size=3125M
offset=9375M
[4]
size=3125M
offset=12500M
[5]
size=3125M
offset=15625M
[6]
size=3125M
offset=18750M

@ixhamza (Contributor, Author) commented Aug 9, 2024

@arturpzol - This change helps avoid single-taskq lock contention, which can become a bottleneck on fast storage under a heavy workload. If you are not experiencing lock contention, this change may not be beneficial. You might also consider adjusting ZVOL_TASKQ_OFFSET_SHIFT to a lower offset value, which could involve more threads. I intentionally kept it at 512M to take advantage of the speculative prefetcher. Note that it is a compile-time constant, so you will need to rebuild afterwards.

/*
 * Switch taskq at multiple of 512 MB offset. This can be set to a lower value
 * to utilize more threads for small files but may affect prefetch hits.
 */
#define	ZVOL_TASKQ_OFFSET_SHIFT 29

Setting zvol_num_taskqs=1 would essentially disable this support, i.e., all the threads would be under the same taskq.
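
For what it's worth, the role of the shift can be illustrated with a small sketch: all requests whose offsets fall within the same 512 MB window map to the same taskq, so a sequential stream stays on one taskq and prefetch remains effective. The helper name below is illustrative, and the patch may also mix other inputs into the hash; this only shows the effect of ZVOL_TASKQ_OFFSET_SHIFT.

/* Illustrative only: derive a taskq index from the I/O offset. */
static uint_t
zvol_taskq_index(const zv_taskq_t *ztqs, uint64_t offset)
{
        /* Offsets within the same 1 << 29 (512 MB) window share a taskq. */
        return ((offset >> ZVOL_TASKQ_OFFSET_SHIFT) % ztqs->tqs_cnt);
}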

@arturpzol commented Aug 20, 2024

@ixhamza setting zvol_num_taskqs=1 seems to restore the previous behavior of using a single taskq, and in most cases I experience better performance with this setting. I wonder if zvol_num_taskqs should be set to 1 by default: in most environments where zvols are used there is likely to be no lock contention, so the benefits of using multiple taskqs may not be noticeable and may even lead to worse performance.

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
zvol: use multiple taskqs