zvol: use multiple taskqs #15992
Conversation
module/os/linux/zfs/zvol_os.c
Outdated
@@ -532,6 +543,17 @@ zvol_request_impl(zvol_state_t *zv, struct bio *bio, struct request *rq,
	}

	zv_request_task_t *task;
	zv_taskq_t *ztqs = &zvol_taskqs;
	int blk_mq_hw_queue = 0;
Mostly cosmetics, but as I see it, rq->mq_hctx->queue_num is unsigned.
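For readers without the full diff: these declarations feed a step that hashes each request to one of the taskqs. A minimal sketch of that idea follows; the cityhash4 call, the ZVOL_TASKQ_OFFSET_SHIFT constant, and the tqs_taskq member name are illustrative assumptions, not a quote of the merged code:

```c
/*
 * Sketch: pick one of the zvol taskqs for an incoming request by hashing
 * the IO offset (and the blk-mq hardware queue, when one exists), so that
 * load spreads across taskqs instead of contending on a single taskq lock.
 */
static taskq_t *
zvol_pick_taskq_sketch(zvol_state_t *zv, uint64_t offset, struct request *rq)
{
	zv_taskq_t *ztqs = &zvol_taskqs;
	uint_t blk_mq_hw_queue = 0;	/* rq->mq_hctx->queue_num is unsigned */

#ifdef HAVE_BLK_MQ
	if (rq != NULL)
		blk_mq_hw_queue = rq->mq_hctx->queue_num;
#endif
	/* ZVOL_TASKQ_OFFSET_SHIFT is a hypothetical chunk-size shift. */
	uint64_t hash = cityhash4((uintptr_t)zv,
	    offset >> ZVOL_TASKQ_OFFSET_SHIFT, blk_mq_hw_queue, 0);

	return (ztqs->tqs_taskq[hash % ztqs->tqs_cnt]);
}
```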
It seems zvol_blk_mq_actual_threads is used only to set zso->tag_set.nr_hw_queues. Would it be better to report ztqs->tqs_cnt there instead, and just remove zvol_blk_mq_threads and zvol_blk_mq_actual_threads? As it stands, the value looks pretty arbitrary.
@ixhamza this is a great optimization! Did you happen to try one taskq per CPU (and making the taskq a per-cpu variable)? If so, did you try hashing the taskq to the same CPU as the bio/request? I'm wondering if that's going to be better for the blk_mq path, but possibly worse for the non-multithreaded bio path.
That would give you a 1:1 ratio of blk_mq workqueues to task queues. I feel like that would be optimal, but you'd have to benchmark it. You may also want to make ztqs->tqs_cnt its own separate module param, defaulting to the 1:6 taskq-to-CPU ratio you found through experimentation.
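A separate module parameter as suggested here could be exposed with the usual Linux machinery; a minimal sketch, with the parameter name zvol_num_taskqs used purely for illustration:

```c
/* Sketch: number of zvol taskqs; 0 means scale automatically from the CPU count. */
static unsigned int zvol_num_taskqs = 0;
module_param(zvol_num_taskqs, uint, 0444);
MODULE_PARM_DESC(zvol_num_taskqs,
	"Number of zvol taskqs (0 = scale with CPU count)");
```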
@tonyhutter - thank you for your feedback.
I haven't tried it myself yet. However, based on my understanding, the current taskq design doesn't seem to allow us to control thread-to-CPU binding. There is a
I am benchmarking and will share the results regarding a 1:1 ratio of blk_mq workqueues to task queues.
Ah you're right - taskq has to be dynamically allocated via
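For reference, dynamically allocating an array of taskqs with the SPL taskq API might look like the sketch below (error handling abbreviated; the tqs_taskq member, the naming scheme, and the flags are assumptions, not the merged code):

```c
/*
 * Sketch: create num_tqs taskqs with per_tq_thread threads each,
 * and unwind on failure.
 */
ztqs->tqs_taskq = kmem_alloc(num_tqs * sizeof (taskq_t *), KM_SLEEP);
for (uint_t i = 0; i < num_tqs; i++) {
	char name[32];
	(void) snprintf(name, sizeof (name), "zvol_tq-%u", i);
	ztqs->tqs_taskq[i] = taskq_create(name, per_tq_thread, defclsyspri,
	    per_tq_thread, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
	if (ztqs->tqs_taskq[i] == NULL) {
		while (i-- > 0)
			taskq_destroy(ztqs->tqs_taskq[i]);
		kmem_free(ztqs->tqs_taskq, num_tqs * sizeof (taskq_t *));
		return (-ENOMEM);
	}
}
ztqs->tqs_cnt = num_tqs;
```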
Currently, zvol uses a single taskq, resulting in throughput bottleneck under heavy load due to lock contention on the single taskq. This patch addresses the performance bottleneck under heavy load conditions by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scale dynamically based on the available CPUs in the system, as illustrated below:

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32      32
2       1       32      32
4       1       32      32
8       2       16      32
16      3       11      33
32      5       7       35
64      8       8       64
128     11      12      132
256     16      16      256

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
According to my blk-mq benchmarks, a 1:1 ratio of blk_mq workqueues to task queues causes a bottleneck in sequential write performance, although reads are not affected. Therefore, I am keeping the
Added this in the latest commit, thanks for the suggestion.
Thanks for making this performance improvement. It looks great. I have just one comment.
@@ -1570,8 +1593,40 @@ zvol_init(void)
		zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024);
	}

	/*
	 * Use at least 32 zvol_threads, but for many-core systems,
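To make the table in the commit message concrete: the scaling amounts to roughly one taskq per six CPUs, capped so there are never more taskqs than threads per taskq, with a 32-thread floor overall. The sketch below reproduces the table under those assumptions; treat it as a reconstruction for illustration, not a quote of the final code:

```c
/* Sketch of the scaling that reproduces the table in the commit message. */
uint_t threads = MAX(num_online_cpus(), 32);	/* honor zvol_threads if set */
uint_t num_tqs = 1 + num_online_cpus() / 6;	/* ~1 taskq per 6 CPUs */
while (num_tqs * num_tqs > threads)
	num_tqs--;			/* no more taskqs than threads per taskq */
uint_t per_tq_thread = threads / num_tqs;
if (per_tq_thread * num_tqs < threads)
	per_tq_thread++;		/* round up so all threads are used */
```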
I think we can simplify to make it easier to reason about. One idea is:
- Systems with fewer than 32 CPUs retain the previous behavior of 1 taskq and 32 threads (or the zvol_threads value).
- Systems with 32 or more CPUs scale the number of taskqs with CPU count / [6|8] (or some other #define value), and the per-taskq thread count can be 8 (perhaps making this a tunable).
It would be close to the values and behavior you have and the code would be a little simpler.
Apologies if I misunderstood your reasoning, but it's worth noting that single taskq lock contention can be observed even on systems with fewer than 32 CPUs.
The flame graph with the code change shows 7 taskqs. The code states there should be 5 taskqs for a 32-CPU system. Am I missing something?
@tonynguien your observation is correct.
@behlendorf is this a candidate for 2.2.4?
@stuartthebruce potentially. We did have a few follow-up fixes which were needed for this change, so I'd like to let it soak for a while.
Sounds good. FYI, the motivation for me is to be able to replace the following LVM raid5 with ZFS.

[root@vsmarchive ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 nvme9n1[8] nvme8n1[7] nvme7n1[6] nvme6n1[5] nvme5n1[4] nvme4n1[3] nvme3n1[2] nvme2n1[1] nvme1n1[0]
      100018432512 blocks super 1.2 level 5, 32k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 1/94 pages [4KB], 65536KB chunk

However, the current zvol performance isn't quite good enough. It can handle the following average load,

[root@vsmarchive ~]# iostat -xzm
Linux 3.10.0-1160.108.1.el7.x86_64 (vsmarchive)  04/08/2024  _x86_64_  (64 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.94   0.00     2.19     0.62    0.00  96.25

Device:    rrqm/s   wrqm/s      r/s       w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
nvme10n1     0.00     0.00    25.63     43.11    0.04    0.55     17.74      0.00    0.06     0.01     0.09   0.11   0.79
nvme2n1   4059.42  4997.77  1604.45    132.61   29.58   20.04     58.51      0.57    0.33     0.23     1.54   0.11  18.32
nvme1n1   4059.42  4997.77  1604.45    132.61   29.58   20.04     58.51      0.57    0.33     0.23     1.55   0.11  18.33
nvme7n1   4059.42  4997.77  1604.45    132.60   29.58   20.04     58.51      0.56    0.32     0.23     1.52   0.11  18.33
nvme5n1   4059.42  4997.77  1604.44    132.61   29.58   20.04     58.51      0.57    0.33     0.23     1.53   0.11  18.34
nvme4n1   4059.42  4997.77  1604.44    132.61   29.58   20.04     58.51      0.57    0.33     0.23     1.54   0.11  18.35
nvme6n1   4059.43  4997.77  1604.45    132.61   29.58   20.04     58.51      0.57    0.33     0.23     1.53   0.11  18.35
nvme3n1   4059.43  4997.77  1604.45    132.61   29.58   20.04     58.51      0.57    0.33     0.23     1.55   0.11  18.35
nvme0n1      0.00     0.00    25.49     43.11    0.04    0.55     17.77      0.00    0.06     0.01     0.09   0.11   0.78
nvme9n1   4059.43  4997.77  1604.46    132.61   29.58   20.04     58.51      0.48    0.27     0.23     0.85   0.11  18.31
nvme8n1   4059.42  4997.77  1604.45    132.60   29.58   20.04     58.51      0.48    0.27     0.23     0.85   0.11  18.31
sda          6.30     0.07     9.43      3.63    0.91    0.12    161.00      0.03    2.37     0.41     7.44   0.38   0.50
sdb          6.30     0.07     6.71      3.63    0.81    0.12    184.66      0.04    3.58     0.46     9.36   0.44   0.45
md127        0.00     0.00     3.52      2.27    0.14    0.12     92.69      0.00    0.00     0.00     0.00   0.00   0.00
md126        0.00     0.00     0.01      0.00    0.00    0.00     61.53      0.00    0.00     0.00     0.00   0.00   0.00
dm-0         0.00     0.00     2.67      1.91    0.04    0.03     30.92      0.04    8.45     0.22    19.95   2.82   1.29
md0          0.00     0.00  2547.63  41027.40   77.07  160.26     11.15      0.00    0.00     0.00     0.00   0.00   0.00
zd0          0.00     0.00  1675.25    174.80   31.92    0.68     36.09      0.05    0.03     0.03     0.01   0.02   2.94
dm-1         0.00     0.00     0.85      0.26    0.10    0.09    358.75      0.25  224.87     0.80   957.64   3.03   0.33

but not the order of magnitude larger peak loads.
Possible candidate for zfs-2.2.4-staging?
@mmatuska yes, I haven't seen any reports of this causing problems. |
Currently, zvol uses a single taskq, resulting in throughput bottleneck under heavy load due to lock contention on the single taskq. This patch addresses the performance bottleneck under heavy load conditions by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scale dynamically based on the available CPUs in the system, as illustrated below:

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32      32
2       1       32      32
4       1       32      32
8       2       16      32
16      3       11      33
32      5       7       35
64      8       8       64
128     11      12      132
256     16      16      256

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15992
After upgrading to OpenZFS version 2.2.4 (kernel 5.15), I have observed worse sequential write performance on zvols under various workloads, as shown in the chart below:

The introduction of multiple taskqs appears to negatively impact performance, especially when using a small number of threads. Even with more threads, the performance improvement is not as significant as what is shown in the graph from ixhamza. I'm confident that the commit introducing multiple taskqs is responsible for the regression, because I reverted it in OpenZFS 2.2.4 and performance returned to what I experienced with version 2.2.3. Experimentally, I found that setting

It is possible that this issue is hardware-dependent. I performed the tests on the following hardware:
I noticed that the author of the commit performed their tests on an AMD EPYC 7543, so it seems the test results might be hardware-dependent. Additionally, I ran tests on different hardware and observed similar behavior:
What I find most interesting is the lack of growth over ZFS 2.2.3 for more threads. Unless I'm misunderstanding something... FIO configuration:
@arturpzol - This change helps avoid single taskq lock contention, which can become a bottleneck on fast storage under a heavy workload. If you are not experiencing lock contention, this change may not be beneficial. You might also consider adjusting
Setting
@ixhamza setting
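For anyone who wants to experiment with the taskq count on their own hardware, the tunable added by this change appears to be a zfs module parameter; the sketch below assumes it is named zvol_num_taskqs (verify the name against the zfs(4) man page for your version, since the name here is an assumption):

```sh
# Read the current value; 0 means the count is auto-scaled from the CPU count.
cat /sys/module/zfs/parameters/zvol_num_taskqs

# Pin it to a single taskq (the pre-change behavior) at module load time.
echo "options zfs zvol_num_taskqs=1" >> /etc/modprobe.d/zfs.conf
```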
Motivation and Context
The current implementation of zvol uses a single taskq, leading to lock contention under heavy load and consequently decreased throughput. Introducing multiple taskqs and implementing a switch based on IO offset can alleviate this lock contention, thus improving overall throughput.
Description
Currently, zvol uses a single taskq, resulting in throughput bottleneck under heavy load due to lock contention on the single taskq. This patch addresses the performance bottleneck under heavy load conditions by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scales dynamically based on the available CPUs in the system, as illustrated below:

                taskq   total
cpus    taskqs  threads threads
------- ------- ------- -------
1       1       32      32
2       1       32      32
4       1       32      32
8       2       16      32
16      3       11      33
32      5       7       35
64      8       8       64
128     11      12      132
256     16      16      256
How Has This Been Tested?
FIO tests were conducted using the above configuration, with all jobs manually configured with different offsets and sizes to ensure they do not overlap. The benchmarks for sequential reads are provided below for reference. In the graph, the blue line represents the performance of the current single taskq implementation, while the other lines depict the performance after switching to multiple taskqs with different offsets. As the number of FIO jobs increases (e.g., at 32 jobs), the performance with multiple taskqs is nearly double that of the current single taskq implementation.
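The exact fio job file is not reproduced above; the sketch below illustrates the non-overlapping-offset approach described, with the zvol path, block size, and region size as placeholder assumptions rather than the author's actual configuration:

```sh
# 32 sequential-read jobs, each confined to its own 16G region of the zvol
# (job N starts at N * 16G), so the jobs never overlap.
fio --name=zvol-seqread --filename=/dev/zd0 --rw=read --bs=128k \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=32 \
    --size=16G --offset_increment=16G --group_reporting
```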
Flame graph with 32 fio jobs prior to changes:
Flame graph with 32 fio jobs with changes applied:
Types of changes
Checklist:
Signed-off-by