
Improve zfs receive performance by batching writes #10099

Merged
merged 1 commit into from
Mar 16, 2020

Conversation

ahrens
Member

@ahrens ahrens commented Mar 4, 2020

Motivation and Context

For each WRITE record in the stream, zfs receive creates a DMU
transaction (dmu_tx_create()) and writes this block's data into the
object. If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant. For example, in some
workloads the receive_writer thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

Description

To improve performance of zfs receive, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all. By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

How Has This Been Tested?

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@ahrens ahrens added Type: Performance Performance improvement or performance problem Status: Code Review Needed Ready for review and testing Component: Send/Recv "zfs send/recv" feature labels Mar 4, 2020
@gmelikov
Member

gmelikov commented Mar 5, 2020

This commit improves the performance of dd if=stream | zfs recv
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Could you please clarify average recordsize of stream?

@ahrens
Member Author

ahrens commented Mar 5, 2020

@gmelikov Around 2.5KB (recordsize is 8K, it compresses well, and using zfs send -c).

Contributor

@pcd1193182 pcd1193182 left a comment


Would it be possible/desirable in the future to extend this to work better on incremental streams, possibly by allowing small gaps between writes that we would rewrite?

@ahrens
Member Author

ahrens commented Mar 6, 2020

@pcd1193182

Would it be possible/desirable in the future to extend this to work better on incremental streams, possibly by allowing small gaps between writes that we would rewrite?

Gaps between writes are currently allowed. Any writes within zfs_recv_write_batch_size bytes of file offset will be batched together.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 9, 2020
@behlendorf
Contributor

It appears this change introduces a leak in the arc_buf_hdr_t_full cache, which explains the large number of TEST failures. It's only caught when unloading the module and tearing down the caches.

http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20Coverage%20%28TEST%29/builds/1106/steps/shell_9/logs/console

[27190.583749] kmemleak: 31 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
[27194.011461] =============================================================================
[27194.017365] BUG arc_buf_hdr_t_full (Tainted: P           OE  ): Objects remaining in arc_buf_hdr_t_full on __kmem_cache_shutdown()
[27194.025360] -----------------------------------------------------------------------------

[27194.031991] INFO: Slab 0xffffea000ba96200 objects=35 used=2 fp=0xffff8802ea58ac88 flags=0x2fffe000008100
[27194.038269] CPU: 1 PID: 8033 Comm: rmmod Tainted: P    B      OE   4.13.0-coverage1 #2
[27194.038270] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[27194.038271] Call Trace:
[27194.038278]  dump_stack+0xb0/0xe6
[27194.038281]  slab_err+0xb6/0xd0
[27194.038283]  ? __kmalloc+0x2fe/0x3c0
[27194.038285]  ? kzalloc+0x17/0x30
[27194.038287]  __kmem_cache_shutdown+0x1f1/0x500
[27194.038290]  shutdown_cache+0x1c/0x180
[27194.038291]  kmem_cache_destroy+0x234/0x2a0
[27194.038312]  spl_kmem_cache_destroy+0x116/0x4b0 [spl]
[27194.038315]  ? vfree+0x5a/0xf0
[27194.038325]  ? spl_kmem_free_impl+0x44/0x50 [spl]
[27194.038526]  buf_fini+0xa5/0xf0 [zfs]
[27194.038640]  arc_fini+0x213/0x340 [zfs]
[27194.038757]  dmu_fini+0x16/0xb0 [zfs]
[27194.038888]  spa_fini+0x71/0x210 [zfs]
[27194.039023]  zfs_kmod_fini+0xbb/0x150 [zfs]
[27194.039160]  _fini+0x1c/0xc3c [zfs]
[27194.039164]  SyS_delete_module+0x336/0x3e0

@behlendorf behlendorf added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Accepted Ready to integrate (reviewed, tested) labels Mar 10, 2020
@ahrens ahrens force-pushed the recv_batch branch 2 times, most recently from 1072663 to f5ccd53 on March 12, 2020 at 23:36
@codecov-io

codecov-io commented Mar 13, 2020

Codecov Report

Merging #10099 into master will decrease coverage by 0.11%.
The diff coverage is 80.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10099      +/-   ##
==========================================
- Coverage   79.44%   79.33%   -0.12%     
==========================================
  Files         385      385              
  Lines      122385   122435      +50     
==========================================
- Hits        97224    97129      -95     
- Misses      25161    25306     +145     
Flag Coverage Δ
#kernel 79.51% <88.31%> (-0.06%) ⬇️
#user 66.77% <0.00%> (-0.20%) ⬇️
Impacted Files Coverage Δ
module/zfs/dmu_recv.c 76.76% <80.00%> (+0.68%) ⬆️
lib/libzutil/zutil_pool.c 38.63% <0.00%> (-54.55%) ⬇️
module/zfs/vdev_indirect.c 75.33% <0.00%> (-10.67%) ⬇️
module/os/linux/spl/spl-kmem-cache.c 75.22% <0.00%> (-9.23%) ⬇️
module/zfs/space_map.c 93.81% <0.00%> (-4.71%) ⬇️
module/zfs/spa_checkpoint.c 93.78% <0.00%> (-4.35%) ⬇️
module/os/linux/zfs/zfs_dir.c 81.79% <0.00%> (-1.39%) ⬇️
module/icp/api/kcf_mac.c 38.85% <0.00%> (-1.15%) ⬇️
module/zfs/dsl_userhold.c 89.17% <0.00%> (-1.12%) ⬇️
module/os/linux/zfs/vdev_disk.c 84.00% <0.00%> (-1.10%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3fe62c...71e0db9. Read the comment docs.

@ahrens
Member Author

ahrens commented Mar 13, 2020

I've fixed the leak, thanks for pointing that out @behlendorf. I don't think the other failures are caused by my changes, but could you also check?

@ahrens ahrens added Status: Code Review Needed Ready for review and testing and removed Status: Revision Needed Changes are required for the PR to be accepted labels Mar 13, 2020
@behlendorf
Contributor

Looks good, and the other failures look unrelated to me. Would you mind rebasing this on master one last time? I'd like to see if we can get a full run on the coverage builder with the kmemleak checker to verify the updated PR.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 13, 2020
@ahrens ahrens force-pushed the recv_batch branch 2 times, most recently from e1e5128 to 444e212 on March 13, 2020 at 17:55
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object.  If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant.  For example, in some
workloads the `receive_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all.  By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
@behlendorf
Contributor

Everything looks good. Thanks for sorting out that leak.

@behlendorf behlendorf merged commit 7261fc2 into openzfs:master Mar 16, 2020
@matveevandrey matveevandrey mentioned this pull request Apr 16, 2020
12 tasks
ahrens added a commit to ahrens/zfs that referenced this pull request Apr 29, 2020
ahrens added a commit to ahrens/zfs that referenced this pull request May 12, 2020
If `receive_writer_thread()` gets an error from `receive_process_record()`,
it should be saved in `rwa->err` so that we will stop processing records,
and the main thread will notice that the receive has failed.

When an error is first encountered, this happens correctly.  However, if
there are more records to dequeue, the next time through the loop we
will reset `rwa->err` to zero, allowing us to try to process the
following record (2 after the failed record).  Depending on what types
of records remain, we may incorrectly complete the receive
"successfully", but without actually having processed all the records.

The fix is to only set `rwa->err` if we got a *non-zero* error.

This bug was introduced by openzfs#10099 "Improve zfs receive performance by
batching writes".

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
ahrens added a commit to ahrens/zfs that referenced this pull request May 14, 2020
behlendorf pushed a commit that referenced this pull request May 15, 2020
Closes #10320
ahrens added a commit to ahrens/zfs that referenced this pull request May 15, 2020
as-com pushed a commit to as-com/zfs that referenced this pull request Jun 20, 2020
as-com pushed a commit to as-com/zfs that referenced this pull request Jun 20, 2020
@georgeyil georgeyil mentioned this pull request Nov 11, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021