
Improve zfs receive performance by batching writes #10099

Merged
merged 1 commit into from
Mar 16, 2020

Conversation

ahrens
Member

@ahrens ahrens commented Mar 4, 2020

Motivation and Context

For each WRITE record in the stream, zfs receive creates a DMU
transaction (dmu_tx_create()) and writes this block's data into the
object. If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant. For example, in some
workloads the receive_writer thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

Description

To improve performance of zfs receive, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all. By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

How Has This Been Tested?

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@ahrens ahrens added Type: Performance Performance improvement or performance problem Status: Code Review Needed Ready for review and testing Component: Send/Recv "zfs send/recv" feature labels Mar 4, 2020
@gmelikov
Member

gmelikov commented Mar 5, 2020

This commit improves the performance of dd if=stream | zfs recv
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Could you please clarify average recordsize of stream?

@ahrens
Member Author

ahrens commented Mar 5, 2020

@gmelikov Around 2.5KB (recordsize is 8K, it compresses well, and using zfs send -c).

Contributor

@pcd1193182 pcd1193182 left a comment


Would it be possible/desirable in the future to extend this to work better on incremental streams, possibly by allowing small gaps between writes that we would rewrite?

@ahrens
Member Author

ahrens commented Mar 6, 2020

@pcd1193182

Would it be possible/desirable in the future to extend this to work better on incremental streams, possibly by allowing small gaps between writes that we would rewrite?

Gaps between writes are currently allowed. Any writes within zfs_recv_write_batch_size bytes of file offset will be batched together.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 9, 2020
@behlendorf
Contributor

It appears this change introduces a leak in the arc_buf_hdr_t_full cache, which explains the large number of TEST failures. It's only caught when unloading the module and tearing down the caches.

http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20Coverage%20%28TEST%29/builds/1106/steps/shell_9/logs/console

[27190.583749] kmemleak: 31 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
[27194.011461] =============================================================================
[27194.017365] BUG arc_buf_hdr_t_full (Tainted: P           OE  ): Objects remaining in arc_buf_hdr_t_full on __kmem_cache_shutdown()
[27194.025360] -----------------------------------------------------------------------------

[27194.031991] INFO: Slab 0xffffea000ba96200 objects=35 used=2 fp=0xffff8802ea58ac88 flags=0x2fffe000008100
[27194.038269] CPU: 1 PID: 8033 Comm: rmmod Tainted: P    B      OE   4.13.0-coverage1 #2
[27194.038270] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[27194.038271] Call Trace:
[27194.038278]  dump_stack+0xb0/0xe6
[27194.038281]  slab_err+0xb6/0xd0
[27194.038283]  ? __kmalloc+0x2fe/0x3c0
[27194.038285]  ? kzalloc+0x17/0x30
[27194.038287]  __kmem_cache_shutdown+0x1f1/0x500
[27194.038290]  shutdown_cache+0x1c/0x180
[27194.038291]  kmem_cache_destroy+0x234/0x2a0
[27194.038312]  spl_kmem_cache_destroy+0x116/0x4b0 [spl]
[27194.038315]  ? vfree+0x5a/0xf0
[27194.038325]  ? spl_kmem_free_impl+0x44/0x50 [spl]
[27194.038526]  buf_fini+0xa5/0xf0 [zfs]
[27194.038640]  arc_fini+0x213/0x340 [zfs]
[27194.038757]  dmu_fini+0x16/0xb0 [zfs]
[27194.038888]  spa_fini+0x71/0x210 [zfs]
[27194.039023]  zfs_kmod_fini+0xbb/0x150 [zfs]
[27194.039160]  _fini+0x1c/0xc3c [zfs]
[27194.039164]  SyS_delete_module+0x336/0x3e0

@behlendorf behlendorf added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Accepted Ready to integrate (reviewed, tested) labels Mar 10, 2020
@ahrens ahrens force-pushed the recv_batch branch 2 times, most recently from 1072663 to f5ccd53 on March 12, 2020 at 23:36
@codecov-io

codecov-io commented Mar 13, 2020

Codecov Report

Merging #10099 into master will decrease coverage by 0.11%.
The diff coverage is 80.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10099      +/-   ##
==========================================
- Coverage   79.44%   79.33%   -0.12%     
==========================================
  Files         385      385              
  Lines      122385   122435      +50     
==========================================
- Hits        97224    97129      -95     
- Misses      25161    25306     +145     
Flag Coverage Δ
#kernel 79.51% <88.31%> (-0.06%) ⬇️
#user 66.77% <0.00%> (-0.20%) ⬇️
Impacted Files Coverage Δ
module/zfs/dmu_recv.c 76.76% <80.00%> (+0.68%) ⬆️
lib/libzutil/zutil_pool.c 38.63% <0.00%> (-54.55%) ⬇️
module/zfs/vdev_indirect.c 75.33% <0.00%> (-10.67%) ⬇️
module/os/linux/spl/spl-kmem-cache.c 75.22% <0.00%> (-9.23%) ⬇️
module/zfs/space_map.c 93.81% <0.00%> (-4.71%) ⬇️
module/zfs/spa_checkpoint.c 93.78% <0.00%> (-4.35%) ⬇️
module/os/linux/zfs/zfs_dir.c 81.79% <0.00%> (-1.39%) ⬇️
module/icp/api/kcf_mac.c 38.85% <0.00%> (-1.15%) ⬇️
module/zfs/dsl_userhold.c 89.17% <0.00%> (-1.12%) ⬇️
module/os/linux/zfs/vdev_disk.c 84.00% <0.00%> (-1.10%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3fe62c...71e0db9. Read the comment docs.

@ahrens
Member Author

ahrens commented Mar 13, 2020

I've fixed the leak, thanks for pointing that out @behlendorf. I don't think the other failures are caused by my changes, but could you also check?

@ahrens ahrens added Status: Code Review Needed Ready for review and testing and removed Status: Revision Needed Changes are required for the PR to be accepted labels Mar 13, 2020
@behlendorf
Contributor

Looks good, and the other failures look unrelated to me. Would you mind rebasing this on master one last time? I'd like to see if we can get a full run on the coverage builder with the kmemleak checker to verify the updated PR.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 13, 2020
@ahrens ahrens force-pushed the recv_batch branch 2 times, most recently from e1e5128 to 444e212 on March 13, 2020 at 17:55
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object.  If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant.  For example, in some
workloads the `receive_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all.  By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
@behlendorf
Contributor

Everything looks good. Thanks for sorting out that leak.

@behlendorf behlendorf merged commit 7261fc2 into openzfs:master Mar 16, 2020
@matveevandrey matveevandrey mentioned this pull request Apr 16, 2020
12 tasks
ahrens added a commit to ahrens/zfs that referenced this pull request Apr 29, 2020
ahrens added a commit to ahrens/zfs that referenced this pull request May 12, 2020
If `receive_writer_thread()` gets an error from `receive_process_record()`,
it should be saved in `rwa->err` so that we will stop processing records,
and the main thread will notice that the receive has failed.

When an error is first encountered, this happens correctly.  However, if
there are more records to dequeue, the next time through the loop we
will reset `rwa->err` to zero, allowing us to try to process the
following record (2 after the failed record).  Depending on what types
of records remain, we may incorrectly complete the receive
"successfully", but without actually having processed all the records.

The fix is to only set `rwa->err` if we got a *non-zero* error.

This bug was introduced by openzfs#10099 "Improve zfs receive performance by
batching writes".

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
ahrens added a commit to ahrens/zfs that referenced this pull request May 14, 2020
behlendorf pushed a commit that referenced this pull request May 15, 2020
Closes #10320
ahrens added a commit to ahrens/zfs that referenced this pull request May 15, 2020
as-com pushed a commit to as-com/zfs that referenced this pull request Jun 20, 2020
as-com pushed a commit to as-com/zfs that referenced this pull request Jun 20, 2020
@georgeyil georgeyil mentioned this pull request Nov 11, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021