Add zstream redup command to convert deduplicated send streams #10156

Merged
3 commits merged on Apr 10, 2020

Conversation

@ahrens ahrens (Member) commented Mar 25, 2020

Motivation and Context

Deduplicated send and receive is deprecated. To ease migration to the
new dedup-send-less world, this commit adds a zstream redup utility to
convert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.

#10124

Description

The new zstream command also replaces the functionality of
zstreamdump, by way of the zstream dump subcommand. The
zstreamdump command is replaced by a shell script which invokes
zstream dump.

The way that zstream redup works under the hood is that as we read the
send stream, we build up a hash table which maps from <GUID, object, offset> -> <file_offset>.

Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is drr_toguid, drr_object, drr_offset.)
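
As a rough sketch of that bookkeeping: the redup_entry_t/redup_table_t type
names and the redup_hash_array/numhashbits fields appear in the diff hunks
quoted later in this conversation, but the rde_* field names, the hash
function, and rdt_insert() below are illustrative assumptions rather than the
literal code in zstream_redup.c.

#include <stdint.h>
#include <stdlib.h>

/* One entry per WRITE record: five 8-byte fields = 40 bytes on a 64-bit system. */
typedef struct redup_entry {
        struct redup_entry *rde_next;   /* hash-chain link */
        uint64_t rde_guid;              /* the WRITE record's drr_toguid */
        uint64_t rde_object;            /* drr_object */
        uint64_t rde_offset;            /* drr_offset */
        uint64_t rde_stream_offset;     /* where the record starts in the stream file */
} redup_entry_t;

typedef struct redup_table {
        redup_entry_t **redup_hash_array;       /* power-of-two array of chains */
        uint64_t numhashbits;                   /* log2(number of buckets) */
} redup_table_t;

/* Illustrative hash of the <guid, object, offset> key to a bucket index. */
static uint64_t
rdt_bucket(const redup_table_t *rdt, uint64_t guid, uint64_t object,
    uint64_t offset)
{
        uint64_t h = guid ^ (object * 2654435761ULL) ^ (offset >> 9);
        return (h & ((1ULL << rdt->numhashbits) - 1));
}

/* Called for every WRITE record seen while scanning the stream file. */
static void
rdt_insert(redup_table_t *rdt, uint64_t guid, uint64_t object,
    uint64_t offset, uint64_t stream_offset)
{
        uint64_t b = rdt_bucket(rdt, guid, object, offset);
        redup_entry_t *rde = calloc(1, sizeof (*rde));

        if (rde == NULL)
                abort();        /* the real code uses a safe_calloc()-style wrapper */
        rde->rde_guid = guid;
        rde->rde_object = object;
        rde->rde_offset = offset;
        rde->rde_stream_offset = stream_offset;

        /* Prepend to the bucket's chain. */
        rde->rde_next = rdt->redup_hash_array[b];
        rdt->redup_hash_array[b] = rde;
}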

For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).

For WRITE_BYREF records, we change them to WRITE records. We find the
referenced WRITE record by looking in the hash table (for the record
with key drr_refguid, drr_refobject, drr_refoffset), and then reading
the record header and payload from the specified offset in the stream
file. This is why the stream cannot be a pipe. The found WRITE record
replaces the WRITE_BYREF record, with its drr_toguid, drr_object,
and drr_offset fields changed to be the same as the WRITE_BYREF's
(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).
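
Continuing the sketch above (the drr_* field names are the ones referenced in
this description, and dmu_replay_record_t is the stream record structure; the
helper name, the include set, and the error handling are assumptions, and the
payload copy and running-checksum update are omitted):

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/zfs_ioctl.h>      /* dmu_replay_record_t and the drr_* structs */

/*
 * Convert one WRITE_BYREF record into the equivalent WRITE record by
 * re-reading the referenced WRITE record from the (seekable) stream file.
 */
static void
convert_write_byref(redup_table_t *rdt, int stream_fd, int outfd,
    const struct drr_write_byref *wbr)
{
        uint64_t b = rdt_bucket(rdt, wbr->drr_refguid,
            wbr->drr_refobject, wbr->drr_refoffset);
        redup_entry_t *rde;
        dmu_replay_record_t drr;

        /* Look up the WRITE record that this BYREF refers back to. */
        for (rde = rdt->redup_hash_array[b]; rde != NULL;
            rde = rde->rde_next) {
                if (rde->rde_guid == wbr->drr_refguid &&
                    rde->rde_object == wbr->drr_refobject &&
                    rde->rde_offset == wbr->drr_refoffset)
                        break;
        }
        assert(rde != NULL);    /* a valid dedup stream always has the referent */

        /* Re-read the referenced record header; this is why a pipe won't do. */
        if (pread(stream_fd, &drr, sizeof (drr),
            rde->rde_stream_offset) != sizeof (drr))
                abort();

        /* Retarget it at the block the WRITE_BYREF was writing. */
        drr.drr_u.drr_write.drr_toguid = wbr->drr_toguid;
        drr.drr_u.drr_write.drr_object = wbr->drr_object;
        drr.drr_u.drr_write.drr_offset = wbr->drr_offset;

        /* Emit it in place of the BYREF (payload copy and checksum omitted). */
        if (write(outfd, &drr, sizeof (drr)) != sizeof (drr))
                abort();
}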

This algorithm requires memory proportional to the number of WRITE
records (same as zfs send -D), but the size per WRITE record is
relatively low (40 bytes, vs. 72 for zfs send -D). A 1TB send stream
with 8KB blocks (recordsize=8k) would use around 5GB of RAM to
"redup".

The new manpage is reproduced here:

ZSTREAM(8)                BSD System Manager's Manual               ZSTREAM(8)

NAME
     zstream — manipulate zfs send streams

SYNOPSIS
     zstream dump [-Cvd] [file]
     zstream redup [-v] file

DESCRIPTION
     The zstream utility manipulates zfs send streams, which are the output of
     the zfs send command.

     zstream dump [-Cvd] [file]
       Print information about the specified send stream, including headers
       and record counts.  The send stream may either be in the specified
       file, or provided on standard input.

       -C  Suppress the validation of checksums.

       -v  Verbose.  Print metadata for each record.

       -d  Dump data contained in each record.  Implies verbose.

     zstream redup [-v] file
       Deduplicated send streams can be generated by using the zfs send -D
       command.  The ability to send deduplicated send streams is deprecated.
       In the future, the ability to receive a deduplicated send stream with
       zfs receive will be removed.  However, deduplicated send streams can
       still be received by utilizing zstream redup.

       The zstream redup command is provided a file containing a deduplicated
       send stream, and outputs an equivalent non-deduplicated send stream on
       standard output.  Therefore, a deduplicated send stream can be received
       by running:

       # zstream redup DEDUP_STREAM_FILE | zfs receive ...

       -v  Verbose.  Print summary of converted records.

SEE ALSO
     zfs(8), zfs-send(8), zfs-receive(8)

Linux                           March 25, 2020                           Linux

How Has This Been Tested?

Manual testing. I'd also like to add some tests to the ZTS.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Mar 26, 2020

@lundman lundman (Contributor) left a comment

Nothing juicy, compiles and runs on osx.

} redup_table_t;

static int
high_order_bit(uint64_t n)

Contributor:

On one hand, should we use highbit() / highbit64(), since I had to add that to the Windows porting layer already? But on the other hand, it is also nice that it's just part of the file.

Member Author (@ahrens):

Unfortunately highbit64 is not in libzfs. It's only in the kernel, libzpool, and the zpool command (zpool_util.c). Seems like something that could/should be moved to libzfs_util.c. For now I've at least made this consistent with the naming and definition of highbit64().
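
For reference, a minimal helper with highbit64() semantics looks like the
sketch below (the in-tree definitions may be implemented differently, e.g.
with compiler builtins): it returns the 1-based index of the highest set bit,
or 0 for an input of 0, so for a power-of-two numbuckets,
highbit64(numbuckets) - 1 is log2(numbuckets), which is what the numhashbits
assignment in the next hunk relies on.

#include <stdint.h>

/* Returns the 1-based position of the highest set bit; 0 for an input of 0. */
static int
highbit64(uint64_t i)
{
        int h = 0;

        while (i != 0) {
                h++;
                i >>= 1;
        }
        return (h);
}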

rdt.numhashbits = high_order_bit(numbuckets) - 1;

char *buf = safe_calloc(bufsz);
FILE *ofp = fdopen(infd, "r");

Contributor:

No error checking is probably ok, since we checked infd above.

/*
* Typically the END record is either the last
* thing in the stream, or it is followed
* by a BEGIN record (which also zero's the cheksum).

Contributor:

Typo: "cheksum" should be "checksum"; maybe even "zeros" rather than "zero's".

Member Author (@ahrens):

Thanks, fixed.

#include <stddef.h>
#include <stddef.h>
#include <stdio.h>
#include <stdio.h>

Contributor:

stdio.h and stddef.h included twice.

highbit64(uint64_t i)
{
if (i == 0)
return (0);

Contributor:

It looks like this is indented to the wrong level.

if (!ISP2(numbuckets))
numbuckets = 1ULL << highbit64(numbuckets);

rdt.redup_hash_array = calloc(numbuckets, sizeof (redup_entry_t *));

Contributor:

Did you mean to use safe_calloc here?
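
(For context, a safe_calloc()-style wrapper is typically just calloc() plus
exit-on-failure; a sketch, assuming the single-size signature seen in the
safe_calloc(bufsz) call quoted earlier, with an illustrative error message:)

#include <stdio.h>
#include <stdlib.h>

/* Allocate zeroed memory or exit; never returns NULL. */
static void *
safe_calloc(size_t n)
{
        void *p = calloc(1, n);

        if (p == NULL) {
                (void) fprintf(stderr, "out of memory\n");
                exit(1);
        }
        return (p);
}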

@@ -182,6 +182,7 @@ export ZFS_FILES='zdb
dbufstat
zed
zgenhostid
zstream

Contributor:

To provide some test coverage for the new utility, how about extending the existing rsend/send-cD.ksh and cli_root/zfs_receive/zfs_receive_013_pos.ksh tests to additionally use zstream?

Member Author (@ahrens):

Thanks for pointing me to those tests. I've updated them, please take a look and let me know if that's what you had in mind.

@ahrens ahrens added the Component: Send/Recv "zfs send/recv" feature label Apr 7, 2020

@behlendorf behlendorf (Contributor) left a comment

Looks good, that's exactly what I had in mind for the tests.

#include <stddef.h>
#include <libzfs.h>
#include "zstream.h"

Contributor:

nit: Double blank line

@@ -215,7 +205,7 @@ sprintf_bytes(char *str, uint8_t *buf, uint_t buf_len)
}

int
main(int argc, char *argv[])
zstream_do_dump(int argc, char *argv[])

Contributor:

Did you decide to have the zstream_do_* functions in separate files for this just because zstreamdump was already its own utility, or do you think this is a better design for things like zfs_do_* and zpool_do_* as well?

Member Author (@ahrens):

For this case, it worked especially well because zstreamdump was already in its own file, and also the dump and redup functionalities each have a bunch of code, but don't share much code. It's probably a less clear win for zfs_main.c / zpool_main.c, but even so it probably would be cleaner to have those broken up into one file per subcommand as well. Originally we thought that zfs_main.c / zpool_main.c would be pretty thin, with most of the functionality in libzfs. That's still mostly the case, but a few of the subcommands have grown a bit unwieldy. This also relates to the proposal for a new, higher-level zfs API: https://openzfs.topicbox.com/groups/developer/Tdde1f0006baa1227-M4c1229e160c31935bc0ff42b


extern int zstream_do_redup(int, char *[]);
extern int zstream_do_dump(int, char *[]);
extern void usage(void);

Contributor:

Having usage defined in a header is slightly awkward, since if some other program wants to include this header it may conflict with how they want to define their own usage function.

@ahrens ahrens (Member Author) commented Apr 8, 2020

I don't think another program can include this header. It isn't installed (hence needing to use quotes to include it), and the functions aren't compiled into a library. It's only used by zstream. But I'll go ahead and rename it to zstream_usage().
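
For context, the per-subcommand layout discussed here boils down to a thin
dispatcher in zstream.c, roughly like the sketch below (option handling and
the exact usage text are omitted, the real main() may differ in details, and
it assumes the zstream.h declarations quoted above with usage() renamed to
zstream_usage()):

#include <stdio.h>
#include <string.h>
#include "zstream.h"

int
main(int argc, char *argv[])
{
        if (argc < 2) {
                zstream_usage();
                return (1);
        }

        /* Hand the remaining arguments to the chosen subcommand. */
        if (strcmp(argv[1], "dump") == 0)
                return (zstream_do_dump(argc - 1, argv + 1));
        if (strcmp(argv[1], "redup") == 0)
                return (zstream_do_redup(argc - 1, argv + 1));

        (void) fprintf(stderr, "unknown subcommand '%s'\n", argv[1]);
        zstream_usage();
        return (1);
}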

}

fletcher_4_init();
int err = zfs_redup_stream(fd, STDOUT_FILENO, verbose);

Contributor:

Shouldn't we print error messages here for known error cases like ESPIPE?

@ahrens ahrens (Member Author) commented Apr 8, 2020

Given that this isn't a library, I think we can actually remove the ESPIPE check. I think that the other error cases generally print to stderr and then exit. We could even make zfs_redup_stream() return void. (And if this function is used incorrectly, with a non-seekable fd, sfread() will print and exit.)
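
For illustration, the kind of up-front check being discussed (and which can
simply be dropped in favor of letting sfread() fail later) would look roughly
like this; require_seekable() and filename are hypothetical names, not code
from the PR:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Exit with a friendly message if fd is a pipe rather than a seekable file. */
static void
require_seekable(int fd, const char *filename)
{
        if (lseek(fd, 0, SEEK_CUR) == -1 && errno == ESPIPE) {
                (void) fprintf(stderr,
                    "zstream redup: '%s' must be a seekable file, not a pipe\n",
                    filename);
                exit(1);
        }
}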

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 9, 2020
@codecov-io

Codecov Report

Merging #10156 into master will decrease coverage by 0.41%.
The diff coverage is 70.75%.

@@            Coverage Diff             @@
##           master   #10156      +/-   ##
==========================================
- Coverage   79.36%   78.95%   -0.42%     
==========================================
  Files         385      387       +2     
  Lines      122589   122789     +200     
==========================================
- Hits        97290    96943     -347     
- Misses      25299    25846     +547     
Flag       Coverage Δ
#kernel    79.72% <ø> (-0.01%) ⬇️
#user      62.74% <70.75%> (-3.21%) ⬇️

Impacted Files                     Coverage Δ
cmd/zstream/zstream_dump.c         51.33% <28.57%> (ø)
cmd/zstream/zstream.c              58.33% <58.33%> (ø)
cmd/zstream/zstream_redup.c        74.59% <74.59%> (ø)
lib/libzfs/libzfs_sendrecv.c       76.45% <100.00%> (-0.01%) ⬇️
module/os/linux/spl/spl-zlib.c     55.35% <0.00%> (-28.58%) ⬇️
module/zfs/vdev_indirect.c         74.00% <0.00%> (-11.00%) ⬇️
module/zfs/dsl_scan.c              79.54% <0.00%> (-6.10%) ⬇️
cmd/zvol_id/zvol_id_main.c         76.31% <0.00%> (-5.27%) ⬇️
module/lua/lmem.c                  83.33% <0.00%> (-4.17%) ⬇️
module/zfs/arc.c                   78.39% <0.00%> (-3.49%) ⬇️
... and 54 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e3df9d...7b84a1a.

@behlendorf behlendorf merged commit c618f87 into openzfs:master Apr 10, 2020
@behlendorf behlendorf mentioned this pull request Apr 10, 2020
@ahrens ahrens mentioned this pull request Apr 15, 2020
behlendorf pushed a commit that referenced this pull request Apr 23, 2020
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated.  Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.

This commit removes the code for sending and receiving deduplicated send
streams.  `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream.  `zfs receive` of
a deduplicated send stream will print an error message and fail.

The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.

Several new tests are added which leverage `zstream redup`.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156
Closes #10212
as-com pushed a commit to as-com/zfs that referenced this pull request Jun 20, 2020
as-com pushed a commit to as-com/zfs that referenced this pull request Jun 20, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021