add utility to convert deduplicated send stream to normal #10124

ahrens · 2020-03-12T05:47:40Z

Describe the problem you're observing

As described in #7887 and #10117, we will be deprecating deduplicated send and receive. To ease migration to the new dedup-send-less world, we will add a utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely.

This issue serves to discuss the interface to this utility, to achieve consensus prior to writing the manpages etc.

Proposed solution

The initial proposed interface is:

zstreamredup [-v] DEDUP_STREAM_FILE | ...

The DEDUP_STREAM_FILE argument is a file that contains a send stream generated by zfs send -D .... This includes incremental a full streams, as well as "replication" streams (generated by zfs send -RD ...).

The equivalent non-dedup ("redup'ed") send stream will be output on STDOUT.

The -v or --verbose flag will output details of the conversion process on STDERR. The output format will be human-readable and subject to change in the future.

The utility would typically be used like zstreamredup file.zstream | zfs receive ....

Alternatives considered

The command name is analogous to the existing zstreamdump, but it still feels inelegant to me, so I'd welcome suggestions for a better name.

Alternatives that I considered are zfs redup [-v] DEDUP_STREAM_FILE | ... or zfs send --redup [-v] DEDUP_STREAM_FILE | .... I prefer to add a new utility because (1) the argument type (a file) is very different from other zfs subcommands, and (2) the utility will typically not be used, and even when it is, its utility will be for a limited time. So although we intend to maintain the utility indefinitely, I didn't want to clutter the main user interface with this vestige of the past.

I also considered having zfs receive ... <FILE automatically do the conversion (provided that STDIN is seekable, i.e. not a pipe). However:

The user interface is less clear, because of the additional requirement on STDIN, and the fact that it's common to use zfs receive with a non-seekable (e.g. pipe) STDIN.
The implementation would be more complex. We would need to (only sometimes) create a new thread and pipe to process the input. (Note that we don't want the performance overhead of the additional pipe in the normal (non-dedup) case.) In the case of a send -R stream, we can't determine if we need to do this until we've read the 2nd BEGIN record. The code is structured such that it's nontrivial to change the input fd to the new pipe this late in the zfs receive process.

It might be reasonable for this functionality to be built into zstreamdump, e.g. zstreamdump --redup FILE | ... or zstreamdump --redup <file | .... However, this suffers the same problems of having different types of arguments than normal (taking the stream from a FILE argument vs STDIN), or new requirements on STDIN (that it be seekable). Additionally, it would have a different type of output (binary stream vs human-readable text). But given that zstreamdump is not part of the "main" user interface, I'd be OK with accepting one of these interface issues.

Current status

I've already implemented the core functionality in libzfs, and it seems to be working. I'd like to get consensus on the interface before tacking the "busywork" of hooking up the build system for a new command, writing a new manpage, etc.

Implementation details

The way this works under the hood is that as we read the send stream, we build up a hash table which maps from <GUID, object, offset> -> <file_offset>.

Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is drr_toguid, drr_object, drr_offset.)

For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated).

For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key drr_refguid, drr_refobject, drr_refoffset), and then reading the record header and payload from the specified offset in the stream file. This is why we need the input to be seekable. The found WRITE record replaces the WRITE_BYREF record, with its drr_toguid, drr_object, and drr_offset fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record).

This algorithm requires memory proportional to the number of WRITE records (same as zfs send -D), but the size per WRITE record is relatively low (40 bytes, vs. 72 for zfs send -D). A 1TB send stream with 8KB blocks (recordsize=8k) would use around 5GB of RAM to "redup".

The text was updated successfully, but these errors were encountered:

behlendorf · 2020-03-20T16:52:45Z

Regarding the interface, I agree with your observations above about the various options feeling a bit clunky.

I think that it would be desirable if we could avoid introducing a new custom utility for just this purpose. As an alternative, what do think about renaming zstreamdump to zstream and initially supporting the --dump and --redup options. --dump would behave as zstreamdump does today, and --redup would behave as zstreamredup is described in your proposal.

zstream [-Cvd] --dump [DEDUP_STREAM_FILE]
zstream [-v] --redup DEDUP_STREAM_FILE | ...

This would have to advantage that as a "new" utility we could logically extend it as needed to handle any of our future userspace stream processing needs. There have been more than a few good ideas suggested over the years.

While zstreamdump is not one of the "main" utilities, we could provide a zstreamdump wrapper script for backwards compatibility. Or even just install a zstreamdump symlink if the default zstream behavior was --dump.

One last consideration might be that zstream as a name isn't unique enough. I don't currently see any conflicts in the Ubuntu packages, but it would be easy mistake zstream as a utility provided by the zutils package (which provides zgrep, zcmp, zless, zdiff, etc.). That said, I still prefer it as a name over something like zfsstream.

ahrens · 2020-03-23T15:36:38Z

@behlendorf I like the idea of consolidating the various send-stream processing utilities into one command. And as you indicated above, we can say that all the subcommands take a stream file as the argument, but some can also get the stream from STDIN. I'd like to make the different modes subcommands (like the zfs and zpool commands) rather than flags, so a summary like:

zstream dump [-Cvd] [FILE]
zstream redup [-v] FILE

Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely. The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`. The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.) For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated). For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream can not be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record). This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup". Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes openzfs#10124 Closes openzfs#10156 (cherry picked from commit c618f87)

Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely. The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`. The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.) For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated). For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream can not be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record). This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup". Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes openzfs#10124 Closes openzfs#10156

ahrens self-assigned this Mar 12, 2020

ahrens added Component: Send/Recv "zfs send/recv" feature Status: Feedback requested More information is requested labels Mar 12, 2020

ahrens removed the Status: Feedback requested More information is requested label Mar 24, 2020

This was referenced Mar 25, 2020

Compile cityhash code into libzfs #10152

Merged

Add zstream redup command to convert deduplicated send streams #10156

Merged

behlendorf closed this as completed in c618f87 Apr 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add utility to convert deduplicated send stream to normal #10124

add utility to convert deduplicated send stream to normal #10124

ahrens commented Mar 12, 2020

behlendorf commented Mar 20, 2020

ahrens commented Mar 23, 2020

add utility to convert deduplicated send stream to normal #10124

add utility to convert deduplicated send stream to normal #10124

Comments

ahrens commented Mar 12, 2020

Describe the problem you're observing

Proposed solution

Alternatives considered

Current status

Implementation details

behlendorf commented Mar 20, 2020

ahrens commented Mar 23, 2020