add utility to convert deduplicated send stream to normal #10124

Closed
ahrens opened this issue Mar 12, 2020 · 2 comments

@ahrens
Member

ahrens commented Mar 12, 2020

Describe the problem you're observing

As described in #7887 and #10117, we will be deprecating deduplicated send and receive. To ease migration to the new dedup-send-less world, we will add a utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely.

This issue serves to discuss the interface to this utility, to achieve consensus prior to writing the manpages etc.

Proposed solution

The initial proposed interface is:

zstreamredup [-v] DEDUP_STREAM_FILE | ...

The DEDUP_STREAM_FILE argument is a file that contains a send stream generated by zfs send -D .... This includes incremental and full streams, as well as "replication" streams (generated by zfs send -RD ...).

The equivalent non-dedup ("redup'ed") send stream will be output on STDOUT.

The -v or --verbose flag will output details of the conversion process on STDERR. The output format will be human-readable and subject to change in the future.

The utility would typically be used like zstreamredup file.zstream | zfs receive ....

Alternatives considered

The command name is analogous to the existing zstreamdump, but it still feels inelegant to me, so I'd welcome suggestions for a better name.

Alternatives that I considered are zfs redup [-v] DEDUP_STREAM_FILE | ... or zfs send --redup [-v] DEDUP_STREAM_FILE | .... I prefer to add a new utility because (1) the argument type (a file) is very different from other zfs subcommands, and (2) the utility will rarely be needed, and even when it is, it will be useful only for a limited time. So although we intend to maintain the utility indefinitely, I didn't want to clutter the main user interface with this vestige of the past.

I also considered having zfs receive ... <FILE automatically do the conversion (provided that STDIN is seekable, i.e. not a pipe). However:

  1. The user interface is less clear, because of the additional requirement on STDIN, and the fact that it's common to use zfs receive with a non-seekable (e.g. pipe) STDIN.
  2. The implementation would be more complex. We would need to (only sometimes) create a new thread and pipe to process the input. (Note that we don't want the performance overhead of the additional pipe in the normal (non-dedup) case.) In the case of a send -R stream, we can't determine if we need to do this until we've read the 2nd BEGIN record. The code is structured such that it's nontrivial to change the input fd to the new pipe this late in the zfs receive process.
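
To make the "STDIN must be seekable" requirement concrete, here is a minimal Python sketch of how a tool could tell a redirected file from a pipe. The helper name `is_seekable` is made up for illustration; it is not taken from libzfs or from the actual implementation.

```python
import os
import stat
import sys


def is_seekable(fd: int) -> bool:
    """Best-effort check that a file descriptor can be repositioned."""
    mode = os.fstat(fd).st_mode
    # Pipes and sockets can never be rewound.
    if stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode):
        return False
    try:
        # A no-op seek fails with ESPIPE on non-seekable descriptors.
        os.lseek(fd, 0, os.SEEK_CUR)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    kind = "seekable" if is_seekable(sys.stdin.fileno()) else "not seekable"
    print(f"stdin is {kind}", file=sys.stderr)
```

Run with `<FILE` the check would report a seekable STDIN; run at the end of a pipeline it would not, which is the case where automatic conversion inside zfs receive could not work.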

It might be reasonable for this functionality to be built into zstreamdump, e.g. zstreamdump --redup FILE | ... or zstreamdump --redup <file | .... However, this suffers the same problems of having different types of arguments than normal (taking the stream from a FILE argument vs STDIN), or new requirements on STDIN (that it be seekable). Additionally, it would have a different type of output (binary stream vs human-readable text). But given that zstreamdump is not part of the "main" user interface, I'd be OK with accepting one of these interface issues.

Current status

I've already implemented the core functionality in libzfs, and it seems to be working. I'd like to get consensus on the interface before tackling the "busywork" of hooking up the build system for a new command, writing a new manpage, etc.

Implementation details

The way this works under the hood is that as we read the send stream, we build up a hash table which maps from <GUID, object, offset> -> <file_offset>.

Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is drr_toguid, drr_object, drr_offset.)

For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated).
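
The reason the checksum has to be recalculated even for unchanged records is that it is a running checksum: it covers every byte emitted so far, so once one WRITE_BYREF has been expanded, all later checksum values differ. The sketch below assumes a Fletcher-4 style checksum over little-endian 32-bit words; the function name and the tuple representation of the state are illustrative choices, not the libzfs API.

```python
import struct

MASK64 = (1 << 64) - 1  # the four accumulators wrap at 64 bits


def fletcher4_update(state, data):
    """Fold `data` (length must be a multiple of 4) into a running checksum."""
    a, b, c, d = state
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)


# Feeding the output piece by piece gives the same result as one pass,
# which is what lets the checksum be updated record by record.
first = struct.pack("<II", 1, 2)
second = struct.pack("<I", 3)
running = fletcher4_update(fletcher4_update((0, 0, 0, 0), first), second)
assert running == fletcher4_update((0, 0, 0, 0), first + second)
```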

For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key drr_refguid, drr_refobject, drr_refoffset), and then reading the record header and payload from the specified offset in the stream file. This is why we need the input to be seekable. The found WRITE record replaces the WRITE_BYREF record, with its drr_toguid, drr_object, and drr_offset fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record).
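
The steps above can be sketched in Python on a toy stream. The record layout here (`HDR`, the `pack` helper, the numeric record types) is invented for illustration and is much simpler than the real send stream format; the running checksum is left out entirely. Only the hash-table-plus-seek technique is meant to match the description.

```python
import io
import struct

# Hypothetical toy header, NOT the real send stream record layout:
# type, toguid, object, offset, refguid, refobject, refoffset, payload length.
HDR = struct.Struct("<BQQQQQQI")
WRITE, WRITE_BYREF, OTHER = 1, 2, 3


def pack(rtype, toguid=0, obj=0, off=0, ref=(0, 0, 0), payload=b""):
    """Build one toy record (header followed by payload)."""
    return HDR.pack(rtype, toguid, obj, off, *ref, len(payload)) + payload


def redup(src, dst):
    """Copy records from seekable `src` to `dst`, expanding WRITE_BYREF."""
    index = {}  # (guid, object, offset) -> file offset of the WRITE record
    while True:
        pos = src.tell()
        hdr = src.read(HDR.size)
        if not hdr:
            break
        rtype, guid, obj, off, rguid, robj, roff, plen = HDR.unpack(hdr)
        payload = src.read(plen)
        if rtype == WRITE:
            # Remember where this block's WRITE record lives in the file.
            index[(guid, obj, off)] = pos
        elif rtype == WRITE_BYREF:
            # Seek back to the referenced WRITE, borrow its payload, and
            # return to the current position. This is why a pipe won't do.
            resume = src.tell()
            src.seek(index[(rguid, robj, roff)])
            ref_plen = HDR.unpack(src.read(HDR.size))[7]
            payload = src.read(ref_plen)
            src.seek(resume)
            # Emit a WRITE that keeps the BYREF's own guid/object/offset.
            rtype, rguid, robj, roff = WRITE, 0, 0, 0
        dst.write(HDR.pack(rtype, guid, obj, off, rguid, robj, roff,
                           len(payload)) + payload)


src = io.BytesIO(
    pack(OTHER, payload=b"begin")
    + pack(WRITE, 7, 1, 0, payload=b"data")
    + pack(WRITE_BYREF, 9, 2, 4096, ref=(7, 1, 0))
)
dst = io.BytesIO()
redup(src, dst)
```

After this runs, the third record in `dst` is a plain WRITE for `(9, 2, 4096)` carrying the payload `b"data"` copied from the earlier record.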

This algorithm requires memory proportional to the number of WRITE records (same as zfs send -D), but the size per WRITE record is relatively low (40 bytes, vs. 72 for zfs send -D). A 1TB send stream with 8KB blocks (recordsize=8k) would use around 5GB of RAM to "redup".
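
The 5GB figure can be checked with a few lines of Python. This is a rough estimate under stated assumptions: binary units (1TB = 2^40 bytes, 8KB = 2^13 bytes), every byte of the stream counted as WRITE payload, and record headers ignored. The function name is made up for the example.

```python
def redup_memory_bytes(stream_bytes, block_bytes, entry_bytes=40):
    """Estimate hash table size: one entry per WRITE record."""
    records = stream_bytes // block_bytes
    return records * entry_bytes


TIB = 1 << 40
KIB = 1 << 10
GIB = 1 << 30

estimate = redup_memory_bytes(TIB, 8 * KIB)
print(estimate / GIB)  # prints 5.0
```

The same stream at 72 bytes per entry (the figure quoted for zfs send -D) comes to 9 GiB under the same assumptions.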

@ahrens ahrens self-assigned this Mar 12, 2020
@ahrens ahrens added Component: Send/Recv "zfs send/recv" feature Status: Feedback requested More information is requested labels Mar 12, 2020
@behlendorf
Contributor

Regarding the interface, I agree with your observations above about the various options feeling a bit clunky.

I think that it would be desirable if we could avoid introducing a new custom utility for just this purpose. As an alternative, what do you think about renaming zstreamdump to zstream and initially supporting the --dump and --redup options? --dump would behave as zstreamdump does today, and --redup would behave as zstreamredup is described in your proposal.

zstream [-Cvd] --dump [DEDUP_STREAM_FILE]
zstream [-v] --redup DEDUP_STREAM_FILE | ...

This would have the advantage that as a "new" utility we could logically extend it as needed to handle any of our future userspace stream processing needs. There have been more than a few good ideas suggested over the years.

While zstreamdump is not one of the "main" utilities, we could provide a zstreamdump wrapper script for backwards compatibility. Or even just install a zstreamdump symlink if the default zstream behavior was --dump.

One last consideration might be that zstream as a name isn't unique enough. I don't currently see any conflicts in the Ubuntu packages, but it would be easy to mistake zstream for a utility provided by the zutils package (which provides zgrep, zcmp, zless, zdiff, etc.). That said, I still prefer it as a name over something like zfsstream.

@ahrens
Member Author

ahrens commented Mar 23, 2020

@behlendorf I like the idea of consolidating the various send-stream processing utilities into one command. And as you indicated above, we can say that all the subcommands take a stream file as the argument, but some can also get the stream from STDIN. I'd like to make the different modes subcommands (like the zfs and zpool commands) rather than flags, so the summary would look like:

zstream dump [-Cvd] [FILE]
zstream redup [-v] FILE
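
To illustrate the subcommand style (as opposed to `--dump` / `--redup` flags), here is a small Python `argparse` sketch of the same synopsis. The real utility is a C program; this only mirrors the command-line shape, and the `dest` names are arbitrary choices for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="zstream")
    sub = parser.add_subparsers(dest="command", required=True)

    # zstream dump [-Cvd] [FILE]: FILE is optional, so STDIN can be used.
    dump = sub.add_parser("dump")
    dump.add_argument("-C", dest="flag_C", action="store_true")
    dump.add_argument("-v", dest="verbose", action="store_true")
    dump.add_argument("-d", dest="flag_d", action="store_true")
    dump.add_argument("file", nargs="?")

    # zstream redup [-v] FILE: FILE is mandatory because it must be seekable.
    redup = sub.add_parser("redup")
    redup.add_argument("-v", dest="verbose", action="store_true")
    redup.add_argument("file")
    return parser


args = build_parser().parse_args(["redup", "-v", "file.zstream"])
print(args.command, args.verbose, args.file)
```

Each subcommand carries its own flag set and its own rule about whether FILE is required, which is the main thing the flag-based layout could not express cleanly.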

@ahrens ahrens removed the Status: Feedback requested More information is requested label Mar 24, 2020
as-com pushed a commit to as-com/zfs that referenced this issue Jun 20, 2020
Deduplicated send and receive is deprecated.  To ease migration to the
new dedup-send-less world, the commit adds a `zstream redup` utility to
convert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.

The new `zstream` command also replaces the functionality of
`zstreamdump`, by way of the `zstream dump` subcommand.  The
`zstreamdump` command is replaced by a shell script which invokes
`zstream dump`.

The way that `zstream redup` works under the hood is that as we read the
send stream, we build up a hash table which maps from `<GUID, object,
offset> -> <file_offset>`.

Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is `drr_toguid, drr_object, drr_offset`.)

For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).

For WRITE_BYREF records, we change them to WRITE records.  We find the
referenced WRITE record by looking in the hash table (for the record
with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading
the record header and payload from the specified offset in the stream
file.  This is why the stream can not be a pipe.  The found WRITE record
replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`,
and `drr_offset` fields changed to be the same as the WRITE_BYREF's
(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).

This algorithm requires memory proportional to the number of WRITE
records (same as `zfs send -D`), but the size per WRITE record is
relatively low (40 bytes, vs. 72 for `zfs send -D`).  A 1TB send stream
with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to
"redup".

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10124 
Closes openzfs#10156 
(cherry picked from commit c618f87)
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021