add utility to convert deduplicated send stream to normal #10124
Regarding the interface, I agree with your observations above about the various options feeling a bit clunky. I think that it would be desirable if we could avoid introducing a new custom utility for just this purpose. As an alternative, what do you think about renaming `zstreamdump` to `zstream`?
This would have the advantage that, as a "new" utility, we could logically extend it as needed to handle any of our future userspace stream processing needs. There have been more than a few good ideas suggested over the years. One last consideration might be that…
@behlendorf I like the idea of consolidating the various send-stream processing utilities into one command. And as you indicated above, we can say that all the subcommands take a stream file as the argument, but some can also get the stream from STDIN. I'd like to make the different modes subcommands (like the `zfs` and `zpool` commands) rather than flags, so a summary like:
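(The summary block itself did not survive in this copy. A hypothetical reconstruction, based only on the `zstream dump` and `zstream redup` subcommands described in the commit message in this thread, might look like:)

```
zstream dump [-Cvd] FILE
zstream redup [-v] FILE
```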
Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely.

The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`.

The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.)

For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated).

For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream cannot be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record).

This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup".
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10124
Closes openzfs#10156
(cherry picked from commit c618f87)
Describe the problem you're observing
As described in #7887 and #10117, we will be deprecating deduplicated send and receive. To ease migration to the new dedup-send-less world, we will add a utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely.
This issue serves to discuss the interface to this utility, to achieve consensus prior to writing the manpages etc.
Proposed solution
The initial proposed interface is:

```
zstreamredup [-v] DEDUP_STREAM_FILE | ...
```

The DEDUP_STREAM_FILE argument is a file that contains a send stream generated by `zfs send -D ...`. This includes incremental and full streams, as well as "replication" streams (generated by `zfs send -RD ...`). The equivalent non-dedup ("redup'ed") send stream will be output on STDOUT.

The `-v` or `--verbose` flag will output details of the conversion process on STDERR. The output format will be human-readable and subject to change in the future.

The utility would typically be used like `zstreamredup file.zstream | zfs receive ...`.

Alternatives considered
The command name is analogous to the existing `zstreamdump`, but it still feels inelegant to me, so I'd welcome suggestions for a better name. Alternatives that I considered are `zfs redup [-v] DEDUP_STREAM_FILE | ...` or `zfs send --redup [-v] DEDUP_STREAM_FILE | ...`. I prefer to add a new utility because (1) the argument type (a file) is very different from other `zfs` subcommands, and (2) the utility will typically not be used, and even when it is, its utility will be for a limited time. So although we intend to maintain the utility indefinitely, I didn't want to clutter the main user interface with this vestige of the past.

I also considered having `zfs receive ... <FILE` automatically do the conversion (provided that STDIN is seekable, i.e. not a pipe). However:

- It would not help `zfs receive` with a non-seekable (e.g. pipe) STDIN.
- For a `send -R` stream, we can't determine if we need to do this until we've read the 2nd BEGIN record. The code is structured such that it's nontrivial to change the input fd to the new pipe this late in the `zfs receive` process.

It might be reasonable for this functionality to be built into `zstreamdump`, e.g. `zstreamdump --redup FILE | ...` or `zstreamdump --redup <file | ...`. However, this suffers the same problems of having a different type of argument than normal (taking the stream from a FILE argument vs STDIN), or new requirements on STDIN (that it be seekable). Additionally, it would have a different type of output (binary stream vs human-readable text). But given that `zstreamdump` is not part of the "main" user interface, I'd be OK with accepting one of these interface issues.

Current status
I've already implemented the core functionality in libzfs, and it seems to be working. I'd like to get consensus on the interface before tackling the "busywork" of hooking up the build system for a new command, writing a new manpage, etc.
Implementation details

The way this works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`.

Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.)

For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated).

For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why we need the input to be seekable. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record).

This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup".