-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zstream redup
command to convert deduplicated send streams
#10156
Conversation
716845e
to
5541b33
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing juicy, compiles and runs on osx.
cmd/zstream/zstream_redup.c
Outdated
} redup_table_t; | ||
|
||
static int | ||
high_order_bit(uint64_t n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On one hand, should we use highbit()
/ highbit64()` - since I had to add that to Windows porting layer already, but it is also nice that it's just part of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately highbit64 is not in libzfs. It's only in the kernel, libzpool, and the zpool command (zpool_util.c). Seems like something that could/should be moved to libzfs_util.c. For now I've at least made this consistent with the naming and definition of highbit64().
rdt.numhashbits = high_order_bit(numbuckets) - 1; | ||
|
||
char *buf = safe_calloc(bufsz); | ||
FILE *ofp = fdopen(infd, "r"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No error checking is probably ok, since we checked infd above.
cmd/zstream/zstream_redup.c
Outdated
/* | ||
* Typically the END record is either the last | ||
* thing in the stream, or it is followed | ||
* by a BEGIN record (which also zero's the cheksum). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checksum - maybe even "zeros".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
5541b33
to
fb8310c
Compare
#include <stddef.h> | ||
#include <stddef.h> | ||
#include <stdio.h> | ||
#include <stdio.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stdio.h
and stddef.h
included twice.
cmd/zstream/zstream_redup.c
Outdated
highbit64(uint64_t i) | ||
{ | ||
if (i == 0) | ||
return (0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is indented to the wrong level.
cmd/zstream/zstream_redup.c
Outdated
if (!ISP2(numbuckets)) | ||
numbuckets = 1ULL << highbit64(numbuckets); | ||
|
||
rdt.redup_hash_array = calloc(numbuckets, sizeof (redup_entry_t *)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to use safe_calloc
here?
@@ -182,6 +182,7 @@ export ZFS_FILES='zdb | |||
dbufstat | |||
zed | |||
zgenhostid | |||
zstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To provide some test coverage for the new utility how about extending the existing rsend/send-cD.ksh
and cli_root/zfs_receive/zfs_receive_013_pos.ksh
to additionally use zstream
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing me to those tests. I've updated them, please take a look and let me know if that's what you had in mind.
Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely. The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`. The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.) For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated). For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream can not be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record). This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup". Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
0377db7
to
cca4be9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, that's exactly what I had in mind for the tests.
#include <stddef.h> | ||
#include <libzfs.h> | ||
#include "zstream.h" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Double blank line
@@ -215,7 +205,7 @@ sprintf_bytes(char *str, uint8_t *buf, uint_t buf_len) | |||
} | |||
|
|||
int | |||
main(int argc, char *argv[]) | |||
zstream_do_dump(int argc, char *argv[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you decide to have the zstream_do_* functions in separate files for this just because zstreamdump was already its own utility, or do you think this is a better design for things like zfs_do_* and zpool_do_* as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case, it worked especially well because zstreamdump was already in its own file, and also the dump
and redup
functionalities each have a bunch of code, but don't share much code. It's probably a less clear win for zfs_main.c / zpool_main.c, but even so it probably would be cleaner to have those broken up into one file per subcommand as well. Originally we thought that zfs_main.c / zpool_main.c would be pretty thin, with most of the functionality in libzfs. That's still mostly the case, but a few of the subcommands have grown a bit unwieldy. This also relates to the proposal for a new, higher-level zfs API: https://openzfs.topicbox.com/groups/developer/Tdde1f0006baa1227-M4c1229e160c31935bc0ff42b
cmd/zstream/zstream.h
Outdated
|
||
extern int zstream_do_redup(int, char *[]); | ||
extern int zstream_do_dump(int, char *[]); | ||
extern void usage(void); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having usage defined in a header is slightly awkward, since if some other program wants to include this header it may conflict with how they want to define their own usage function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think another program can include this header. It isn't installed (hence needing to use quotes to include it), and the functions aren't compiled into a library. It's only used by zstream. But I'll go ahead and rename it to zstream_usage()
.
cmd/zstream/zstream_redup.c
Outdated
} | ||
|
||
fletcher_4_init(); | ||
int err = zfs_redup_stream(fd, STDOUT_FILENO, verbose); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we print error messages here for known error cases like ESPIPE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this isn't a library, I think we can actually remove the ESPIPE check. I think that the other error cases generally print to stderr and then exit. We could even make zfs_redup_stream() return void
. (And if this function is used incorrectly, with a non-seekable fd, sfread()
will print and exit.)
Codecov Report
@@ Coverage Diff @@
## master #10156 +/- ##
==========================================
- Coverage 79.36% 78.95% -0.42%
==========================================
Files 385 387 +2
Lines 122589 122789 +200
==========================================
- Hits 97290 96943 -347
- Misses 25299 25846 +547
Continue to review full report at Codecov.
|
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such streams) are deprecated. Deduplicated send streams can be received by first converting them to non-deduplicated with the `zstream redup` command. This commit removes the code for sending and receiving deduplicated send streams. `zfs send -D` will now print a warning, ignore the `-D` flag, and generate a regular (non-deduplicated) send stream. `zfs receive` of a deduplicated send stream will print an error message and fail. The resulting code simplification (especially in the kernel's support for receiving dedup streams) should help enable future performance enhancements. Several new tests are added which leverage `zstream redup`. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #7887 Issue #10117 Issue #10156 Closes #10212
Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely. The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`. The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.) For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated). For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream can not be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record). This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup". Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes openzfs#10124 Closes openzfs#10156 (cherry picked from commit c618f87)
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such streams) are deprecated. Deduplicated send streams can be received by first converting them to non-deduplicated with the `zstream redup` command. This commit removes the code for sending and receiving deduplicated send streams. `zfs send -D` will now print a warning, ignore the `-D` flag, and generate a regular (non-deduplicated) send stream. `zfs receive` of a deduplicated send stream will print an error message and fail. The resulting code simplification (especially in the kernel's support for receiving dedup streams) should help enable future performance enhancements. Several new tests are added which leverage `zstream redup`. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue openzfs#7887 Issue openzfs#10117 Issue openzfs#10156 Closes openzfs#10212 (cherry picked from commit 196bee4)
Deduplicated send and receive is deprecated. To ease migration to the new dedup-send-less world, the commit adds a `zstream redup` utility to convert deduplicated send streams to normal streams, so that they can continue to be received indefinitely. The new `zstream` command also replaces the functionality of `zstreamdump`, by way of the `zstream dump` subcommand. The `zstreamdump` command is replaced by a shell script which invokes `zstream dump`. The way that `zstream redup` works under the hood is that as we read the send stream, we build up a hash table which maps from `<GUID, object, offset> -> <file_offset>`. Whenever we see a WRITE record, we add a new entry to the hash table, which indicates where in the stream file to find the WRITE record for this block. (The key is `drr_toguid, drr_object, drr_offset`.) For entries other than WRITE_BYREF, we pass them through unchanged (except for the running checksum, which is recalculated). For WRITE_BYREF records, we change them to WRITE records. We find the referenced WRITE record by looking in the hash table (for the record with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading the record header and payload from the specified offset in the stream file. This is why the stream can not be a pipe. The found WRITE record replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`, and `drr_offset` fields changed to be the same as the WRITE_BYREF's (i.e. we are writing the same logical block, but with the data supplied by the previous WRITE record). This algorithm requires memory proportional to the number of WRITE records (same as `zfs send -D`), but the size per WRITE record is relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to "redup". Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes openzfs#10124 Closes openzfs#10156
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such streams) are deprecated. Deduplicated send streams can be received by first converting them to non-deduplicated with the `zstream redup` command. This commit removes the code for sending and receiving deduplicated send streams. `zfs send -D` will now print a warning, ignore the `-D` flag, and generate a regular (non-deduplicated) send stream. `zfs receive` of a deduplicated send stream will print an error message and fail. The resulting code simplification (especially in the kernel's support for receiving dedup streams) should help enable future performance enhancements. Several new tests are added which leverage `zstream redup`. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue openzfs#7887 Issue openzfs#10117 Issue openzfs#10156 Closes openzfs#10212
Motivation and Context
Deduplicated send and receive is deprecated. To ease migration to the
new dedup-send-less world, the commit adds a
zstream redup
utility toconvert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.
#10124
Description
The new
zstream
command also replaces the functionality ofzstreamdump
, by way of thezstream dump
subcommand. Thezstreamdump
command is replaced by a shell script which invokeszstream dump
.The way that
zstream redup
works under the hood is that as we read thesend stream, we build up a hash table which maps from
<GUID, object, offset> -> <file_offset>
.Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is
drr_toguid, drr_object, drr_offset
.)For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).
For WRITE_BYREF records, we change them to WRITE records. We find the
referenced WRITE record by looking in the hash table (for the record
with key
drr_refguid, drr_refobject, drr_refoffset
), and then readingthe record header and payload from the specified offset in the stream
file. This is why the stream can not be a pipe. The found WRITE record
replaces the WRITE_BYREF record, with its
drr_toguid
,drr_object
,and
drr_offset
fields changed to be the same as the WRITE_BYREF's(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).
This algorithm requires memory proportional to the number of WRITE
records (same as
zfs send -D
), but the size per WRITE record isrelatively low (40 bytes, vs. 72 for
zfs send -D
). A 1TB send streamwith 8KB blocks (
recordsize=8k
) would use around 5GB of RAM to"redup".
The new manpage is reproduced here:
How Has This Been Tested?
Manual testing. I'd also like to add some tests to the ZTS.
Types of changes
Checklist:
Signed-off-by
.