
implement corruption correcting recv #9323

Closed
wants to merge 1 commit into master from healing_recv

Conversation

Contributor

@alek-p alek-p commented Sep 13, 2019

This patch implements a new type of zfs receive: corrective receive (-c). This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a sendfile, for example).
Metadata cannot be healed using a corrective receive.

This patch enables us to receive a send stream into an existing snapshot for the purpose of correcting data corruption.

Motivation and Context

In the past, in the rare cases where ZFS has experienced permanent data corruption, full recovery of the dataset(s) has not always been possible even if replicas existed.
This patch makes recovery from permanent data corruption possible.

Description

For every write and spill record in the send stream, we read the corresponding block from disk, and if that read fails with a checksum error, we overwrite that block with the data from the send stream.
After the data is healed, we reread the block to make sure it's healed, and remove the healed blocks from the corruption lists seen in zpool status.

To make sure we have correctly matched the data in the send stream to the right dataset to heal, there is a restriction that the GUID of the snapshot being received into must match the GUID in the send stream. There are likely several snapshots referring to the same potentially corrupted data, so there may be many snapshots satisfying this condition that are able to heal a single block.

The other thing to point out is that we can only correct data. Specifically, we are only able to heal records of type DRR_WRITE and DRR_SPILL, since those are the only ones that contain all of the data needed to recreate the damaged block.
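In rough terms, the per-record flow looks like the sketch below (illustrative only, condensed from the description above; the helper name heal_one_record() and the zio flag choice are made up here, and the zpool status error-list bookkeeping is omitted):

/*
 * Illustrative sketch of the per-record healing flow described above.
 */
static int
heal_one_record(objset_t *os, uint64_t obj, uint64_t offset, uint64_t lsize,
    abd_t *stream_abd, blkptr_t *bp)
{
    void *buf = kmem_alloc(lsize, KM_SLEEP);
    int err;

    /* Read the existing block; only a checksum failure is healable. */
    err = dmu_read(os, obj, offset, lsize, buf, DMU_READ_NO_PREFETCH);
    kmem_free(buf, lsize);
    if (err != ECKSUM)
        return (0);    /* block is healthy (or unreadable), skip it */

    /* Overwrite the damaged block in place with the stream data. */
    err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, stream_abd,
        BP_GET_PSIZE(bp), NULL, NULL, ZIO_PRIORITY_SYNC_WRITE,
        ZIO_FLAG_CANFAIL, NULL));
    if (err != 0)
        return (err);

    /* Re-read the block to confirm it is now healthy. */
    buf = kmem_alloc(lsize, KM_SLEEP);
    err = dmu_read(os, obj, offset, lsize, buf, DMU_READ_NO_PREFETCH);
    kmem_free(buf, lsize);
    return (err);
}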

How Has This Been Tested?

I've been running unit tests very similar to the test that I've added to the zfs-tests suite.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Contributor Author

alek-p commented Sep 13, 2019

The next logical extension, for part two of this work, is to provide a way for a corrupted pool to tell a backup system to generate a minimal send stream that would allow the corrupted pool to be healed.
The interface could be something like the following, but maybe there are better suggestions:

# dump the spa error list entries that are part of this snapshot, plus the snapshot guid
zfs send -C data/fs@snap > /tmp/errlist 

# on the replica system, generate a healing sendfile based on the error list
zfs send -cc /tmp/errlist backup_data > /tmp/healing_sendfile

# heal our data with the minimal healing sendfile
zfs recv -c data/fs@snap < /tmp/healing_sendfile

@alek-p alek-p added the labels Component: Send/Recv ("zfs send/recv" feature), Status: Code Review Needed (Ready for review and testing), and Status: Design Review Needed (Architecture or design is under discussion) on Sep 16, 2019
@alek-p alek-p force-pushed the healing_recv branch 2 times, most recently from f0859db to 3dd910a Compare September 17, 2019 21:52
Contributor

@behlendorf behlendorf left a comment


Using zfs recv in this way to correct damaged blocks is an interesting idea. I've left some initial comments, and we should be able to get you some additional feedback on the approach.


/* We can only heal write and spill records; other ones get ignored */
if (drr.drr_type != DRR_WRITE && drr.drr_type != DRR_SPILL)
goto cleanup;
Contributor


See comment below, but I'd suggest moving this check into the top of receive_process_record for healing receives.
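Something along these lines (a sketch of the suggestion, not an actual diff; rwa->heal is the flag added by this PR and the rest of the function is elided):

static int
receive_process_record(struct receive_writer_arg *rwa,
    struct receive_record_arg *rrd)
{
    /*
     * For healing receives, only write and spill records carry enough
     * data to reconstruct a block, so silently skip everything else.
     */
    if (rwa->heal && rrd->header.drr_type != DRR_WRITE &&
        rrd->header.drr_type != DRR_SPILL)
        return (0);

    /* ... existing record dispatch ... */
}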

DMU_READ_NO_PREFETCH);
kmem_free(buf, lsize);
if (err != ECKSUM)
goto cleanup; /* no corruption found */
Contributor


Functionally this looks right, but it is possible for dmu_read to return errors other than ECKSUM, for example EIO. Could you add a comment to clarify that it's also possible the block couldn't be read at all and was skipped?
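For example, the requested clarification could look something like this (wording is only a suggestion):

    err = dmu_read(os, obj, offset, lsize, buf, DMU_READ_NO_PREFETCH);
    kmem_free(buf, lsize);
    /*
     * ECKSUM means the block exists but is damaged and can be healed.
     * Any other error (e.g. EIO) means the block couldn't be read at
     * all, so it is skipped rather than healed.
     */
    if (err != ECKSUM)
        goto cleanup;    /* no corruption found, or block unreadable */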

break;
}
default:
ASSERT0(1);
Contributor


nit: this should be unreachable, but please go ahead and remove this debugging.


err = dmu_buf_hold_noread(os, obj, offset, FTAG, &dbp);
if (err != 0) {
err = SET_ERROR(EBUSY);
Contributor


Why EBUSY rather than returning the actual errno?

@@ -2577,7 +2775,10 @@ receive_writer_thread(void *arg)
* can exit.
*/
if (rwa->err == 0) {
rwa->err = receive_process_record(rwa, rrd);
if (rwa->heal)
rwa->err = receive_heal_record(rwa, rrd);
Contributor


Rather than add a new top-level receive_heal_record(), did you try moving this logic into receive_write() and receive_spill() respectively? This will allow you to leverage all of the existing stream sanity checks, and the cleanup logic which consumes the arc_buf automatically on error. The common bits at the end of receive_heal_record() which rewrite in place can be left in their own function which is called by both.
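That is, something shaped roughly like this (a sketch of the suggested structure; receive_heal_in_place() is a hypothetical name for the shared rewrite helper):

static int
receive_write(struct receive_writer_arg *rwa, struct drr_write *drrw,
    arc_buf_t *abuf)
{
    /* ... existing sanity checks on drrw and abuf ... */

    if (rwa->heal) {
        /* Shared in-place rewrite path, also called by receive_spill(). */
        return (receive_heal_in_place(rwa, drrw->drr_object,
            drrw->drr_offset, abuf));
    }

    /* ... normal receive path ... */
}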

module/zfs/spa_errlog.c (resolved comments)
module/zfs/zfs_ioctl.c (resolved comments)
{
log_must dd bs=512 count=1 if=garbage conv=notrunc \
oflag=sync of=$DEV_RDSKDIR/$DISK seek=$((($1 / 512) + (0x400000 / 512)))
}
Contributor


Good news, commit b63e2d8 added support to master to inject targeted damage into file blocks. Please use it to ensure this test is reliable.

corrupt_offset "$mid_offset"

log_must zpool scrub $TESTPOOL
log_must sleep 5 # let scrub finish
Contributor


Also added to master is the new zpool wait subcommand; you can now run log_must zpool wait -t scrub $TESTPOOL and avoid this ugliness.

log_assert "ZFS corrective receive should be able to heal corruption"

# we will use this as the source of corruption
log_must dd if=/dev/urandom of=garbage bs=512 count=1 oflag=sync
Contributor


You should be able to get rid of the garbage file after switching to the targeted injection.

Member

@ahrens ahrens left a comment


How does this interact with encryption and zfs send --raw? E.g. if the snapshot is encrypted, do they need to do a raw send, or does the receive encrypt it on the fly (using the crypt params in the BP)?

@@ -4487,7 +4487,7 @@ zfs_do_receive(int argc, char **argv)
nomem();

/* check options */
while ((c = getopt(argc, argv, ":o:x:dehnuvFsA")) != -1) {
while ((c = getopt(argc, argv, ":o:x:dehnuvFsAc")) != -1) {
Member


I think it's time to add --long-opts for zfs receive. It's really too bad we didn't do this from the beginning for all of the subcommands.
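For reference, long options for zfs receive could look something like this (a sketch only; the long option names here are illustrative, not part of this PR):

#include <getopt.h>

static const struct option recv_long_options[] = {
    {"corrective",  no_argument,    NULL,   'c'},
    {"resumable",   no_argument,    NULL,   's'},
    {"force",       no_argument,    NULL,   'F'},
    {0, 0, 0, 0}
};

    /* in zfs_do_receive(): */
    while ((c = getopt_long(argc, argv, ":o:x:dehnuvFsAc",
        recv_long_options, NULL)) != -1) {
        /* ... existing switch (c) ... */
    }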

Contributor Author


Part two of this work will likely need long opts, but I'd like to avoid adding them now.

boolean_t force, boolean_t resumable, boolean_t raw, int input_fd,
const dmu_replay_record_t *begin_record, int cleanup_fd,
boolean_t force, boolean_t heal, boolean_t resumable, boolean_t raw,
int input_fd, const dmu_replay_record_t *begin_record, int cleanup_fd,
Member


I think that we don't want to change existing libzfs_core function signatures if we can help it. Let's add a new function instead.
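For example, a separate entry point along these lines (a sketch; the name and exact parameter list are assumptions that simply mirror the diff above, not an agreed-upon API):

/*
 * Hypothetical new libzfs_core function, added alongside the existing
 * receive entry point rather than changing its signature.
 */
int
lzc_receive_with_heal(const char *snapname, nvlist_t *props,
    const char *origin, boolean_t force, boolean_t heal, boolean_t resumable,
    boolean_t raw, int input_fd, const dmu_replay_record_t *begin_record,
    int cleanup_fd, uint64_t *read_bytes, uint64_t *errflags,
    uint64_t *action_handle, nvlist_t **errors);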

* snapshot as the one we are trying to heal.
*/
struct drr_begin *drrb = drba->drba_cookie->drc_drrb;
error = dsl_dataset_hold_obj(dp, val, FTAG, &snap);
Member


if error is nonzero, shouldn't we be returning that?

@@ -361,12 +365,16 @@ recv_begin_check_existing_impl(dmu_recv_begin_arg_t *drba, dsl_dataset_t *ds,
if (dsl_dataset_has_resume_receive_state(ds))
return (SET_ERROR(EBUSY));

/* New snapshot name must not exist. */
/* New snapshot name must not exist if we're not healing it */
error = zap_lookup(dp->dp_meta_objset,
dsl_dataset_phys(ds)->ds_snapnames_zapobj,
drba->drba_cookie->drc_tosnap, 8, 1, &val);
Member


Since we're now using val as the snapshot's object number, how about renaming it to reflect that, e.g. snapobj

if (avl_is_empty(&spa->spa_errlist_healed)) {
mutex_exit(&spa->spa_errlist_lock);
return;
}
Member


I don't think this avl_is_empty case is needed - the right thing will happen below (i.e. the first call to avl_destroy_nodes will return NULL).
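That is, the usual consume-the-whole-tree pattern already handles the empty case (a sketch, assuming spa_error_entry_t nodes as in the other spa error lists):

    spa_error_entry_t *se;
    void *cookie = NULL;

    mutex_enter(&spa->spa_errlist_lock);
    /* avl_destroy_nodes() returns NULL immediately for an empty tree. */
    while ((se = avl_destroy_nodes(&spa->spa_errlist_healed,
        &cookie)) != NULL) {
        /* ... drop this entry from the corruption list ... */
        kmem_free(se, sizeof (spa_error_entry_t));
    }
    mutex_exit(&spa->spa_errlist_lock);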

goto cleanup;
}
blkid =
dbuf_whichblock(DB_DNODE((dmu_buf_impl_t *)dbp), 0, offset);
Member


You need to use DB_DNODE_ENTER/EXIT around the DB_DNODE(), or better yet dmu_buf_dnode_enter() as you do below. Or if you're going to cast the dbuf, you might as well just dereference db_blkid
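That is, one of these two variants (sketches of the safe forms being suggested):

    /* Variant 1: guard the dnode dereference. */
    dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;
    DB_DNODE_ENTER(db);
    blkid = dbuf_whichblock(DB_DNODE(db), 0, offset);
    DB_DNODE_EXIT(db);

    /* Variant 2: the dbuf already records its own block id. */
    blkid = ((dmu_buf_impl_t *)dbp)->db_blkid;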

buf = kmem_alloc(lsize, KM_SLEEP);
/* Try to read the object to see if it needs healing */
err = dmu_read(os, obj, offset, lsize, buf,
DMU_READ_NO_PREFETCH);
Member


Why no prefetching?


/* Correct the corruption in place */
err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size, NULL,
NULL, ZIO_PRIORITY_SYNC_WRITE, flags, NULL));
Member


Is there any check that the size is the same as the BP's psize (possibly rounded up to ashift)? Or at least that this can't write past the end of what's allocated for the BP?
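For example, something like this guard (a sketch of the kind of check being asked about, not code from the PR):

    /*
     * Refuse to rewrite if the stream data's size doesn't match the
     * BP's allocated physical size; writing a different size could
     * run past the end of the allocation.
     */
    if (size != BP_GET_PSIZE(bp)) {
        err = SET_ERROR(EINVAL);
        goto cleanup;
    }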

if (arc_get_compression(rrd->arc_buf) != BP_GET_COMPRESS(bp)) {
/*
* The compression in the stream doesn't match what we had
* on disk; we need to re-compress the buf into the
Member


Do we really need to handle this case? Seems like we could say that it needs to be a zfs send --compressed stream to use it with zfs recv -c. Plus, trying to recompress it and get the exact same byte stream adds more restrictions on changing the compression algorithm implementations, which we'd like to avoid. (cc @allanjude)


/* Correct the corruption in place */
err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size, NULL,
NULL, ZIO_PRIORITY_SYNC_WRITE, flags, NULL));
Member


Given the current functionality, we would expect there to be a LOT more records for non-corrupt blocks than for corrupt blocks. So I'm even more concerned about the synchronous (and not eligible for predictive prefetch) dmu_read()s used to determine whether we want to correct each block.

We'd typically (e.g. with >1 vdev) get better performance by simply always asynchronously zio_rewrite()ing every record (i.e. without checking if it's actually corrupt), since you'd have Nrecords async writes vs. the current PR's Nrecords sync reads. Plus the code would be a lot simpler.

I wonder if we should wait until the extensions (to send only the corrupt blocks) are ready, or at least design recv -c to work best in that mode, e.g. by assuming that nearly all records will be for corrupt blocks.

@alek-p alek-p force-pushed the healing_recv branch 2 times, most recently from 537c5bb to 90a196a Compare September 18, 2019 23:48
Contributor Author

alek-p commented Sep 18, 2019

Thanks for the reviews, guys. I've addressed most of the comments in the version I just pushed. I'm still thinking about the right way to do the rewrite I/O. Perhaps always rewriting (instead of reading first) is the right way to go, hmm...

It seems too big a limitation to impose to say we have to use send --compressed with corrective recv, so I'd like to keep the parts that are able to switch compression if possible.
Perhaps we can deal with the compression algos that don't recompress the same way separately from the other compression types, and only require those send streams to be compressed.


codecov bot commented Sep 19, 2019

Codecov Report

Merging #9323 into master will decrease coverage by 0.67%.
The diff coverage is 73.8%.


@@            Coverage Diff             @@
##           master    #9323      +/-   ##
==========================================
- Coverage   79.79%   79.12%   -0.68%     
==========================================
  Files         279      401     +122     
  Lines       81396   122676   +41280     
==========================================
+ Hits        64951    97065   +32114     
- Misses      16445    25611    +9166
Flag Coverage Δ
#kernel 79.77% <74.09%> (-0.02%) ⬇️
#user 66.63% <18.93%> (?)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor Author

alek-p commented Sep 20, 2019

After talking with my colleagues at @datto, we think it may be too dangerous to rewrite everything that we encounter in the send stream, as writing has the potential to do more damage in the cases where the corruption is coming from failing HW, for example.
We want to prioritize getting the dataset healthy, and not necessarily the performance of the healing. In theory, recv healing should be a rare event that does not need to be highly performant. Having said that, I'm still working on the way that I/O is done for corrective recv.

The other open question was about how to handle compression algos that may not recompress the same block the same way. The suggestion here, to avoid this problem, is that we will only heal when the checksum of the data to be used for healing matches the checksum already on disk.
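A sketch of that safeguard might look like the following (assuming zio_checksum_error_impl() is usable here, as it is for reconstruction verification elsewhere; this is not code from the current patch):

    /*
     * Only heal if the stream data's checksum matches the checksum
     * recorded in the on-disk block pointer, so differences in
     * recompression can never write mismatched data.
     */
    zio_bad_cksum_t zbc;
    if (zio_checksum_error_impl(os->os_spa, bp, BP_GET_CHECKSUM(bp),
        abd, BP_GET_PSIZE(bp), 0, &zbc) != 0)
        goto cleanup;    /* stream data doesn't match this block; skip */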

The last thing I'm still working on is to make sure raw and large block send streams are compatible with healing recv.

This patch implements a new type of zfs receive: corrective receive
(-c). This type of recv is used to heal corrupted data when a replica
of the data already exists (in the form of a sendfile for example).
Metadata cannot be healed using a corrective receive.

Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Member

ahrens commented Oct 1, 2019

Superseded by #9372

@ahrens ahrens closed this Oct 1, 2019
Labels
Component: Send/Recv ("zfs send/recv" feature), Status: Code Review Needed (Ready for review and testing), Status: Design Review Needed (Architecture or design is under discussion)

3 participants