Newly attached device is resilvered multiple times #9155

Closed · jgallag88 opened this issue Aug 12, 2019 · 6 comments
Labels: Type: Defect (incorrect behavior, e.g. crash, hang)

jgallag88 (Contributor)
System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          4.15.0
Architecture          x86-64
ZFS Version           delphix@3eebcae
SPL Version           delphix@3eebcae

Describe the problem you're observing

When a device is attached to a pool, it sometimes ends up being resilvered twice. A resilver will be kicked off, and when it completes, it will start all over again in the next txg. This seems to happen about half the time.

Describe how to reproduce the problem

Create a pool with a bit of data in it:

$ sudo zpool create testpool xvdc
$ sudo dd if=/dev/urandom of=/testpool/file1 bs=1M count=4096

Then replace one of the devices in the pool:

$ sudo zpool replace testpool xvdc xvdb

and watch the output of zpool status testpool 1 (the trailing 1 refreshes the status every second). It will begin resilvering the new device:

  pool: testpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug 12 19:48:32 2019
        4.00G scanned at 455M/s, 1.28G issued at 145M/s, 4.00G total
        1.27G resilvered, 31.90% done, 0 days 00:00:19 to go
config:

        NAME           STATE     READ WRITE CKSUM
        testpool       ONLINE       0     0     0
          replacing-0  ONLINE       0     0     0
            xvdc       ONLINE       0     0     0
            xvdb       ONLINE       0     0     0  (resilvering)

It will finish resilvering the device:

  pool: testpool
 state: ONLINE
  scan: resilvered 4.01G in 0 days 00:00:32 with 0 errors on Mon Aug 12 19:49:04 2019
config:

        NAME           STATE     READ WRITE CKSUM
        testpool       ONLINE       0     0     0
          replacing-0  ONLINE       0     0     0
            xvdc       ONLINE       0     0     0
            xvdb       ONLINE       0     0     0

and then begin again:

  pool: testpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug 12 19:49:10 2019
        4.00G scanned at 4.00G/s, 112M issued at 112M/s, 4.00G total
        104M resilvered, 2.72% done, 0 days 00:00:35 to go
config:

        NAME           STATE     READ WRITE CKSUM
        testpool       ONLINE       0     0     0
          replacing-0  ONLINE       0     0     0
            xvdc       ONLINE       0     0     0
            xvdb       ONLINE       0     0     0  (resilvering)

If you are doing a replace, the old device is detached after the second resilver completes:

  pool: testpool
 state: ONLINE
  scan: resilvered 4.01G in 0 days 00:00:32 with 0 errors on Mon Aug 12 19:49:42 2019
config:

        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          xvdb      ONLINE       0     0     0

This doesn't happen every time, but on my system it rarely takes more than 2 or 3 attempts to reproduce the issue.

Include any warning/errors/backtraces from the system logs

jgallag88 (Contributor, Author)

What's happening is that when the new device is attached, zed receives an EC_DEV_STATUS.ESC_DEV_DLE event, which can cause it to reopen the pool. The reopen logic calls vdev_open(), which includes:

    /*
     * If a leaf vdev has a DTL, and seems healthy, then kick off a
     * resilver.  But don't do this if we are doing a reopen for a scrub,
     * since this would just restart the scrub we are already doing.
     */
    if (vd->vdev_ops->vdev_op_leaf && !spa->spa_scrub_reopen &&
        vdev_resilver_needed(vd, NULL, NULL)) {
        if (dsl_scan_resilvering(spa->spa_dsl_pool) &&
            spa_feature_is_enabled(spa, SPA_FEATURE_RESILVER_DEFER))
            vdev_set_deferred_resilver(spa, vd);
        else
            spa_async_request(spa, SPA_ASYNC_RESILVER);
    }

When this logic runs for the new device, vdev_resilver_needed() returns true, and because we just started a resilver when we attached the device, dsl_scan_resilvering() returns true as well. So we end up calling vdev_set_deferred_resilver(), which is what triggers the next resilver after the first one finishes.

This can be reproduced by manually running zpool reopen while a resilver is in progress.
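For example, using the testpool from the reproduction steps above (the reopen has to land while the resilver is still running for the restart to trigger):

$ sudo zpool replace testpool xvdc xvdb
$ sudo zpool reopen testpool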

hyegeek commented Oct 17, 2019

Is there a way to stop the resilver loop once it starts? My server is currently resilvering over and over again, and each pass is taking 24 or more hours. The server is very slow with all of the disk traffic, and it is causing a huge impact.

I'm currently running 0.8.2 on kernel 4.19.78.

jgallag88 (Contributor, Author)

@hyegeek For this particular issue, the resilver will only restart if something is reopening the pool. In my case it was zed, but it could also be something else.
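If zed does turn out to be the reopener, one possible stop-gap (assuming a systemd distribution where the ZED unit is named zfs-zed.service; the unit name can vary) is to stop ZED for the duration of the resilver and restart it afterwards:

$ sudo systemctl stop zfs-zed     # prevent event-driven reopens mid-resilver
$ sudo zpool status testpool      # wait for the resilver to complete
$ sudo systemctl start zfs-zed    # restore fault reporting

Note that ZED won't report faults or handle hotplug events while it is stopped.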

hyegeek commented Oct 21, 2019

My resilver finally finished once I killed zed and allowed it to resilver two more times: once because zed was running at the beginning, and a second time to clean things up. At over 24 hours per resilver, this was painful.

So this really seems like a bug. zed is needed (I think) to report when there are issues, yet if it is running, it keeps you from recovering.

Am I missing something?

jgallag88 (Contributor, Author)

Yes, this is a bug. I'm not too familiar with this code, and I haven't had a chance to track down how this behavior was introduced and what the correct fix would be.

hyegeek commented Oct 21, 2019

Thanks. Now that I know how to work around it, I can work on getting my blood pressure back down. 😀

Other things I've noticed about zed might (or might not) be helpful in hunting this down. On some of my systems, ZED seems to cause the kernel to constantly re-enumerate the disks. The server I had the resilver problem with is one such system: every few minutes my kernel log lists the disks as if it had just found them. The log entries stop if I kill zed.

tonyhutter pushed a commit that referenced this issue Jan 23, 2020
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if its max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840
Closes #9155
Closes #9378
Closes #9551
Closes #9588
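The guard the commit message describes amounts to a coverage check before restarting. Here is a hedged C sketch of that idea in the style of module/zfs/vdev.c; dsl_scan_resilvering() and the vdev.c helper vdev_dtl_max() are existing symbols, but this is an illustration of the check, not the committed patch:

    /*
     * Sketch: decide whether an in-progress scan already covers this
     * vdev's DTL. Assumes the caller has already established that the
     * DTL is non-empty (e.g. vdev_resilver_needed() returned true).
     */
    static boolean_t
    vdev_dtl_covered_by_scan(vdev_t *vd)
    {
        dsl_pool_t *dp = vd->vdev_spa->spa_dsl_pool;
        dsl_scan_t *scn = dp->dp_scan;

        /* No resilver running, so nothing can cover the DTL yet. */
        if (scn == NULL || !dsl_scan_resilvering(dp))
            return (B_FALSE);

        /*
         * If the newest txg in the DTL is at or below the highest txg
         * the in-progress scan will visit, the current resilver already
         * repairs this device and a restart (or deferral) is redundant.
         */
        return (vdev_dtl_max(vd) <= scn->scn_phys.scn_max_txg);
    }

With a helper like this, the vdev_open() block quoted earlier could skip both the spa_async_request(spa, SPA_ASYNC_RESILVER) and the vdev_set_deferred_resilver() paths when a reopen happens mid-scan.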