Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disabled resilver_defer feature leads to looping resilvers #9299

Closed
KodyKantor opened this issue Sep 9, 2019 · 13 comments · Fixed by #9338
Closed

disabled resilver_defer feature leads to looping resilvers #9299

KodyKantor opened this issue Sep 9, 2019 · 13 comments · Fixed by #9338
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@KodyKantor
Copy link
Contributor

System information

smartos 20190718T005708Z

Describe the problem you're observing

When a disk is replaced with another on a pool with the resilver_defer feature present, but no enabled the resilver activity restarts during each spa_sync.

Describe how to reproduce the problem

  1. Create a pool of two-way mirrors on a system that does not have the resilver_defer feature.
  2. Update the system to one that supports resilver_defer, but do not enable the resilver_defer feature or otherwise upgrade the pool.
  3. Replace one side of the pool's mirror with another device.
  4. Watch resilver restart during each spa_sync (observed via DTrace on illumos).

The problem is described further here: https://smartos.org/bugview/OS-7982
A patch is available for SmartOS: TritonDataCenter/illumos-joyent@b67d873
After soaking in SmartOS for a short time the change will be upstreamed (including more code review) to illumos-gate as well.

This problem is also briefly described recently in #840, notably in @chrisrd's comment. This merits a separate ticket since the original issue reported in #840 is separate from this. If there's already a ticket open about this then we can close this ticket.

As noted in #840 upgrading the pool is a workaround for this issue for those willing to do so.

@ikozhukhov
Copy link
Contributor

ikozhukhov commented Sep 9, 2019

thanks for update!
i see the same problem with DilOS where on older pool (without resilver_defer feature) with replace of disk we have a loop with resilver.
but upgrade pool with enable of this feature fix this issue.
another issue with resilvering - releated to this feature:

  • create a pool with mirrored 3 vdev based on iscsi devices from different sources
  • produce non-stop writes to pool
  • offline one vdev on initiator and reboot one iscsi target host
  • when target host back - online vdev - wait a resilver process start
  • offline another vdev and reboot target
  • online vdev when target back
    of we try offline/online another vdev while one is in progress with scan/resilver -time to time we can see a loop with resilver of one vdev
    this issue can be fixed by zpool export <pool> -> zpool import <pool>
    after that, resilvering can be finished and will not starts in loop

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 9, 2019
@behlendorf
Copy link
Contributor

Thanks for digging in to this, the proposed fix makes sense to me but let's get @tcaputi's thoughts on it. It would be nice to add a test to the ZTS which covers this case.

@KodyKantor
Copy link
Contributor Author

I agree, a test would be great. Let me know if you have any thoughts about how we can do this sort of regression testing in an automated fashion. I don't think we're set up well for this in illumos yet.

@behlendorf
Copy link
Contributor

I'd suggest adapting one of the existing ZTS redundancy tests for this or just writing a new one. To automate this I'd suggest using zpool create -d to create a new pool with all of the feature flags disabled. Alternately, you could use zpool create -o feature@resilver_defer=disabled if you've ported e4010f2 and 9966754 to illumos. On the Linux side we run the full ZTS for each proposed change.

@ikozhukhov
Copy link
Contributor

ikozhukhov commented Sep 9, 2019

@behlendorf you can't reproduce this issue with new code and try to disable feature in creation time. for reproduce of this issue you have to create a pool based on old zfs code and load it to system with new zfs code with resilver_defer feature available. it is my experience with this issue, but maybe Kody has another examples.

@behlendorf
Copy link
Contributor

I see, I was under the impression the feature merely needed to be disabled. In the past we've handled this on a case-by-case basis by including a tiny pool created with the relevant code.

@tcaputi
Copy link
Contributor

tcaputi commented Sep 10, 2019

I believe that the proposed fix should solve the problem and is correct. Thanks @KodyKantor for the help here.

@KodyKantor
Copy link
Contributor Author

Great, thanks. I'll see if I can put a test together for this as well.

@tcaputi
Copy link
Contributor

tcaputi commented Sep 18, 2019

@KodyKantor how has testing for this come along? We would like to start using this patch if its correct.

@KodyKantor
Copy link
Contributor Author

Testing has been mixed. It appears Igor may be correct about how using the {{-d}} flag prevents the issue from occuring, but I'm not sure why that is. I also haven't been able to make a pool small enough to check in.

I'll get back to trying to write a good test for this tomorrow morning. If I can't figure something out by mid-day I'll just submit a PR as-is.

KodyKantor added a commit to KodyKantor/zfs that referenced this issue Sep 19, 2019
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982. Closes openzfs#9299.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
@KodyKantor
Copy link
Contributor Author

I submitted a PR for this. I didn't have time to get back to trying to write a test for this, sorry about that. I'll be away from my computer until September 30th, but by all means feel free to change this or integrate it in the meantime. If this is still up on the 30th I'll continue pushing this through.

@ikozhukhov
Copy link
Contributor

ikozhukhov commented Sep 22, 2019

i didn't test update in PR, i have prepared steps how to reproduce resilver loop.

  1. download http://apt2.dilos.org/dilos/zol/t1.dsk.gz
  2. decompress it: pizs -d t1.dsk.gz
  3. import it: zpool import -d <full path to dir with file> t1
  4. create 2 additional disks: truncate -s 2g d2.dsk && truncate -s 2g d3.dsk
  5. run simple script:
zpool attach t1 /ws/t1.dsk /ws/d2.dsk
zpool attach t1 /ws/t1.dsk /ws/d3.dsk
zpool status t1
  1. monitor status: watch zpool status t1

i can see resilver in loop - not finished
prepared test based on dilos-2.0.1.63 - where created pool and fill it by dd if=/dev/urandom - older release without ZoL features.
import pool to latest 2.0.2.39 RC1 with several ZoL features, included resilver_defer

@ikozhukhov
Copy link
Contributor

just tested proposed update based on TritonDataCenter/illumos-joyent@b67d873 - it fixed this resilvering loop.
@KodyKantor thank you very much for fix it!

behlendorf pushed a commit that referenced this issue Sep 22, 2019
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes #9299 
Closes #9338
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 23, 2019
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes openzfs#9299
Closes openzfs#9338
tonyhutter pushed a commit that referenced this issue Sep 26, 2019
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes #9299
Closes #9338
snajpa pushed a commit to vpsfreecz/zfs that referenced this issue Oct 19, 2019
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes openzfs#9299
Closes openzfs#9338
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants