Newly attached device is resilvered multiple times #9155
Comments
What's happening is that when the new device is attached,
When this logic runs for the new device, This can be reproduced by manually running |
Is there a way to stop the resilver loop once it starts? My server is currently resilvering over and over again, and each pass takes 24 or more hours. The server is very slow with all of the disk traffic, and it is causing a huge impact. I'm currently running 0.8.2 on kernel 4.19.78.
@hyegeek For this particular issue, the resilver will only restart if something is reopening the pool. In my case it was |
My resilver finally finished once I killed zed and allowed it to resilver two more times: once because zed was running at the beginning, and a second time to clean things up. At over 24 hours per resilver, this was painful. So this really seems like a bug. zed is needed (I think) to report when there are issues, yet if it is running, it keeps you from recovering. Am I missing something?
Yes, this is a bug. I'm not too familiar with this code, and I haven't had a chance to track down how this behavior was introduced and what the correct fix would be.
Thanks. Now that I know how to work around it, I can work on getting my blood pressure back down. 😀 Something else I've noticed about zed that might (or might not) be helpful in hunting this: on some of my systems, it seems that ZED causes the kernel to constantly re-enumerate the disks. The server I had the resilver problem with is one such system. Every few minutes my kernel log lists the disks as if it had just found them. The log entries stop if I kill zed.
If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if its max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue openzfs#840
Closes openzfs#9155
Closes openzfs#9378
Closes openzfs#9551
Closes openzfs#9588
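The check described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual OpenZFS change: the `toy_dtl_t` and `toy_scan_t` types and the `toy_resilver_restart_needed()` function are hypothetical names invented here, and treating "covered by the current scan range" as a simple comparison of txg bounds is an assumption about what the fix does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-ins for a device's DTL and the
 * in-progress scan. These are NOT the real ZFS structures.
 */
typedef struct toy_dtl {
	bool     empty;     /* true when the device is missing no writes */
	uint64_t min_txg;   /* oldest txg the device is missing */
	uint64_t max_txg;   /* newest txg the device is missing */
} toy_dtl_t;

typedef struct toy_scan {
	bool     resilvering; /* true while a resilver is running */
	uint64_t min_txg;     /* start of the txg range being repaired */
	uint64_t max_txg;     /* end of the txg range being repaired */
} toy_scan_t;

/*
 * Decide whether opening/reopening/probing a device should (re)start
 * a resilver. Assumption: a restart is pointless when the running
 * resilver already covers every txg in the device's DTL.
 */
static bool
toy_resilver_restart_needed(const toy_dtl_t *dtl, const toy_scan_t *scan)
{
	/* Nothing is missing, so there is nothing to resilver. */
	if (dtl->empty)
		return (false);

	/* Data is missing and no resilver is running: start one. */
	if (!scan->resilvering)
		return (true);

	/* A resilver is running; restart only if it does not cover the DTL. */
	bool covered = scan->min_txg <= dtl->min_txg &&
	    dtl->max_txg <= scan->max_txg;
	return (!covered);
}
```

In this model, a newly attached device whose DTL lies inside the running scan's range does not trigger a restart when the pool is reopened, which is the behavior the repeated resilvers in this issue were lacking.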
System information
Describe the problem you're observing
When a device is attached to a pool, it sometimes ends up being resilvered twice. A resilver will be kicked off, and when it completes, it will start all over again the next txg. This seems to happen about half the time.
Describe how to reproduce the problem
Create a pool with a bit of data in it
Then replace one of the devices in the pool
and watch the output of zpool status testpool 1. It will begin resilvering the new device, finish resilvering it, and then begin again.
If you are doing a replace, the old device is detached after the second resilver completes
This doesn't happen every time, but on my system it doesn't seem to take more than 2 or 3 attempts to be able to reproduce the issue.
Include any warning/errors/backtraces from the system logs