fix(grub): adapt rd.retry to also trigger initqueue timeout tasks #11

Merged: 1 commit, Nov 12, 2021

Conversation

qby-wenzel
Contributor

Bugfix description

The default value of rd.retry is 180 seconds, although the man page states a wrong, or rather outdated, value. As long as rd.timeout is well below this value, the initqueue timeout hook of dracut will never be triggered. It is calculated by the formula:

    initqueue timeout = (2 * rd.retry) / 3

Note: Don't be fooled by the times-two multiplication of RDRETRY, the loop is triggered every 0.5 seconds ;)
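
For context, the relevant logic lives in dracut's dracut-initqueue.sh main loop. The following is a simplified, paraphrased sketch of that behaviour (not the exact upstream code; run_timeout_hooks is a placeholder name):

    # Simplified sketch of dracut's initqueue main loop (paraphrased, not verbatim).
    RDRETRY=$(getarg rd.retry -d 'rd_retry=')
    RDRETRY=${RDRETRY:-180}        # real default: 180 s (the man page value is outdated)
    RDRETRY=$((RDRETRY * 2))       # the loop below ticks every 0.5 s, hence times two

    main_loop=0
    while :; do
        # ... process udev events and pending initqueue jobs ...

        if [ $main_loop -ge $((2 * RDRETRY / 3)) ]; then
            run_timeout_hooks      # placeholder: runs the initqueue/timeout hooks, e.g. /sbin/mdraid_start
        fi

        main_loop=$((main_loop + 1))
        sleep 0.5
    done

With the default rd.retry=180 the timeout hooks therefore only start after (2 * 360) / 3 = 240 half-second ticks, i.e. 120 seconds.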

In the following scenario, a system boot will fail because rd.timeout has been lowered to 60 seconds.

Error scenario

A system with two RAID1 partitions (md0 + md1). One disk fails and the system tries to boot:

  • udev discovers the RAID partitions and starts to assemble them incrementally
  • both RAID partitions stay degraded because the second disk is missing
  • rd.retry never expires to trigger the initqueue hook, so /sbin/mdraid_start is never executed to fix the problem (the arithmetic sketch below makes this concrete)
  • dracut times out and falls into the emergency shell
  • manually starting the degraded RAID or executing the mdraid_start script is necessary
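
To make the numbers concrete, here is the arithmetic with the default rd.retry and the health-checker's rd.timeout (illustrative only; any non-default value shown is an example, not necessarily what this PR sets):

    # Illustrative arithmetic only, using the formula above.
    rd_retry=180; rd_timeout=60
    echo "initqueue timeout: $((2 * rd_retry / 3)) s, rd.timeout: ${rd_timeout} s"
    # -> initqueue timeout: 120 s, rd.timeout: 60 s
    # dracut gives up and drops to the emergency shell at 60 s, long before the
    # timeout hooks (and with them /sbin/mdraid_start) would have run at 120 s.
    # Keeping rd.timeout=60 therefore needs rd.retry well below 90; for example,
    # rd.retry=30 would give an initqueue timeout of 20 s (example value only).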

NB: There is currently a bug in dracut which prevents starting a system with two or more degraded RAID partitions. A PR has already been submitted.

The default value of rd.retry is 180 seconds. As long as the
rd.timeout is way below this value, the initqueue timeout hook
of dracut will never be triggered:
initqueue timeout = (2 * rd.retry) / 3
@laenion
Copy link
Contributor

laenion commented Nov 11, 2021

The default rd.retry value mentioned in the dracut.cmdline man page is indeed wrong - thanks for noticing and for the fix!
Do you want to open a pull request to fix the dracut documentation yourself?

Regarding the fix, I'm wondering whether it would make sense to increase the value of rd.timeout instead of lowering rd.retry - I guess when the default value of rd.retry was increased, it may have been because of hardware that really requires that much time...

@qby-wenzel
Contributor Author

@laenion Not sure which hardware needs that much time to boot up - three minutes is quite a lot. Network access? Anyway, maybe it's a case of better safe than sorry. The other way around, I was wondering similarly about the 60 seconds set by the health-checker: why lower it specifically to 60 seconds? I guess when it comes to timeouts, there is no perfect value to set.

If we agree on increasing rd.timeout instead, it should be higher than the current value of rd.retry, as stated in the man page ("Note that this timeout should be longer than rd.retry to allow for proper configuration."). Maybe it's even better to still set both values, to have future changes in dracut covered - although the last change was in 2014 ;)

Well, what about these: rd.timeout=210 and rd.retry=180?
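
For illustration, such values would normally end up on the kernel command line via the GRUB defaults file; a sketch assuming the usual /etc/default/grub mechanism, not necessarily the exact file or values this PR touches:

    # /etc/default/grub -- illustrative sketch only.
    GRUB_CMDLINE_LINUX_DEFAULT="... rd.timeout=210 rd.retry=180"
    # Regenerate the config afterwards, e.g.:
    #   grub2-mkconfig -o /boot/grub2/grub.cfg

With those values the initqueue timeout hooks would fire at (2 * 180) / 3 = 120 seconds, well before rd.timeout=210 expires.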

And ... I guess I forgot about fixing the man page; I just fixed the bug which breaks the RAID recovery. I have pushed another PR to fix the man page of dracut, too.

@laenion
Contributor

laenion commented Nov 12, 2021

I tried with the longer timeout/retry values and noticed that something else starts an emergency shell before rd.timeout would have been triggered, so that doesn't help. Let's stick with the shorter times; I'll merge your pull request as it is.

Regarding the 60 seconds: I used the (wrong) default value from the man page as a reference and just doubled it because of the "this timeout should be longer than rd.retry to allow for proper configuration" note.

@laenion laenion merged commit b28d6b0 into openSUSE:master Nov 12, 2021
@qby-wenzel
Contributor Author

Thanks for the merge and the explanation. Let's hope that the dracut fix will be merged soon, so that the current degraded-RAID failure situation gets resolved, too...
