fix(grub): adapt rd.retry to also trigger initqueue timeout tasks #11

Merged: 1 commit, Nov 12, 2021

Conversation

qby-wenzel
Contributor

Bugfix description

The default value of rd.retry is 180 seconds, although the man page states a wrong, or rather outdated, value. As long as rd.timeout is well below this value, the initqueue timeout hook of dracut will never be triggered. It is calculated by the formula:

    initqueue timeout = (2 * rd.retry) / 3

Note: Don't be fooled by the times-two multiplication of RDRETRY, the loop is triggered every 0.5 seconds ;)
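
For context, the relevant logic lives in dracut's dracut-initqueue.sh main loop. The following is a simplified, paraphrased sketch of that behaviour (not the exact upstream code; run_timeout_hooks is a placeholder name):

    # Simplified sketch of dracut's initqueue main loop (paraphrased, not verbatim).
    RDRETRY=$(getarg rd.retry -d 'rd_retry=')
    RDRETRY=${RDRETRY:-180}        # real default: 180 s (the man page value is outdated)
    RDRETRY=$((RDRETRY * 2))       # the loop below ticks every 0.5 s, hence times two

    main_loop=0
    while :; do
        # ... process udev events and pending initqueue jobs ...

        if [ $main_loop -ge $((2 * RDRETRY / 3)) ]; then
            run_timeout_hooks      # placeholder: runs the initqueue/timeout hooks, e.g. /sbin/mdraid_start
        fi

        main_loop=$((main_loop + 1))
        sleep 0.5
    done

With the default rd.retry=180 the timeout hooks therefore only start after (2 * 360) / 3 = 240 half-second ticks, i.e. 120 seconds.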

In the following scenario, a system boot will fail because rd.timeout has been lowered to 60 seconds.

Error scenario

A system with two RAID1 partitions (md0 + md1). One disk fails and the system tries to boot:

  • udev discovers the RAID partitions and starts to assemble them incrementally
  • both RAID partitions stay degraded because the second disk is missing
  • rd.retry never expires to trigger the initqueue hook, so /sbin/mdraid_start is never executed to fix the problem (the arithmetic sketch below makes this concrete)
  • dracut times out and falls into the emergency shell
  • manually starting the degraded RAID or executing the mdraid_start script is necessary
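
To make the numbers concrete, here is the arithmetic with the default rd.retry and the health-checker's rd.timeout (illustrative only; any non-default value shown is an example, not necessarily what this PR sets):

    # Illustrative arithmetic only, using the formula above.
    rd_retry=180; rd_timeout=60
    echo "initqueue timeout: $((2 * rd_retry / 3)) s, rd.timeout: ${rd_timeout} s"
    # -> initqueue timeout: 120 s, rd.timeout: 60 s
    # dracut gives up and drops to the emergency shell at 60 s, long before the
    # timeout hooks (and with them /sbin/mdraid_start) would have run at 120 s.
    # Keeping rd.timeout=60 therefore needs rd.retry well below 90; for example,
    # rd.retry=30 would give an initqueue timeout of 20 s (example value only).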

NB: There is currently a bug in dracut which prevents starting a system with two or more degraded RAID partitions. A PR has already been submitted.

The default value of rd.retry is 180 seconds. As long as the
rd.timeout is way below this value, the initqueue timeout hook
of dracut will never be triggered:
initqueue timeout = (2 * rd.retry) / 3
@laenion
Copy link
Contributor

laenion commented Nov 11, 2021

The default rd.retry value mentioned in the dracut.cmdline man page is indeed wrong - thanks for noticing and for the fix!
Do you want to open a pull request to fix the dracut documentation yourself?

Regarding the fix, I'm wondering whether it would make sense to increase the value of rd.timeout instead of lowering rd.retry - I guess when the default value of rd.retry was increased, it may have been because of hardware that really requires that much time...

@qby-wenzel
Contributor Author

@laenion Not sure which hardware needs that much time to boot up - three minutes is quite a lot. Network access? Anyway, maybe it's a case of better safe than sorry. The other way around, I was wondering similarly about the 60 seconds set by the health-checker: why lower it specifically to 60 seconds? I guess when it comes to timeouts, there is no perfect value to set.

If we agree on increasing rd.timeout instead, it should be higher than the current value of rd.retry, as stated in the man page ("Note that this timeout should be longer than rd.retry to allow for proper configuration."). Maybe it's even better to still set both values, to have future changes in dracut covered - although the last change was in 2014 ;)

Well, what about these: rd.timeout=210 and rd.retry=180?
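
For illustration, such values would normally end up on the kernel command line via the GRUB defaults file; a sketch assuming the usual /etc/default/grub mechanism, not necessarily the exact file or values this PR touches:

    # /etc/default/grub -- illustrative sketch only.
    GRUB_CMDLINE_LINUX_DEFAULT="... rd.timeout=210 rd.retry=180"
    # Regenerate the config afterwards, e.g.:
    #   grub2-mkconfig -o /boot/grub2/grub.cfg

With those values the initqueue timeout hooks would fire at (2 * 180) / 3 = 120 seconds, well before rd.timeout=210 expires.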

And ... I guess I forgot about fixing the man page; I just fixed the bug which breaks the RAID recovery. I have pushed another PR to fix the man page of dracut, too.

@laenion
Contributor

laenion commented Nov 12, 2021

I tried with the longer timeout/retry values and noticed that something else starts an emergency shell before rd.timeout would have been triggered, so that doesn't help. Let's stick with the shorter times; I'll merge your pull request as it is.

Regarding the 60 seconds: I used the (wrong) default value from the man page as a reference and just doubled it because of the "this timeout should be longer than rd.retry to allow for proper configuration" note.

@laenion laenion merged commit b28d6b0 into openSUSE:master Nov 12, 2021
@qby-wenzel
Contributor Author

Thanks for the merge and the explanation. Let's hope that the dracut fix will be merged soon, so that the current degraded-RAID failure situation gets resolved, too...
