MTL-1819 Resolve Failed Wait Condition #40

Merged 1 commit on Jun 27, 2022

@rustydb commented Jun 27, 2022

Summary and Scope

Issue Type

  • Bugfix Pull Request

Currently, netboots with metal.no-wipe=1 succeed only intermittently. This PR fixes a race condition in which the k8s-master and k8s-worker modules never exit because they keep waiting for the wipe function to end.

metal-md-disks.sh had an oversight: when metal.no-wipe=1 was set, the /tmp/metalpave.done file was never created. As a result, the metal_paved function could not correctly report whether the pave had finished or had been skipped, and the dependent modules running on k8s-masters and k8s-workers never exited properly on netboots with metal.no-wipe=1.
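The failure mode described above can be illustrated with a minimal sketch. It is not the actual code from metal-md-disks.sh: the function names `pave_or_skip` and `is_paved` and the variable name `PAVE_DONE_FILE` are hypothetical, and only the marker path /tmp/metalpave.done and the metal.no-wipe=1 setting come from this PR. The point is that both the wipe path and the skip path have to create the marker, otherwise anything polling for it waits forever.

```shell
#!/bin/sh
# Hypothetical sketch, not the real metal-md-disks.sh.
# PAVE_DONE_FILE is an assumed variable name; the default path is from the PR.
PAVE_DONE_FILE="${PAVE_DONE_FILE:-/tmp/metalpave.done}"

# pave_or_skip NO_WIPE
# NO_WIPE=1 skips the wipe; any other value runs it.
# Both paths create the marker so that waiting modules are released.
pave_or_skip() {
    no_wipe="$1"
    if [ "$no_wipe" = "1" ]; then
        # The missing step described in the PR: record that the pave was skipped.
        echo "skipped" > "$PAVE_DONE_FILE"
        return 0
    fi
    # ... the real disk wipe would run here (omitted in this sketch) ...
    echo "done" > "$PAVE_DONE_FILE"
}

# is_paved: succeeds once the marker exists, whether the pave ran or was skipped.
# A dependent module could poll this before continuing.
is_paved() {
    [ -f "$PAVE_DONE_FILE" ]
}
```

A dependent module could then poll with something like `until is_paved; do sleep 1; done`, which terminates on both the wipe and the no-wipe path.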

This commit also includes a few minor tweaks:

  • Fixes the pave function's 5-second timer, which gave the user only 4 seconds because it compared with -gt instead of -ge.
  • Fixes log truncation in the pave function: writing the final 'pave done' line truncated the log file.
  • Adds a metal.wipe-delay kernel argument for changing the delay from 5 seconds to any value between 2 and 60 seconds.
  • Moves /tmp/metalpave.done into a global variable set in metal-lib.sh, making the path less prone to typos.
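The timer, delay, and logging tweaks can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: the function names `clamp_delay`, `countdown`, and `log_line` are hypothetical, clamping out-of-range values to the 2 to 60 second window is an assumption about how metal.wipe-delay is handled, and the use of `>` versus `>>` is an assumed cause of the log truncation.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names appear in the PR.

# clamp_delay [VALUE]: print a delay in seconds within 2..60.
# Missing or non-numeric input falls back to 5 (assumed behaviour).
clamp_delay() {
    delay="${1:-5}"
    case "$delay" in
        *[!0-9]*) delay=5 ;;
    esac
    if [ "$delay" -lt 2 ]; then
        delay=2
    elif [ "$delay" -gt 60 ]; then
        delay=60
    fi
    echo "$delay"
}

# countdown SECONDS [TICK_CMD]: print one line per remaining second.
# TICK_CMD defaults to "sleep 1"; passing "true" makes the loop instant.
countdown() {
    remaining="$1"
    tick="${2:-sleep 1}"
    # With -ge 1 the loop runs SECONDS times; with -gt 1 it would run one fewer,
    # which is the kind of off-by-one the PR describes.
    while [ "$remaining" -ge 1 ]; do
        echo "wiping in ${remaining}s"
        $tick
        remaining=$((remaining - 1))
    done
}

# log_line FILE MESSAGE: append with >> so earlier entries survive.
# Redirecting with > instead would truncate the log on every write.
log_line() {
    printf '%s\n' "$2" >> "$1"
}
```

For example, `countdown "$(clamp_delay 90)"` would count down from 60, because 90 falls outside the assumed window and is clamped.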

Prerequisites

  • I have included documentation in my PR (or it is not required)
  • I tested this on an internal system (if yes, please include results or a description of the test)
  • I tested this on a vshasta system (if yes, please include results or a description of the test)

Idempotency

Risks and Mitigations

This removes the risk of dependent dracut modules running indefinitely and stalling the boot.

@rustydb changed the title from "MTL-1819 Label the squashFS store during creation" to "MTL-1819 Resolve Failed Wait Condition" Jun 27, 2022
@rustydb force-pushed the MTL-1819-reboot-race-condition branch from cc5f0cf to 7663848 June 27, 2022 00:30
@rustydb requested review from a team, heemstra and jpdavis-prof June 27, 2022 01:18
@rustydb force-pushed the MTL-1819-reboot-race-condition branch 3 times, most recently from 544dc89 to 00e7e55 June 27, 2022 02:06
@rustydb force-pushed the MTL-1819-reboot-race-condition branch from 00e7e55 to f22590e June 27, 2022 15:20
@rustydb requested a review from heemstra June 27, 2022 15:20
@rustydb force-pushed the MTL-1819-reboot-race-condition branch 4 times, most recently from aaefe54 to eb17689 June 27, 2022 15:40
@rustydb force-pushed the MTL-1819-reboot-race-condition branch from eb17689 to cdb9892 June 27, 2022 17:03
@rustydb merged commit 2b92a48 into main Jun 27, 2022
@rustydb deleted the MTL-1819-reboot-race-condition branch June 27, 2022 17:34