VSHA-536 Zero md Superblocks #58
Merged
Conversation
Force-pushed from 84b70fd to 27c2777
heemstra approved these changes on Feb 19, 2023
Force-pushed from 27c2777 to 979bf5a
erl-hpe approved these changes on Feb 19, 2023
Force-pushed from 979bf5a to ba0b119
Force-pushed from 532fb4f to 2be44c7
Make all URLs printed by dracut-metal-mdsquash contain a commit hash. Remove the verbose `mount` and `umount` for prettier output. Update the `README.adoc` file with a better, more verbose explanation of the wipe process.
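For context, one way such a change can look (a minimal sketch, not the module's actual code; the `METAL_HASH` variable and the repository path are illustrative assumptions):

```shell
# Hypothetical sketch: pin any printed documentation URL to a known commit hash
# so the link matches the code that is actually running.
# METAL_HASH and the repository path below are assumptions, not the module's real values.
METAL_HASH='5e3b068'
echo "See https://github.com/<org>/dracut-metal-mdsquash/blob/${METAL_HASH}/README.adoc for more information."
```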
Force-pushed from 2be44c7 to 5e3b068
Summary and Scope

Issue Type: virtio (#56)
This change ensures the RAIDs are eradicated and made unrecognizable by erasing their superblocks in order to resolve sync timing problems between Vshasta and metal. The new logic explicitly stops the `md` devices, wipes their magic bits, and then eradicates the `md` superblocks on each disk.
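As a rough illustration of that wipe sequence (a minimal sketch, not the module's exact code; the device names are assumptions):

```shell
# Hypothetical example: a RAID /dev/md124 built from members /dev/sda and /dev/sdb.
md=/dev/md124
members="/dev/sda /dev/sdb"

# 1. Stop the running array so its members can be modified.
mdadm --stop "$md"

# 2. Wipe the filesystem/RAID magic bits from each member.
for disk in $members; do
    wipefs --all --force "$disk"
done

# 3. Erase the md superblock from each member so the array can never be reassembled.
for disk in $members; do
    mdadm --zero-superblock "$disk"
done
```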
During testing of VSHA-536 there was some fiddling with how the RAIDs were wiped to account for some peculiarities in the timings of how `virtio` synced and updated the kernel. The changes had been tested on metal without any observed problems, but in my recent series of tests some fatal inconsistencies were observed: `partprobe` was revealing `md` handles, which caused `mdadm` to restart/resume RAIDs that had been "nuked", and this in turn caused partitioning to fail.

This change also includes some minor fixes:
- The `wipefs` command for sd/nvme devices was not getting piped to the log file.
- The info printed when manually sourcing `/lib/metal-md-lib.sh` in a dracut shell is now left justified and aligned by colon.
- The extra `/sbin/metal-md-scan` call in `/sbin/metal-md-disks` is removed; it is no longer needed and shouldn't be invoked on every loop that calls `/sbin/metal-md-disks`.
- `metal-kdump.sh` no longer invokes `/sbin/metal-md-scan` under `root=kdump` because that script is already invoked by the initqueue (see `metal-genrules.sh`).
- All initqueue calls to `metal-md-scan` have been changed to `--unique` and `--onetime` to ensure they never have an opportunity to run forever (as witnessed during a kdump test of the LiveCD); see the sketch just below this list.
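For reference, this is roughly what such a registration looks like in a dracut generator script (a hedged sketch, not necessarily the exact line in `metal-genrules.sh`):

```shell
# Queue the scan once udev has settled; --unique prevents queuing the same job twice,
# and --onetime removes the job after it has run once, so it cannot loop forever.
/sbin/initqueue --settled --unique --onetime /sbin/metal-md-scan
```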
A note about the dependency on `mdraid-cleanup`:

It turns out relying on `mdraid-cleanup` was a bad idea. The `mdraid-cleanup` script only stops RAIDs; it does not remove any superblock (or remove the RAIDs, for that matter). This means there is a (small) possibility that the RAID and its members still exist when the `partprobe` command fires. The window of time in which this issue can occur is very small, and it varies. VShasta has not hit this error in the 10-20 deployments it has done in the past 3-4 days, and my 50+ test boots didn't hit it, but the past 10 NCN boots I just attempted hit it almost every time.
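To make the race concrete, here is a hedged demonstration (hypothetical device names) of why stopping an array is not enough on its own:

```shell
mdadm --stop /dev/md124          # roughly what mdraid-cleanup does: the array stops...
mdadm --examine /dev/sda         # ...but the md superblock is still on each member
partprobe /dev/sda               # re-reading partition tables can let the array be re-assembled

# Zeroing the superblock closes that window for good:
mdadm --zero-superblock /dev/sda
mdadm --examine /dev/sda         # now reports no md superblock on the device
```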
Prerequisites

Idempotency

Risks and Mitigations