Dracut module for providing persistent storage and redundancy for booting squashFS images.

METAL 90mdsquash - redundant squashFS and overlayFS storage

This repository hosts a dracut module for creating a local overlayFS that provides persistent storage for a Live image.

The created overlayFS is customizable; by default it is built on top of a mirrored MD RAID array.

This module assumes authority and will wipe and partition a server from within the initramFS.

Additionally, the partitioned RAID array will contain an empty LVM volume group, which is useful for allowing processes such as cloud-init to further customize the server once it boots.

For more information on Dracut, the initramFS creation tool, see here: https://github.com/dracutdevs/dracut

Requirements

In order to use this dracut module, you need:

  1. A locally attached USB device or a remote server hosting a squashFS image.

  2. Physical block devices installed in your blade(s).

  3. Two physical disks of 0.5 TiB or less (or the RAID must be overridden; see the metal-mdsquash customizations section).

Usage

Specify a local disk device for storing images and the URL/endpoint to fetch images from on the kernel command line (e.g. via grub.cfg or an iPXE script).

metal.server=<URI> root=live:LABEL=SQFSRAID

The above snippet is the minimal cmdline necessary for this module to function. Additional options are denoted throughout the module customization section.
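These parameters ride on the kernel line of a grub.cfg menu entry or an iPXE script. A minimal sketch, with placeholder kernel/initrd paths and an illustrative server URI (reusing the http example from the URI Drivers section below):

    # Placeholder grub.cfg entry; adjust the paths and the metal.server URI for your environment.
    linux /boot/kernel metal.server=http://pit/ root=live:LABEL=SQFSRAID
    initrd /boot/initrd.img.xz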

URI Drivers

The URI scheme, authority, and path values tell the module where to look for the squashFS image. Only those parts of a URI are supported.

  • file driver

    # Example: Load from another USB stick on the node.
    file://mydir?LABEL=MYUSB
    # Example: load a file from the root of an attached disk with the label SQFSRAID
    file://?LABEL=SQFSRAID
    # Example: load a file from a PIT recovery USB.
    file://ephemeral/data/ceph/?LABEL=PITDATA
  • http or https driver

    http://api-gw-service/path/to/service/endpoint
    http://pit/
    http://10.100.101.111/some/local/server

Other drivers, such as native s3, scp, and ftp, could be added but are not currently implemented.

These driver schemes are all defined by the rule generator, metal-genrules.sh.
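These URIs are passed to the module through metal.server; for example, reusing the drivers shown above (both values are illustrative):

    # file driver: load the squashFS image from a local disk labeled SQFSRAID
    metal.server=file://?LABEL=SQFSRAID
    # http driver: load the squashFS image from a web server
    metal.server=http://10.100.101.111/some/local/server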

Kernel Parameters

metal-mdsquash customizations

metal.debug

Set metal.debug=1 to enable debug output from only the metal modules. This will verbosely print the creation of the RAIDs and the fetching of the squashFS image. It effectively runs all dracut-metal code with set -x, while leaving the rest of dracut at its own debug level.

  • Default: 0

metal.disks

Specify the number of disks to use in the local RAID (see metal.md-level for changing the RAID type).

  • Default: 2

metal.md-level

Change the RAID level passed to mdadm for array creation; possible values are any level that mdadm accepts. Mileage varies, and buyer beware: changing this could dig a deeper hole.

  • Default: mirror

Note
When metal.disks=1 is set, a RAID array is still created but with only one member. In this case, only mirror and stripe will produce a usable single-member array.
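As an illustration, a three-disk striped array could be requested as follows; this is roughly equivalent to an mdadm --create with --level=stripe --raid-devices=3, though the exact invocation is handled by the module:

    # Three-disk stripe instead of the default two-disk mirror (illustrative values)
    metal.disks=3 metal.md-level=stripe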

metal.no-wipe

Determines whether the wipe function should run. metal.no-wipe=0 will wipe block devices and make them ready for partitioning; metal.no-wipe=1 will disable this behavior.

  • Default: 0

Note that a warning will print with a timeout during which the user may power the node off to avoid a wipe. This timeout can be adjusted; see metal.wipe-delay.

The following storage items are removed and/or prepared for partitioning as a raw disk:

  1. LVMs (specifically 'vg_name=~ceph*' and 'vg_name=~metal*')

    • This removes any CEPH volumes

    • Any volume group prefixed with metal is considered related to this module and will be removed

      • Volumes are removed with vgremove

  2. /dev/md devices

    • MD Devices are stopped

    • Magic bits erased

    • Each member's superblock is zeroed

  3. /dev/sd and /dev/nvme devices

    • Magic bits erased

  4. Any/all USB devices are ignored

  5. Any/all devices smaller than metal.min-disk-size*1024**3 bytes are ignored (see metal.min-disk-size)

  6. partprobe is invoked to update/notify the kernel of the partition changes

  7. Any LVMs that weren’t on a device that was wiped will still exist, since only specific LVMs are targeted
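Roughly, this sequence corresponds to commands like the following sketch; the device names are illustrative, and the module's actual implementation lives in metal-md-lib.sh:

    # Simplified sketch of the wipe sequence; /dev/md127, /dev/sda, and /dev/sdb are illustrative.
    vgremove --force --select 'vg_name=~ceph*'    # remove CEPH volume groups
    vgremove --force --select 'vg_name=~metal*'   # remove metal volume groups
    mdadm --stop /dev/md127                       # stop MD devices
    mdadm --zero-superblock /dev/sda /dev/sdb     # zero each member's superblock
    wipefs --all --force /dev/sda /dev/sdb        # erase magic bits on the raw disks
    partprobe                                     # notify the kernel of the partition changes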

Example output of a wipe running
Warning: local storage device wipe [ safeguard: DISABLED ]
Warning: local storage devices WILL be wiped (https://github.com/Cray-HPE/dracut-metal-mdsquash/tree/7d303b3193619f642b1316ce2b1968ee1cc82a69#metalno-wipe)
Warning: local storage device wipe commencing ...
Warning: local storage device wipe ignores USB devices and block devices less then or equal to [17179869184] bytes.
Warning: nothing can be done to stop this except one one thing ...
Warning: power this node off within the next [5] seconds to cancel.
Warning: NOTE: this delay can be adjusted, see: https://github.com/Cray-HPE/dracut-metal-mdsquash/tree/7d303b3193619f642b1316ce2b1968ee1cc82a69#metalwipe-delay)
  Found volume group "metalvg0" using metadata type lvm2
  Found volume group "ceph-ec4a2c46-e0ab-4f89-b7dc-6c044ce9a24b" using metadata type lvm2
  Found volume group "ceph-2c5c9402-7bc2-4a8c-8eba-028532b91d9f" using metadata type lvm2
  Found volume group "ceph-a38bb9f7-99ef-4536-82cf-2550a406da38" using metadata type lvm2
  Found volume group "ceph-c1e6018e-6a50-4b17-a15d-b387ae66b8a4" using metadata type lvm2
  VG                                        #PV #LV #SN Attr   VSize    VFree
  ceph-2c5c9402-7bc2-4a8c-8eba-028532b91d9f   1   1   0 wz--n-   <1.75t      0
  ceph-a38bb9f7-99ef-4536-82cf-2550a406da38   1   1   0 wz--n-   <1.75t      0
  ceph-c1e6018e-6a50-4b17-a15d-b387ae66b8a4   1   1   0 wz--n- <447.13g      0
  ceph-ec4a2c46-e0ab-4f89-b7dc-6c044ce9a24b   1   1   0 wz--n-   <1.75t      0
  metalvg0                                    1   3   0 wz--n-  279.14g 149.14g
Warning: removing all volume groups of name [vg_name=~ceph*]
  Failed to clear hint file.
  Logical volume "osd-block-a8c05059-d921-4546-884d-f63f606f966c" successfully removed
  Volume group "ceph-ec4a2c46-e0ab-4f89-b7dc-6c044ce9a24b" successfully removed
  Logical volume "osd-block-d70a9ddd-9b8c-42e0-98cb-5f5279dcef5a" successfully removed
  Volume group "ceph-2c5c9402-7bc2-4a8c-8eba-028532b91d9f" successfully removed
  Logical volume "osd-block-d2e9e4cf-c670-418f-847e-39ade3208d04" successfully removed
  Volume group "ceph-a38bb9f7-99ef-4536-82cf-2550a406da38" successfully removed
  Logical volume "osd-block-b6085667-54dc-4e01-810b-25c093a510dc" successfully removed
  Volume group "ceph-c1e6018e-6a50-4b17-a15d-b387ae66b8a4" successfully removed
Warning: removing all volume groups of name [vg_name=~metal*]
  Failed to clear hint file.
  Logical volume "CEPHETC" successfully removed
  Logical volume "CEPHVAR" successfully removed
  Logical volume "CONTAIN" successfully removed
  Volume group "metalvg0" successfully removed
Warning: local storage device wipe is targeting the following RAID(s): [/dev/md124 /dev/md125 /dev/md126 /dev/md127]
Warning: local storage device wipe is targeting the following block devices: [/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf]
Warning: local storage disk wipe complete
Found the following disks for the main RAID array (qty. [2]): [sda sdb]
mdadm: size set to 487360K
mdadm: array /dev/md/BOOT started.
mdadm: size set to 23908352K
mdadm: array /dev/md/SQFS started.
mdadm: size set to 146352128K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: array /dev/md/ROOT started.
mdadm: chunk size defaults to 512K
mdadm: array /dev/md/AUX started.

metal.wipe-delay

The number of seconds that the wipe function will wait to allow an administrator to cancel it (by powering the node off). See the source code in metal-md-lib.sh for the minimum and maximum values.

  • Default: 5

  • Unit: Seconds
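For example, to give an administrator more time to cancel the wipe, or to skip it entirely on a node that is already installed (illustrative values):

    metal.wipe-delay=30
    metal.no-wipe=1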

metal.ipv4

By default, metal-dracut will use IPv4 to resolve the deployment server for the initial call-to-home and when downloading artifacts, regardless of whether IPv6 networking is present in the environment. This is a safeguard against faulty or misconfigured IPv6 environments.

To disable this constraint, simply set metal.ipv4=0 in the cmdline. Setting 0 will enable IPv6 for this module.

  • Default: 1

metal.sqfs-md-size

Set the size of the new SQFS partition. Buyer beware: this does not resize an existing partition; it applies only to newly created partitions.

  • Default: 25

  • Unit: Gigabytes

metal.oval-md-size

Set the size of the new overlayFS (OVAL) partition. Buyer beware: this does not resize an existing partition; it applies only to newly created partitions.

  • Default: 150

  • Unit: Gigabytes

metal.aux-md-size

Set the size of the new AUX partition. Buyer beware: this does not resize an existing partition; it applies only to newly created partitions.

  • Default: 150

  • Unit: Gigabytes

metal.min-disk-size

Sets the minimum size threshold used when wiping and partitioning disks; anything less than or equal to this is left untouched.

  • Default: 16

  • Unit: Gigabytes

The value is converted to bytes (metal.min-disk-size*1024**3); all comparisons are done in this unit.
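For example, the default of 16 corresponds to the 17179869184-byte threshold printed in the example wipe output above (16*1024**3). The size parameters can also be set explicitly on the cmdline; the values below simply restate the documented defaults:

    metal.sqfs-md-size=25 metal.oval-md-size=150 metal.aux-md-size=150 metal.min-disk-size=16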

dmsquash-live customizations

rd.live.dir

Name of the directory to store and load the artifacts from. Changing this value will affect both metal and the native dracut modules.

  • Default: LiveOS

root

Specify the FSlabel of the block device to use for SQFS storage. This could be an existing RAID or a non-RAIDed device. If the label is not found in /dev/disk/by-label/*, then the OS disks are paved with a new mirror array. The device can also be referenced by UUID instead of a label.

  • Default: live:LABEL=SQFSRAID

rd.live.overlay

Specify the FSlabel of the block device to use for persistent storage. If a label is not found in /dev/disk/by-label/*, then the os-disks are paved. If this is specified, then rd.live.overlay=LABEL=<new_label_here> must also be specified.

  • Default: LABEL=ROOTRAID

rd.live.overlay.readonly

Make the persistent overlayFS read-only.

  • Default: 0

rd.live.overlay.reset

Reset the persistent overlayFS regardless of whether it is read-only. On the next boot the overlayFS will clear itself, and it will continue to clear itself on every reboot until this option is unset. This does not remake the RAID; it remakes the overlayFS. Metal only provides the underlying array and the parent directory structure necessary for an overlayFS to detect the array as compatible.

  • Default: 0

rd.live.overlay.size

Specify the size of the overlay in MB.

  • Default: 204800

rd.live.squashimg

Specify the filename of the squashFS image to download.

  • Default: rootfs
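Putting the dmsquash-live options together, a cmdline that restates the documented defaults explicitly would look like the following (none of these need to be set unless you are changing them):

    root=live:LABEL=SQFSRAID rd.live.dir=LiveOS rd.live.squashimg=rootfs rd.live.overlay=LABEL=ROOTRAID rd.live.overlay.size=204800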

dracut : standard customizations

rootfallback

This is the label of the partition to be used for a fallback bootloader.

  • Default: LABEL=BOOTRAID

RootFS and the Persistent OverlayFS

What is a Persistent Overlay?

The idea of persistence is that changes persist across reboots: when the state of the machine changes, that information is preserved. For servers that boot images into memory (also known as live images), an overlayFS is a common method for providing persistent storage.

The overlayFS created by this dracut module is used by the dmsquash-live module; all dracut live-image kernel parameters should function alongside this module.
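Conceptually, the result is an overlay mount that layers a writable upper directory from the persistent RAID over the read-only squashFS root. A minimal sketch of such a mount, with illustrative paths rather than the exact paths used by dmsquash-live:

    # Illustrative overlay mount: read-only squashFS lower layer, persistent upper/work directories
    mount -t overlay overlay \
        -o lowerdir=/run/rootfsbase,upperdir=/run/overlayfs/upper,workdir=/run/overlayfs/work \
        /sysroot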