CASMINST-5103 Better LVM logging #48

Merged: 2 commits merged into main on Aug 8, 2022
Conversation

@rustydb (Contributor) commented Jul 25, 2022

Summary and Scope

Issue Type

  • RFE Pull Request

CASMINST-5103 was difficult to debug because there was no logging output available at the moment of failure. Had the node booted in rd.debug or rd.info mode there might have been more to go on. However, we need more visibility during the initial boot, and that boot runs in neither rd.debug nor rd.info mode.

This change logs the discovered volume groups and each removed item to the console, which would make it much easier to diagnose CASMINST-5103 if it were to occur again. Example console output:

[   17.848377] dracut-initqueue[1490]:   Found volume group "metalvg0" using metadata type lvm2
[   17.908682] dracut-initqueue[1491]:   VG       #PV #LV #SN Attr   VSize   VFree
[   17.923919] dracut-initqueue[1491]:   metalvg0   1   3   0 wz--n- 279.14g 149.14g
[   17.944605] dracut-initqueue[1441]: Warning: removing all volume groups of name [vg_name=~ceph*]
[   17.964166] dracut-initqueue[1492]:   Failed to clear hint file.
[   18.028993] dracut-initqueue[1441]: Warning: removing all volume groups of name [vg_name=~metal*]
[   18.048173] dracut-initqueue[1514]:   Failed to clear hint file.
[   18.068130] dracut-initqueue[1514]:   Logical volume "CEPHETC" successfully removed
[   18.084107] dracut-initqueue[1514]:   Logical volume "CEPHVAR" successfully removed
[   18.104005] dracut-initqueue[1514]:   Logical volume "CONTAIN" successfully removed
[   18.120106] dracut-initqueue[1514]:   Volume group "metalvg0" successfully removed

Output from a clean boot on a worker node:

 Warning: local storage device wipe [ safeguard: DISABLED ]
 Warning: local storage device wipe commencing (USB devices are ignored) ...
 Warning: nothing can be done to stop this except one one thing ...
 Warning: ... power this node off within the next [5] seconds to prevent any and all operations ...
   Found volume group "metalvg0" using metadata type lvm2
   VG       #PV #LV #SN Attr   VSize   VFree
   metalvg0   1   1   0 wz--n- 279.14g 79.14g
 Warning: removing all volume groups of name [vg_name=~ceph*]
   Failed to clear hint file.
 Warning: removing all volume groups of name [vg_name=~metal*]
   Failed to clear hint file.
   Logical volume "CRAYS3CACHE" successfully removed
   Volume group "metalvg0" successfully removed
 Warning: local storage device wipe targeted devices: [/dev/sda /dev/md127 /dev/md124 /dev/md125 /dev/sdb /dev/sdc /dev/md126]
 Warning: local storage disk wipe complete
 mdadm: No arrays found in config file
 Found the following disks for the main RAID array (qty. [2]): [sdb sdc]
 mdadm: size set to 487360K
 mdadm: array /dev/md/BOOT started.
 mdadm: size set to 23908352K
 mdadm: array /dev/md/SQFS started.
 mdadm: size set to 146352128K
 mdadm: automatically enabling write-intent bitmap on large array
 mdadm: array /dev/md/ROOT started.
 mdadm: chunk size defaults to 512K
 mdadm: array /dev/md/AUX started.
 umount: /metal/ovaldisk unmounted
 Warning: Failed to ping URI host, pit ... (retry: 1)
 wicked: mgmt0: Request to acquire DHCPv4 lease with UUID 9064f162-98a1-0c00-fd07-000001000000
 wicked: mgmt0: Committed DHCPv4 lease with address 10.1.1.6 (lease time 1200 sec, renew in 600 sec, rebind in 1050 sec)

Output from a clean boot on a storage node:

Warning: local storage device wipe [ safeguard: DISABLED ]
Warning: local storage device wipe commencing (USB devices are ignored) ...
Warning: nothing can be done to stop this except one one thing ...
Warning: ... power this node off within the next [5] seconds to prevent any and all operations ...
  Found volume group "metalvg0" using metadata type lvm2
  Found volume group "ceph-d1846b1b-61f7-4fb2-b003-9dd47dc7775f" using metadata type lvm2
  Found volume group "ceph-8c71fcc4-2589-4bdd-ad2c-7f5f35a8ffdb" using metadata type lvm2
  Found volume group "ceph-55ee50a6-8130-468a-97cc-8c0d0589d2f8" using metadata type lvm2
  Found volume group "ceph-b9b6c6a9-c67a-4721-84d2-a777203c24a1" using metadata type lvm2
  VG                                        #PV #LV #SN Attr   VSize   VFree
  ceph-55ee50a6-8130-468a-97cc-8c0d0589d2f8   1   1   0 wz--n-  <1.75t      0
  ceph-8c71fcc4-2589-4bdd-ad2c-7f5f35a8ffdb   1   1   0 wz--n-  <1.75t      0
  ceph-b9b6c6a9-c67a-4721-84d2-a777203c24a1   1   1   0 wz--n-  <1.75t      0
  ceph-d1846b1b-61f7-4fb2-b003-9dd47dc7775f   1   1   0 wz--n-  <1.75t      0
  metalvg0                                    1   3   0 wz--n- 279.14g 149.14g
Warning: removing all volume groups of name [vg_name=~ceph*]
  Failed to clear hint file.
  Logical volume "osd-block-e3e0491c-c6c7-40a1-8ae0-4092c10e7834" successfully removed
  Volume group "ceph-d1846b1b-61f7-4fb2-b003-9dd47dc7775f" successfully removed
  Logical volume "osd-block-cbfdf774-b6c3-46b3-86e6-ec775fd3c331" successfully removed
  Volume group "ceph-8c71fcc4-2589-4bdd-ad2c-7f5f35a8ffdb" successfully removed
  Logical volume "osd-block-1e8099d5-923a-4cb5-a145-a868cd87eabc" successfully removed
  Volume group "ceph-55ee50a6-8130-468a-97cc-8c0d0589d2f8" successfully removed
  Logical volume "osd-block-1da85385-a2ff-46a3-825d-32ab0d35189f" successfully removed
  Volume group "ceph-b9b6c6a9-c67a-4721-84d2-a777203c24a1" successfully removed
Warning: removing all volume groups of name [vg_name=~metal*]
  Failed to clear hint file.
  Logical volume "CEPHETC" successfully removed
  Logical volume "CEPHVAR" successfully removed
  Logical volume "CONTAIN" successfully removed
  Volume group "metalvg0" successfully removed
Warning: local storage device wipe targeted devices: [/dev/sdb /dev/sdd /dev/sde /dev/sdf /dev/md126 /dev/md124 /dev/md127 /dev/sda /dev/sdc /dev/md125]
Warning: local storage disk wipe complete
mdadm: No arrays found in config file
Found the following disks for the main RAID array (qty. [2]): [sda sdc]
mdadm: size set to 487360K
mdadm: array /dev/md/BOOT started.
mdadm: size set to 23908352K
mdadm: array /dev/md/SQFS started.
mdadm: size set to 146352128K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: array /dev/md/ROOT started.
mdadm: chunk size defaults to 512K
mdadm: array /dev/md/AUX started.
umount: /metal/ovaldisk unmounted
Warning: Failed to ping URI host, pit ... (retry: 1)
wicked: mgmt0: Request to acquire DHCPv4 lease with UUID 9467f162-cfef-0d00-0006-000001000000
wicked: mgmt0: Committed DHCPv4 lease with address 10.1.1.11 (lease time 1200 sec, renew in 600 sec, rebind in 1050 sec)
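
A minimal sketch of the shape of the logging shown in the excerpts above, assuming the wipe (pave) step is a dracut shell hook that can use warn() from /lib/dracut-lib.sh and the standard lvm2 tools; the function name and filters here are illustrative, not the exact code in this change:

#!/bin/sh
# Illustrative sketch only: print which volume groups exist, then remove the
# ceph* and metal* groups so lvm2 echoes each removal to the console.
type warn > /dev/null 2>&1 || . /lib/dracut-lib.sh

log_and_remove_vgs() {
    vgscan    # emits the "Found volume group ..." lines
    vgs       # emits the VG/#PV/#LV summary table

    for vg_filter in 'vg_name=~ceph*' 'vg_name=~metal*'; do
        warn "removing all volume groups of name [${vg_filter}]"
        vgremove --select "${vg_filter}" --force
    done
}

In the excerpts above this all happens early in the initramfs, before the mdadm arrays are assembled.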

Prerequisites

  • I have included documentation in my PR (or it is not required)
  • I tested this on an internal system (if yes, please include results or a description of the test)
  • I tested this on a vshasta system (if yes, please include results or a description of the test)

Idempotency

Risks and Mitigations

@rustydb requested a review from a team as a code owner on July 25, 2022 15:22
@rustydb (Contributor, Author) commented Jul 25, 2022

This could change to also fail out and link to the wipe document if volume groups still exist.

Maybe even just trigger the emergency shell with directions for continuing the boot afterwards... but that's fancy.

@heemstra (Contributor) replied:

> This could change to also fail out and link to the wipe document if volume groups still exist.

Might be worth it. Otherwise, it'll continue to boot and the problem won't be obvious?

@rustydb changed the title from "CASMINST-5103 Better LVM logging" to "WIP CASMINST-5103 Better LVM logging" on Jul 29, 2022
Commit message:

The pave function doesn't report to the console whether or not any volume groups were found, let alone which logical volumes were removed.

This change adds discovered volume groups and removed items to the console.
@rustydb changed the title from "WIP CASMINST-5103 Better LVM logging" to "CASMINST-5103 Better LVM logging" on Aug 5, 2022
@rustydb (Contributor, Author) commented Aug 5, 2022

I tested the failure commands on two nodes that were already booted:

  • One node had the volume groups, and the if conditional correctly stated that they exist.
  • One node had 0 volume groups, and the if conditional correctly evaluated to false and moved on.

This could use a full boot test though.

Commit message:

If any given volume group that was deleted still exists, fail the boot with an error. In the error message, mention next steps.
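
A minimal sketch of what that post-wipe check could look like, assuming die() and warn() come from /lib/dracut-lib.sh; the selection criteria and message text are illustrative, not the exact code in this commit:

#!/bin/sh
# Illustrative sketch only: if any ceph* or metal* volume group survived the
# wipe, stop the boot and point at the disk-wipe documentation.
type die > /dev/null 2>&1 || . /lib/dracut-lib.sh

remaining_vgs="$(vgs --noheadings -o vg_name --select 'vg_name=~ceph*||vg_name=~metal*' 2>/dev/null)"
if [ -n "${remaining_vgs}" ]; then
    warn "volume groups still exist after the wipe: ${remaining_vgs}"
    die "local storage wipe failed; follow the disk wipe documentation to clean up before booting again"
fi

This is the "if conditional" referred to in the test comment above: on a node with leftover volume groups it reports them and fails, and on a node with none it evaluates to false and the boot continues.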
@rustydb merged commit 6a0fbaf into main on Aug 8, 2022
@rustydb deleted the CASMINST-5103 branch on August 8, 2022 19:48