Kdump support #1596

Merged: 6 commits, Jun 14, 2021


@arnaldo2792 arnaldo2792 commented May 22, 2021

Issue number:
#1413

Description of changes:

3883f7f7 os: Add kdump support
842604d7 kernel-5.4: add patches required for kdump support
c78ec8a2 systemd: move systemd mounts to preconfigured.target
1e0f6859 packages: add makedumpfile
1466a49f packages: add libelf
bcc5c903 packages: add kexec-tools

Kdump is a Linux feature that boots into a separate crash kernel whenever the system kernel panics. The crash kernel is loaded into a reserved space in memory determined by the crashkernel kernel parameter. In Bottlerocket, this parameter is set such that no memory will be reserved if the host has less than 2GB of memory.
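
As a rough illustration of how a range-style crashkernel value maps host memory to a reservation, here is a minimal sketch. It follows the `start-[end]:size` syntax from the kernel's kdump documentation; the value `2G-:256M` is an assumption chosen to match the behavior described above, not a value quoted from this PR, and `parse_size` and `reserved_for` are hypothetical helper names.

```rust
/// Parse a size such as "256M" or "2G" into bytes (suffixes K, M, G only).
fn parse_size(s: &str) -> Option<u64> {
    let s = s.trim();
    let (digits, mult) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1u64 << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1u64 << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s, 1),
    };
    digits.parse::<u64>().ok()?.checked_mul(mult)
}

/// Bytes reserved for the crash kernel, given a range-style `crashkernel=`
/// value and the amount of system RAM. Each entry is `start-[end]:size`;
/// the first range containing `ram` wins, and no match means nothing is
/// reserved. Returns `None` if the value cannot be parsed.
fn reserved_for(spec: &str, ram: u64) -> Option<u64> {
    // Drop an optional "@offset" suffix; it does not affect the size.
    let spec = spec.split('@').next()?;
    for entry in spec.split(',') {
        let (range, size) = entry.split_once(':')?;
        let (start, end) = range.split_once('-')?;
        let start = parse_size(start)?;
        let end = if end.is_empty() { u64::MAX } else { parse_size(end)? };
        // Start is inclusive, end is exclusive.
        if ram >= start && ram < end {
            return parse_size(size);
        }
    }
    Some(0)
}

fn main() {
    // Hypothetical value: reserve 256M only on hosts with at least 2G of RAM.
    let spec = "2G-:256M";
    for &ram_gib in &[1u64, 2, 8] {
        let reserved = reserved_for(spec, ram_gib << 30).unwrap_or(0);
        println!("{} GiB RAM: {} MiB reserved", ram_gib, reserved >> 20);
    }
}
```

With this assumed value, a 1 GiB host reserves nothing, while 2 GiB and larger hosts reserve 256 MiB.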

For Bottlerocket, the crash kernel is loaded from the current active boot partition. The configure-boot-mount.service systemd unit determines which boot partition is currently active and mounts it at /boot. This mount is read-only and uses private propagation, so new mount namespaces won't have access to it. As part of this change, SELinux labels are added to the boot partition when it is created by the rpm2img tool.

The load-crash-kernel.service systemd unit loads the crash kernel only if memory was reserved for it and the kernel.kexec_load_disabled setting is 0. The unit will exit gracefully if no memory was reserved for the crash kernel. For the moment, only the aws-dev and vmware variants use that kernel parameter.
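
The two conditions above can be sketched as a small check. This is only an illustration under assumptions: it reads the standard Linux interfaces `/sys/kernel/kexec_crash_size` and `/proc/sys/kernel/kexec_load_disabled`, and `should_load` is a hypothetical helper, not the function used in this PR.

```rust
use std::fs;

/// Decide whether the crash kernel should be loaded, given the raw contents
/// of the two kernel files. Kept pure so it can be checked without root.
fn should_load(crash_size: &str, load_disabled: &str) -> bool {
    // Unparseable or empty input counts as "no memory reserved".
    let reserved = crash_size.trim().parse::<u64>().unwrap_or(0);
    // Anything other than "0" counts as "kexec loading disabled".
    let disabled = load_disabled.trim() != "0";
    reserved > 0 && !disabled
}

fn main() {
    // A missing file is treated as "nothing reserved" / "loading disabled".
    let size = fs::read_to_string("/sys/kernel/kexec_crash_size").unwrap_or_default();
    let disabled = fs::read_to_string("/proc/sys/kernel/kexec_load_disabled")
        .unwrap_or_else(|_| "1".to_string());
    if should_load(&size, &disabled) {
        println!("memory reserved and kexec enabled: load the crash kernel");
    } else {
        println!("skipping crash kernel load");
    }
}
```

Treating unreadable files as "do not load" keeps the sketch on the graceful-exit path described above.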

The kernel.kexec_load_disabled setting used to be set in the sysctl.conf configuration file. With this change, the setting is set by the disable-kexec-load.service systemd unit. This unit runs after load-crash-kernel.service, even if the latter wasn't executed or exited with a non-zero code.

The capture-kernel-dump.service systemd unit is set as the target when the crash kernel is executed. It captures both the dmesg logs and the kdump-compressed dump excluding:
* Pages filled with zero
* Non-private cache pages
* All cache pages
* User process data pages
* Free pages
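
For readers unfamiliar with makedumpfile, the excluded page types correspond to bits of its `-d` dump level. The sketch below uses the bit values from the makedumpfile man page; the PR text does not state the level it passes, so the resulting value of 31 is an inference from the list above, and `Exclude` and `dump_level` are hypothetical names.

```rust
/// Page types that makedumpfile can leave out, with the bit each one
/// contributes to the `-d` dump level (values from the makedumpfile man page).
#[derive(Clone, Copy)]
enum Exclude {
    ZeroPages = 1,
    NonPrivateCache = 2,
    AllCache = 4,
    UserData = 8,
    FreePages = 16,
}

/// Combine the excluded page types into a single dump level.
fn dump_level(excluded: &[Exclude]) -> u8 {
    excluded.iter().fold(0, |level, e| level | *e as u8)
}

fn main() {
    let all = [
        Exclude::ZeroPages,
        Exclude::NonPrivateCache,
        Exclude::AllCache,
        Exclude::UserData,
        Exclude::FreePages,
    ];
    // Excluding all five page types gives the maximum level.
    println!("makedumpfile -d {}", dump_level(&all));
}
```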

All the files generated by the capture-kernel-dump.service unit are stored at /var/log/kdump, so the unit has a strong dependency on the following units to set up the persistent partition:

  • local-fs.target
  • systemd-sysusers.service
  • systemd-udevd.service
  • systemd-udev-trigger.service
  • systemd-tmpfiles-setup.service
  • systemd-tmpfiles-setup-dev.service

Since local-fs.target is a dependency of capture-kernel-dump.service, systemd will attempt to load all the mount units. To prevent this, the mount units will only be loaded during the execution of the preconfigured target.

No API is provided to enable/disable the dump collection, since the memory space is reserved either way and would be wasted if nothing used it. Dynamically changing the crashkernel command-line parameter isn't an option since we will provide support for secure boot in the future.

Testing done:
aws-dev x86_64/aarch64, vmware-dev/vmware-k8s-1.20 x86_64:

  • systemctl status didn't show failed units
  • Crashed the kernel with echo c > /proc/sysrq-trigger, and verified that the dumps/logs were generated
  • Verified that existing dump files were deleted
  • Verified that the correct active boot partition was mounted

k8s variant x86_64:

  • I ran a stress test to verify that 256MB of memory is enough to collect the kernel dumps, regardless of the size of the host and the workload it is running. For this test I used a custom build of the aws-k8s-1.20 variant on an m5.8xlarge EC2 instance with 110 pods running that only load random data and keep it there forever.
  • During the same test, I maxed out the number of volumes that can be attached to an instance, since udev was causing OOM errors before I limited the number of child processes that it can have. I verified that the dumps were generated properly during these tests:
[ec2-user@ip-192-168-0-174 log]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           122G        112G        7.2G         18M        2.8G        8.9G
Swap:            0B          0B          0B
-rw-------. 1 root root  59K May 21 00:14 dmesg.dump
-rw-------. 1 root root 790M May 21 00:14 kdump.dump
-rw-r--r--. 1 root root   77 May 21 00:14 prairie-dog.log

aws-ecs-1, aws-k8s-1.19 x86_64:

  • systemctl status didn't show failed units
  • Run nginx task/pod
  • Validated that kexec.service wasn't executed:
● kexec.service - Load crash kernel
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/kexec.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Tue 2021-06-01 21:38:07 UTC; 17min ago
             └─ ConditionKernelCommandLine=crashkernel was not met

Jun 01 21:38:04 localhost systemd[1]: Condition check resulted in Load crash kernel being skipped.
Jun 01 21:38:07 ip-192-168-72-37.us-west-2.compute.internal systemd[1]: Condition check resulted in Load crash kernel being skipped.
Jun 01 21:38:07 ip-192-168-72-37.us-west-2.compute.internal systemd[1]: Condition check resulted in Load crash kernel being skipped.

Custom aws-k8s-1.19 x86_64 build with crashkernel, to validate the 5.4 kernel behavior:

  • systemctl status didn't show failed units
  • Run nginx pod
  • Crashed the kernel with echo c > /proc/sysrq-trigger, and verified that the dumps/logs were generated

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@arnaldo2792
Contributor Author

  • Update the URL for libelf's spec file

@arnaldo2792
Contributor Author

  • Remove leftover systemd service

Comment on lines 298 to 299
// Load the panic kernel from `BOOT_MOUNT_PATH`, letting kexec decide which syscall
// it should use (KEXEC_LOAD or KEXEC_FILE_LOAD).
Contributor

Can we force it to always use KEXEC_FILE_LOAD?

Contributor Author

Yes we can, but for some reason the 5.4 kernel fails to use the kexec_file_load syscall, and it returns an ENOSUP error. The problem is not in the kexec library but rather in the kernel, so I need to debug the syscall to check what's returning the error. I'm already working with the kernel folks on this.

In the 5.10 kernel, kexec always uses kexec_file_load, so I think it is OK to use -a for now and let kexec decide what syscall to use, since we are only enabling kdump in variants that use the 5.10 kernel.

Contributor

Wouldn't we expect to enable this in other variants relatively soon? I imagine people building their own images/variants might want to enable it as well. Wouldn't want a nasty surprise.

Contributor Author

Even if we enable this for other variants, and people build their own images/variants, they will still be locked into the kexec_load syscall on aarch64 variants that use the 5.4 kernel, since kexec_file_load was implemented for aarch64 after the 5.4 version.
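
For reference, the syscall-selection modes discussed in this thread map onto kexec-tools flags roughly as sketched below. This is an illustration, not the code in prairiedog: `Syscall` and `kexec_args` are hypothetical names, `/boot/vmlinuz` is a placeholder path, and a real invocation would also need the initrd and command-line options.

```rust
use std::process::Command;

/// How kexec should pick the syscall used to load the image.
#[derive(Clone, Copy)]
enum Syscall {
    Auto,     // -a: try kexec_file_load first, fall back to kexec_load
    FileLoad, // -s: use kexec_file_load only
    Load,     // -c: use kexec_load only
}

/// Build the argument list for loading a panic (crash) kernel with `-p`.
fn kexec_args(kernel: &str, syscall: Syscall) -> Vec<String> {
    let flag = match syscall {
        Syscall::Auto => "-a",
        Syscall::FileLoad => "-s",
        Syscall::Load => "-c",
    };
    vec!["-p".to_string(), flag.to_string(), kernel.to_string()]
}

fn main() {
    for &mode in &[Syscall::Auto, Syscall::FileLoad, Syscall::Load] {
        // The command is only constructed and printed here, never spawned:
        // loading a kernel requires root and reserved crash memory.
        let mut cmd = Command::new("kexec");
        cmd.args(kexec_args("/boot/vmlinuz", mode));
        println!("{:?}", cmd);
    }
}
```

Forcing `-s` everywhere gives one code path across kernels, at the cost of failing on kernels that lack kexec_file_load for the target architecture.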

@arnaldo2792
Contributor Author

Changes in force push:

  • Update kexec-tools, makedumpfile, and libelf to latest release
  • Rename prairie-dog to prairiedog
  • Remove conditionals in os.spec, and build all the added packages regardless of the variant
  • Add the disable-kexec-load.service systemd unit to always disable kexec* syscalls regardless of the state of kexec.service
  • Only execute kexec.service if there is memory reserved for the crash kernel
  • Update cmdline parameters for crash kernel, and use the same parameters as Amazon Linux 2
  • Add irqpoll to the cmdline for the crash kernel, depending on the target architecture
  • Move unnecessary mounts from local-fs.target to preconfigured.target

@arnaldo2792
Contributor Author

Force push updates:

  • Gracefully exit while loading the crash kernel when no memory was reserved for the crash kernel

@arnaldo2792 arnaldo2792 marked this pull request as ready for review June 2, 2021 23:12
@arnaldo2792 arnaldo2792 requested a review from bcressey June 3, 2021 00:24
@arnaldo2792 arnaldo2792 linked an issue Jun 4, 2021 that may be closed by this pull request
Comment on lines 240 to 243
// Load the panic kernel from `BOOT_MOUNT_PATH`, letting kexec decide which syscall
// it should use (KEXEC_LOAD or KEXEC_FILE_LOAD). We will use this setting until
// we figure out why kexec doesn't recognize the 5.4 kernel image as a valid file
// to be used with KEXEC_FILE_LOAD.
Contributor

If we're planning to merge this prior to a fix, let's open an issue to track it.

Contributor Author

I was able to port the kernel patch that makes the kexec_file_load syscall work in the x86_64 5.4 kernel. However, I'm still waiting for the kernel folks to help me port the syscall to aarch64, since it was introduced for aarch64 in a later kernel version.

I talked to @tjkirch since he expressed his concerns about using different syscalls depending on the kernel version/architecture. We agreed that we can proceed with my PR provided that we document in a GH issue that we will force the kexec_file_load syscall once the kernel folks add support for it in the 5.4 kernel.

Contributor

I think we should document the restrictions in the same place we document the feature, not (only) in a GH issue.

@arnaldo2792
Contributor Author

Forced push includes:

  • 5.4 kernel patch to use kexec_file_load with the 5.4 kernel
  • makedumpfile patch renamed
  • Renamed kexec.service, kdump.service and configure-boot-mount.service to load-crash-kernel.service, capture-kernel-dump.service, and prepare-boot.service
  • Renamed prairiedog command to prepare-boot.service
  • Moved all systemd services created in this PR to the release package
  • Label boot partition in rpm2img
  • Narrowed down the libraries included in the libelf-devel package

@arnaldo2792 arnaldo2792 requested a review from bcressey June 9, 2021 23:38
This commit adds kexec-tools to all variants
This commit adds libelf to all variants
@bcressey bcressey self-requested a review June 10, 2021 00:21
@arnaldo2792
Contributor Author

Forced push includes:

  • Remove unnecessary patch in the 5.4 kernel that caused failures in the aarch64 variants' build
  • Update commit messages

This commit adds makedumpfile to all variants
This commit moves some of the systemd mount units to be part of
`preconfigured` instead of the `local-fs`. This is to reduce the
overhead of mounting unnecessary mount endpoints during the execution of
the crash kernel.
@arnaldo2792
Contributor Author

Forced push includes:

  • Added documentation to README
  • Fixed makedumpfile patch
  • Addressed suggested nit changes

@arnaldo2792
Contributor Author

Forced push includes fixes in README

@arnaldo2792 arnaldo2792 requested review from webern, jahkeup and jpculp June 11, 2021 20:53
README.md Outdated
There area few important caveats about the provided kdump support:

* Currently, only vmware variants have kdump support enabled
* The system kernel will reserve 256M for the crash kernel, only when the host has at least 2GB of memory; the reserved space won't be available for processes running in the host
Member

Nit. I would change 256M to 256MB to be consistent with "at least 2GB of memory".

@arnaldo2792
Contributor Author

Forced push includes:

  • Patches to make the kexec_file_load syscall work in arm64
  • Fixed nit in README
  • Force the kexec_file_load syscall in prairiedog

@arnaldo2792 arnaldo2792 requested review from bcressey and jpculp June 14, 2021 21:18
README.md Outdated
Bottlerocket provides support to collect kernel crash dumps whenever the system kernel panics.
Once this happens, both the dmesg log and vmcore dump are stored at `/var/log/kdump`, and the system reboots.

There area few important caveats about the provided kdump support:
Contributor

Suggested change
There area few important caveats about the provided kdump support:
There are a few important caveats about the provided kdump support:

Kdump is a Linux feature that boots into a separate crash kernel whenever
the system kernel panics. The crash kernel is loaded into a reserved space in
memory determined by the `crashkernel` kernel parameter. In
Bottlerocket, this parameter is set such that no memory will be reserved
if the host has less than 2GB of memory.

For Bottlerocket, the crash kernel is loaded from the current active
boot partition. The `configure-boot-mount.service` systemd unit
determines which is the current active boot partition and mounts it at
`/boot`. This mount is set as read-only and with private propagation,
so new mount namespaces won't have access to it. As part of this change,
SELinux labels are added to the boot partition when it is created by the
`rpm2img` tool.

The `load-crash-kernel.service` systemd unit loads the crash kernel,
only if memory was reserved for it, and the `kernel.kexec_load_disabled`
setting is `0`. The unit will exit gracefully if no memory was reserved
for the crash kernel. For the moment only the aws-dev and vmware variants
use that kernel parameter.

The `kernel.kexec_load_disabled` setting used to be set in the
`sysctl.conf` configuration file. With this change, the setting is set
using the `disable-kexec-load.service` systemd unit. This unit runs after
`load-crash-kernel.service`, even if the latter wasn't executed or
exited with a non-zero code.

The `capture-kernel-dump.service` systemd unit is set as the target when
the crash kernel is executed. It captures both the dmesg logs and the
kdump-compressed dump excluding:
  * Pages filled with zero
  * Non-private cache pages
  * All cache pages
  * User process data pages
  * Free pages

All the files generated by the `capture-kernel-dump.service` unit are
stored at `/var/log/kdump`, therefore the unit has a strong dependency
on the following services to set up the persistent partition:
  * local-fs.target
  * systemd-sysusers.service
  * systemd-udevd.service
  * systemd-udev-trigger.service
  * systemd-tmpfiles-setup.service
  * systemd-tmpfiles-setup-dev.service

Since `local-fs.target` is a dependency of `capture-kernel-dump.service`,
systemd will attempt to load all the mount units. To prevent this, the
mount units will only be loaded during the execution of the `preconfigured`
target.

No API is provided to enable/disable the dump collection, since the
memory space is reserved and it will be a waste if nothing uses that
space. Dynamically changing the `crashkernel` command-line parameter isn't
an option since we will provide support for secure boot in the future.
@arnaldo2792
Contributor Author

Forced push includes nit fix

@arnaldo2792 arnaldo2792 merged commit 36f3720 into bottlerocket-os:develop Jun 14, 2021
@arnaldo2792 arnaldo2792 deleted the kdump-support branch June 16, 2021 22:48
Development

Successfully merging this pull request may close these issues.

Add kdump support
4 participants