add fixes to improve boot speed #1809
One super minor suggestion, but otherwise LGTM!
packages/release/opt-cni-bin.mount
Outdated
What=overlay
Where=/opt/cni/bin
Type=overlay
Options=noatime,nosuid,nodev,lowerdir=/usr/libexec/cni/bin,upperdir=/opt/cni/upper,workdir=/opt/cni/work,context=system_u:object_r:local_t:s0
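As a rough aid for reading the `lowerdir`/`upperdir` options above, the following toy model uses Python's `collections.ChainMap` to mimic how an overlay resolves lookups and writes. It is only a sketch: the plugin names are illustrative placeholders, and real overlayfs behavior (copy-up, whiteouts on delete, the `workdir`) is not modeled.

```python
from collections import ChainMap

# Toy model only: dictionaries stand in for directories.
# "lower" plays the role of lowerdir (read-only, shipped in the image).
# The plugin names here are illustrative placeholders.
lower = {"bridge": "packaged binary", "loopback": "packaged binary"}

# "upper" plays the role of upperdir (writable, starts empty).
upper = {}

# Lookups check upper first, then fall through to lower.
# Writes always land in the first mapping, i.e. upper.
merged = ChainMap(upper, lower)

# Packaged plugins are visible immediately, with nothing copied.
print("bridge" in merged)

# A plugin added at runtime goes to the upper layer only.
merged["custom-plugin"] = "installed at runtime"
print(sorted(merged))
print(sorted(upper))
```

The point of the analogy is that the merged view exposes the packaged binaries without first copying them to writable storage, while anything added later is stored separately in the upper layer.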
Would it be worth adding another directory here to contain the upperdir and workdir to hide them/make it more obvious that they're an implementation detail of the overlay mount? Something like `/opt/cni/.overlay/upper` and `/opt/cni/.overlay/work`?
I opted to use `/var/lib/cni-plugins` for the overlay directories, partly to match the treatment of `/var/lib/kernel-devel` (which I also adjusted along these lines), and partly to guard against cases where pods might be mounting in `/opt/cni` and get confused by the new directories.
Rebase; fix the serial console speed commit to account for the removed |
If they're built in, they can delay mounting the root filesystem. Signed-off-by: Ben Cressey <bcressey@amazon.com>
This disables most of the Docker-related functionality, and avoids a five second delay at startup waiting for the Docker daemon. Signed-off-by: Ben Cressey <bcressey@amazon.com>
We use tmpfiles extensively, and the additional output gives a more complete picture of what happens each boot. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Move the upper, lower, and work directories for the writable kernel development tree into a subdirectory, to better indicate their status as an implementation detail for the overlayfs mount. Signed-off-by: Ben Cressey <bcressey@amazon.com>
This speeds up boot by avoiding the need to copy the binaries to the local storage volume. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Otherwise these units show up as some of the longest running jobs in `systemd-analyze blame` output. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Signed-off-by: Ben Cressey <bcressey@amazon.com>
The wicked daemons will wait for expected devices to appear, which is more reliable than relying on `udevadm settle` and avoids unnecessary boot delays. Signed-off-by: Ben Cressey <bcressey@amazon.com>
We use a one second defer timeout for the DHCPv6 lease essentially to mark it as optional and minimize the boot delay. One second is longer than we would like already, but going sub-second is somewhat invasive because the timeouts are tied to the protocol implementation and can change the client behavior. It's relatively simple to avoid the extra wait caused by an early timer event. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Existing variant platforms all support the 115200 speed for the guest serial device. Signed-off-by: Ben Cressey <bcressey@amazon.com>
Any use of RAID is left up to containers to handle. Signed-off-by: Ben Cressey <bcressey@amazon.com>
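The early timer problem described in the DHCPv6 defer timeout commit can be illustrated with a small sketch. This is not wicked's code and does not claim to show how the actual fix is implemented; the slack value and function names are hypothetical, and the only figure taken from the commit message is the one second timeout.

```python
# Illustrative only: this is not wicked's code, and TIMER_SLACK_MS is a
# hypothetical tolerance chosen for the example.
DEFER_TIMEOUT_MS = 1000  # the one second defer timeout from the commit message
TIMER_SLACK_MS = 50      # hypothetical allowance for an early timer event


def should_defer_strict(elapsed_ms: int) -> bool:
    """Defer only once the full timeout has elapsed."""
    return elapsed_ms >= DEFER_TIMEOUT_MS


def should_defer_tolerant(elapsed_ms: int, slack_ms: int = TIMER_SLACK_MS) -> bool:
    """Treat a timer that fires slightly early as having reached the timeout."""
    return elapsed_ms + slack_ms >= DEFER_TIMEOUT_MS


# A timer that fires 2 ms early versus 2 ms late.
for elapsed_ms in (998, 1002):
    print(elapsed_ms, should_defer_strict(elapsed_ms), should_defer_tolerant(elapsed_ms))
```

With the strict check, a timer event at 998 ms does not count as a timeout, so the client would wait for the next event, roughly another second later. The tolerant check defers in both cases.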
Adjust overlay directory handling per @samuelkarp
LGTM
Issue number:
N/A
Description of changes:
This is a collection of fixes to improve boot speed and time to a usable node, by at least 5 seconds and at most 8 seconds.
Building `kubelet` with the "dockerless" tag saves 5 seconds during service startup, as otherwise cadvisor tries for five seconds to connect to the Docker daemon before printing an error.
The fix for the defer timeout in the wicked DHCPv6 client saves 1 second for around half of launches, in cases where the timer fires a little early and would otherwise trigger another 1 second wait.
Using an overlayfs for the CNI plugin directory saves a variable amount of time by avoiding a potentially slow copy to an unwritten EBS volume in the critical path.
`systemd-tmpfiles-setup` previously took 900 milliseconds or more in most cases, and now takes 100 milliseconds or less, with most of the remaining time spent populating the SELinux modules in `/var/lib/selinux`.
Building support for the PS/2 controller, keyboard, and mouse as modules saves around 400 milliseconds during boot under KVM, as otherwise device mapper waits for the configuration to finish before mounting the root filesystem. They are still loaded later, after the root filesystem is mounted, but at that point we can do more work in parallel.
Disabling RAID auto-detect avoids another potential device wait and reduces `printk` messages. Writing to the console device at 115200 bits per second speeds up those operations by 12x. Console logging continues to be a drag on overall boot speed. We can turn it off altogether to gain at least 2 seconds, but only at a severe cost to debugging capabilities if anything goes wrong. Using the higher device speed obviously helps, but its impact is spread across all threads that might draw the short straw after triggering a `printk` call, and is difficult to quantify.
Removing the `udevadm settle` dependency doesn't yield a measurable improvement in boot speed, but does stop `systemd` from blaming `wicked` for slowing everything down.
I've kept the two commits that added debug output for `systemd-tmpfiles` and the wicked clients, since these were instrumental in identifying the underlying issues and confirming the fixes. These logs are all sent to the journal rather than the console, so they don't compete with existing output or slow down the boot.
Testing done:
For the kernel change: verified that the keyboard and mouse modules were still loaded on x86_64 nodes.
For the changes to kubelet and the CNI plugins directory: verified that sonobuoy runs passed for these versions, and that no Docker related error messages were logged to the journal.
For the "activate" targets: confirmed that these were no longer blamed by `systemd-analyze blame`, and that bootstrap containers still worked as expected.
For the wicked changes: confirmed that the DHCP6 client would defer after the first timeout, whether the timer fired slightly before or slightly after one second elapsed. On instances with DHCP6 enabled, the lease was successfully acquired.
For the udev settle change: used a hacked up local build where wicked was set up to manage "eth1" rather than "eth0", and verified that wicked would still configure the device if I renamed it into existence during the wait.
For the serial console changes: verified that console logs were present for AWS variants across a range of instance types - c1.xlarge, t2.large, m3.2xlarge, c3.large, c4.large, c5.large, c6g.large - and for VMware variants running on ESXi 7.0. Note that we're already using 115200 for GRUB as of #1701, so this setting has previously been validated on a smaller set of instance types.
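For reference, the 12x figure quoted for the console speed change follows from the ratio of the two line speeds. The sketch below works through that arithmetic. It assumes the previous speed was 9600 bits per second (inferred from the 12x ratio, not stated in this pull request) and assumes 8N1 framing, so each character costs 10 bits on the wire; the sample size is a hypothetical amount of console output.

```python
# Assumptions, not taken from the pull request:
#   - OLD_BAUD is inferred from the stated 12x speedup (115200 / 12).
#   - 8N1 framing: 1 start bit + 8 data bits + 1 stop bit per character.
BITS_PER_CHAR = 10
OLD_BAUD = 9600
NEW_BAUD = 115200  # the speed named in the pull request


def transmit_seconds(num_bytes: int, baud: int) -> float:
    """Time needed to push num_bytes through a serial line at the given speed."""
    return num_bytes * BITS_PER_CHAR / baud


speedup = NEW_BAUD / OLD_BAUD
print(f"speedup: {speedup:.0f}x")  # prints "speedup: 12x"

# Hypothetical amount of console output written during boot.
sample_bytes = 64 * 1024
for baud in (OLD_BAUD, NEW_BAUD):
    print(f"{sample_bytes} bytes at {baud} bps: {transmit_seconds(sample_bytes, baud):.2f} s")
```

This only bounds the raw transmission time. As the description notes, the effect on overall boot time is harder to quantify because the cost is spread across whichever threads end up writing to the console.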
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.