Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate potential race conditions while rebooting #631

Merged
merged 4 commits into from
Jun 26, 2024

Conversation

cbgbt
Copy link
Contributor

@cbgbt cbgbt commented Jun 25, 2024

Issue number:
Closes #630

Description of changes:
This PR best left in commit-order, as it also performs several maintenance chores.

    agent: eliminate potential races while rebooting
    
    The agent had a few conditions under which it can enter the
    MonitoringUpdate state before it successfully uncordons the node:
    
    * Brupop updates its state into Rebooted before the reboot terminates
      the process
    * After rebooting, an error ocurrs after updating the state but before
      performing the uncordon.
    
    We avoid this conditions by:
    * Exiting the agent process in cases where the desired state is Rebooted
      but we are not yet running the new version.
    * Always uncordoning the node before marking that we have successfully
      transitioned into the Rebooted state.
    * Defensively uncordoning the node once again when we enter the
      Monitoring state.

Testing done:
I built a custom Bottlerocket variant which always waits for a minute before performing a reboot, then used this to test these changes. In doing this, I was able to reliably trigger the race condition, and then also confirm that this new code does not misbehave when reboots occur slowly.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Several keys have been deprecated in favor of the same default values we
were specifying before. By removing them and adding `version = 2` under
licenses, we will maintain the same behavior but without the deprecation
notices that we being printed.
The agent had a few conditions under which it can enter the
MonitoringUpdate state before it successfully uncordons the node:

* Brupop updates its state into Rebooted before the reboot terminates
  the process
* After rebooting, an error ocurrs after updating the state but before
  performing the uncordon.

We avoid this conditions by:
* Exiting the agent process in cases where the desired state is Rebooted
  but we are not yet running the new version.
* Always uncordoning the node before marking that we have successfully
  transitioned into the Rebooted state.
* Defensively uncordoning the node once again when we enter the
  Monitoring state.
@cbgbt cbgbt merged commit bcc413b into bottlerocket-os:develop Jun 26, 2024
2 checks passed
@cbgbt cbgbt deleted the reboot-racer branch June 26, 2024 00:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

brupop agent (sometimes) leaving nodes unschedulable
3 participants