Track plan rejection history and automatically mark clients as ineligible #13421

Merged
merged 14 commits into from
Jul 12, 2022

Commits on Jun 18, 2022

  1. febcf30
  2. core: track and act on node plan rejections

    Plan rejections occur when the scheduler worker and the leader plan
    applier disagree on the feasibility of a plan. This may happen for valid
    reasons: since Nomad schedules in parallel, different workers are
    expected to see different state when computing placements.
    
    By the time the final plan reaches the leader plan applier, it may no
    longer be valid because a concurrent scheduling operation took up the
    intended resources. In these situations the plan applier notifies the
    worker that the plan was rejected and that it should refresh its state
    before trying again.
    
    In some rare and unexpected circumstances, workers have been observed
    repeatedly submitting the same plan even though it is rejected every
    time.
    
    While the root cause is still unknown, this mitigation has been put in
    place. The plan applier now tracks the history of plan rejections per
    client and includes in the plan result a list of node IDs that should
    be marked ineligible if the number of rejections within a given time
    window crosses a certain threshold. The window size and threshold value
    can be adjusted in the server configuration.
    lgfa29 committed Jun 18, 2022
    e343ca6
  3. 5beb6a1
  4. changelog: add entry for #13421

    lgfa29 committed Jun 18, 2022
    2db5edc

Commits on Jul 7, 2022

  1. apply code review suggestions

    lgfa29 committed Jul 7, 2022
    ff15de9
  2. 67f42b8
  3. fix tests

    lgfa29 committed Jul 7, 2022
    98ad0d7
  4. 620b413

Commits on Jul 8, 2022

  1. 27e767d
  2. core: refactor plan rejection tracker

    Simplify the interface for `BadNodeTracker` by merging the methods `Add`
    and `IsBad`, since they are always called in tandem, and reduce the
    number and level of log messages generated. Also clean up expired
    records to avoid infinite growth when the cache entry never expires.
    
    Take explicit timestamp to make tests faster and more reliable.
    lgfa29 committed Jul 8, 2022
    ff8f670
  3. bd833f9

Commits on Jul 12, 2022

  1. core: use stable time FSM operation

    Set the timestamp for a plan apply operation at request time to avoid
    non-deterministic operations in the FSM.
    lgfa29 committed Jul 12, 2022
    fb2e761
  2. config: use pointer for plan_rejection_tracker.enabled

    Using a pointer allows us to differentiate between an unset value and an
    explicit `false` if we decide to use `true` by default.
    lgfa29 committed Jul 12, 2022
    f5936f0
  3. test: fix pointer dereference

    lgfa29 committed Jul 12, 2022
    b243d7d
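
The reasoning behind the pointer for `plan_rejection_tracker.enabled` can be shown with a small sketch. The struct `trackerConfig`, the method `enabledOr`, and the helper `boolPtr` are hypothetical names chosen for illustration and are not taken from the PR.

```go
package main

import "fmt"

// trackerConfig is a hypothetical config struct. A nil Enabled means the
// operator did not set the option at all.
type trackerConfig struct {
	Enabled *bool
}

// enabledOr returns the configured value, or def when the option is unset.
func (c *trackerConfig) enabledOr(def bool) bool {
	if c == nil || c.Enabled == nil {
		return def
	}
	return *c.Enabled
}

// boolPtr returns a pointer to a copy of b.
func boolPtr(b bool) *bool { return &b }

func main() {
	unset := &trackerConfig{}
	off := &trackerConfig{Enabled: boolPtr(false)}

	// With a plain bool both configs would hold false and be
	// indistinguishable. With a pointer, a future default of true only
	// applies to the unset one.
	fmt.Println(unset.enabledOr(true)) // true
	fmt.Println(off.enabledOr(true))   // false
}
```

With a non-pointer `bool`, the zero value `false` would mean both "not configured" and "explicitly disabled", so changing the default to `true` later would silently override operators who had turned the feature off.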