alloc_runner: wait when starting suspicious allocs #6216

Merged: 3 commits, Aug 28, 2019

Commits on Aug 27, 2019

  1. alloc_runner: wait when starting suspicious allocs

    This commit aims to help users whose clients are susceptible to the
    bug where destroyed allocs are restarted upgrade to the latest version.
    Without this, such users will have their tasks run unexpectedly on
    upgrade and only see the bug resolved after a subsequent restart.
    
    If, on restore, the client sees a pending alloc without any other
    persisted info, then assume the persisted state of the alloc is
    corrupt, rather than that the client happened to be killed right
    when the alloc was assigned to it.
    
    A few reasons motivate this behavior:
    
    Statistically speaking, corruption is the more likely cause.  A
    long-running client has a higher chance of having allocs persisted
    incorrectly with pending state.  Being killed right when an alloc is
    about to start is relatively unlikely.
    
    Also, delaying the start of an alloc that hasn't started (hopefully
    by only seconds) is less severe than launching too many allocs, which
    may bring the client down.
    
    More importantly, this helps customers upgrade their clients without
    risking taking their clients down and destabilizing their cluster. We
    don't want existing users to be forced to trigger the bug while they
    upgrade and restart their cluster.
    Mahmood Ali committed Aug 27, 2019
    cbc521e
  2. Alternative approach: avoid restoring

    This uses an alternative approach where we avoid restoring the alloc
    runner in the first place, if we suspect that the alloc may have been
    completed already.
    Mahmood Ali committed Aug 27, 2019
    493945a
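As a rough sketch of the alternative approach, the restore loop could partition persisted allocs up front and never create a runner for the suspected ones. The `storedAlloc` type, the `partitionForRestore` helper, and the predicate passed to it are all hypothetical. The commit message does not spell out the exact suspicion test, so it is left as a parameter here.

```go
package main

import "fmt"

// storedAlloc is a hypothetical stand-in for an alloc read back from the
// client's state store on startup.
type storedAlloc struct {
	ID           string
	HasTaskState bool // assumed flag: some task state was persisted
}

// partitionForRestore splits allocs into those to restore and those to
// skip. The suspect predicate decides which allocs may already have
// completed; its definition is an assumption supplied by the caller.
func partitionForRestore(allocs []storedAlloc, suspect func(storedAlloc) bool) (restore, skip []storedAlloc) {
	for _, a := range allocs {
		if suspect(a) {
			skip = append(skip, a) // no alloc runner is created for these
			continue
		}
		restore = append(restore, a)
	}
	return restore, skip
}

func main() {
	allocs := []storedAlloc{
		{ID: "a1", HasTaskState: true},
		{ID: "a2", HasTaskState: false},
	}
	restore, skip := partitionForRestore(allocs, func(a storedAlloc) bool {
		return !a.HasTaskState
	})
	fmt.Println("restore:", len(restore), "skip:", len(skip))
}
```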

Commits on Aug 28, 2019

  1. rename to hasLocalState, and ignore clientstate

    The ClientState being pending isn't a good criterion, as an alloc may
    have been updated in place before it was completed.
    
    Also updated the logic so that we only check for task states.  If an
    alloc has deployment state but no persisted tasks at all, restore will
    still fail.
    Mahmood Ali committed Aug 28, 2019
    8b05f87
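The check described in this last commit might look roughly like the sketch below. Only the name `hasLocalState` comes from the commit title. The `allocState` type and its fields are hypothetical simplifications, and the real function's signature and data sources are not shown on this page.

```go
package main

import "fmt"

// allocState is a hypothetical summary of what was persisted for an alloc.
type allocState struct {
	ClientStatus       string          // deliberately ignored by the check
	HasDeploymentState bool            // deliberately ignored by the check
	TaskStates         map[string]bool // task name -> state was persisted
}

// hasLocalState reports whether any task of the alloc has persisted
// state. Client status and deployment state do not influence the result.
func hasLocalState(a allocState) bool {
	for _, persisted := range a.TaskStates {
		if persisted {
			return true
		}
	}
	return false
}

func main() {
	// Deployment state alone is not enough to count as local state.
	onlyDeployment := allocState{ClientStatus: "running", HasDeploymentState: true}
	fmt.Println(hasLocalState(onlyDeployment))

	withTask := allocState{
		ClientStatus: "pending",
		TaskStates:   map[string]bool{"web": true},
	}
	fmt.Println(hasLocalState(withTask))
}
```

Ignoring the client status follows the reasoning in the message: an alloc updated in place may be pending again even though it has run before.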