scheduler: fix reconciliation of reconnecting allocs #16609

Merged 7 commits on Mar 24, 2023

Commits on Mar 22, 2023

  1. 7e334ff
  2. scheduler: fix reconciliation of reconnecting allocs

    When a disconnected client reconnects the `allocReconciler` must find the
    allocations that were created to replace the original disconnected
    allocations.
    
    This search was only run against a subset of non-terminal untainted
    allocations, so if the replacement allocations were not in that
    subset the reconciler didn't stop them, leaving the job in an
    inconsistent state.
    
    This inconsistency was only resolved in a later job evaluation, but by
    that point the allocation was already considered reconnected, so the
    specific reconnection logic was not applied, leading to unexpected
    outcomes.
    
    This commit fixes the problem by running the reconnecting allocation
    reconciliation logic earlier in the process, leaving the rest of the
    reconciler oblivious to reconnecting allocations.
    
    It also uses the full set of allocations to search for replacements,
    stopping them even if they are not in the `untainted` set.
    
    The `SystemScheduler` is not affected by this bug because
    disconnected clients don't trigger replacements: every eligible client
    is already running an allocation.
    lgfa29 committed Mar 22, 2023
    3ab49e9
  3. changelog: add entry for #16609

    lgfa29 committed Mar 22, 2023
    a87af84
  4. fix reconciler test

    lgfa29 committed Mar 22, 2023
    66241b8
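
As a rough illustration of the fix described in 3ab49e9 (searching the full set of allocations for replacements instead of only a subset), here is a minimal Go sketch. The `Alloc` type, its fields, and `findReplacements` are hypothetical simplifications, not Nomad's actual types or reconciler code, and matching a replacement by name is an assumption made for the example.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified stand-in for an allocation.
type Alloc struct {
	ID            string
	Name          string // assumption: a replacement shares the original's name
	DesiredStatus string // "run" or "stop"
}

// findReplacements returns the IDs of candidates that look like replacements
// for any of the reconnecting allocations and are not already set to stop.
func findReplacements(reconnecting, candidates []Alloc) []string {
	var ids []string
	for _, r := range reconnecting {
		for _, c := range candidates {
			if c.ID == r.ID || c.Name != r.Name || c.DesiredStatus == "stop" {
				continue
			}
			ids = append(ids, c.ID)
		}
	}
	return ids
}

func main() {
	orig := Alloc{ID: "a1", Name: "web[0]", DesiredStatus: "run"}
	repl := Alloc{ID: "a2", Name: "web[0]", DesiredStatus: "run"}
	other := Alloc{ID: "a3", Name: "web[1]", DesiredStatus: "run"}

	// Hypothetical subset that happens not to contain the replacement.
	subset := []Alloc{orig, other}
	// Full set of allocations for the job.
	all := []Alloc{orig, repl, other}

	fmt.Println("from subset:", findReplacements([]Alloc{orig}, subset))
	fmt.Println("from full set:", findReplacements([]Alloc{orig}, all))
}
```

When the search only sees the subset, the replacement `a2` is never found and so would never be stopped. Searching the full set finds it regardless of which bucket it was sorted into.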

Commits on Mar 23, 2023

  1. fix typo

    lgfa29 committed Mar 23, 2023
    f88c15a
  2. scheduler: handle reconnecting replacements

    If the replacement for a reconnecting allocation is also reconnecting,
    we need to make sure we only compare the original with the replacement,
    and not the other way around. Otherwise the replacement may stop the
    original if they tie in the selection criteria.
    lgfa29 committed Mar 23, 2023
    0b97e34
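
The tie problem described in 0b97e34 can be sketched as follows. This is a simplified illustration under stated assumptions, not the reconciler's real code: `Alloc`, `Score`, `pick`, and `reconcileReconnecting` are hypothetical names, and using `CreateIndex` to tell the original from its replacement is an assumption made for the example.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified stand-in for an allocation.
type Alloc struct {
	ID          string
	Name        string
	CreateIndex uint64 // assumption: a replacement is created after its original
	Score       int    // hypothetical stand-in for the selection criteria
}

// pick returns the allocation to keep. Ties favor the original.
func pick(original, replacement Alloc) Alloc {
	if replacement.Score > original.Score {
		return replacement
	}
	return original
}

// reconcileReconnecting returns the set of allocation IDs to stop.
func reconcileReconnecting(reconnecting, all []Alloc) map[string]bool {
	stop := map[string]bool{}
	for _, orig := range reconnecting {
		for _, other := range all {
			if other.ID == orig.ID || other.Name != orig.Name {
				continue
			}
			// Only compare original -> replacement. Without this guard a
			// reconnecting replacement would also be treated as an
			// "original" and, on a tie, would stop the real original.
			if other.CreateIndex <= orig.CreateIndex {
				continue
			}
			if keep := pick(orig, other); keep.ID == orig.ID {
				stop[other.ID] = true
			} else {
				stop[orig.ID] = true
			}
		}
	}
	return stop
}

func main() {
	orig := Alloc{ID: "orig", Name: "web[0]", CreateIndex: 10, Score: 5}
	repl := Alloc{ID: "repl", Name: "web[0]", CreateIndex: 20, Score: 5}

	// Both allocations are reconnecting and tie on the selection criteria.
	both := []Alloc{orig, repl}
	fmt.Println(reconcileReconnecting(both, both))
}
```

With the ordering guard in place only the replacement is marked to stop. Removing the guard would let the second iteration call `pick(repl, orig)`, which keeps `repl` on a tie and marks the original to stop as well.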

Commits on Mar 24, 2023

  1. scheduler: remove filterByFailedReconnect method

    Since we are now already iterating over the reconnecting allocations in
    a specific method, having a separate loop to find failed allocations is
    unnecessary.
    lgfa29 committed Mar 24, 2023
    129dda0