Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: fix waiting on preempted alloc #12779

Merged
merged 1 commit into from
Apr 26, 2022
Merged

client: fix waiting on preempted alloc #12779

merged 1 commit into from
Apr 26, 2022

Commits on Apr 26, 2022

  1. client: fix waiting on preempted alloc

    Fixes #10200
    
    **The bug**
    
    A user reported receiving the following error when an alloc was placed
    that needed to preempt existing allocs:
    
    ```
    [ERROR] client.alloc_watcher: error querying previous alloc:
    alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
    failed: index error: UUID must be 36 characters"
    ```
    
    The previous alloc (8e) was already complete on the client. This is
    possible if an alloc stops *after* the scheduling decision was made to
    preempt it, but *before* the node running both allocations was able to
    pull and start the preemptor. While that is hopefully a narrow window of
    time, you can expect it to occur in high throughput batch scheduling
    heavy systems.
    
    However the RPC error made no sense! `previous_alloc` in the logs was a
    valid 36 character UUID!
    
    **The fix**
    
    The fix is:
    
    ```
    -		prevAllocID:  c.Alloc.PreviousAllocation,
    +		prevAllocID:  watchedAllocID,
    ```
    
    The alloc watcher new func used for preemption improperly referenced
    Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
    multiple allocs are preempted, a watcher is created for each with
    watchedAllocID set properly by the caller. In this case
    Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
    error was coming from! Sadly we were properly referencing
    watchedAllocID in the log, so it made the error make no sense!
    
    **The repro**
    
    I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
    and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
    for ease of repro.
    
    First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
    then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
    that evicts 2 low priority jobs. Everything worked as expected.
    
    However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
    it reproduces the bug because the watcher references PreviousAllocation
    instead of watchedAllocID.
    schmichael committed Apr 26, 2022
    Configuration menu
    Copy the full SHA
    5328f3c View commit details
    Browse the repository at this point in the history