
client: fix waiting on preempted alloc #12779

Merged 1 commit on Apr 26, 2022

Conversation

schmichael (Member) commented Apr 26, 2022

Fixes #10200

**The bug**

https://github.com/hashicorp/nomad-enterprise/issues/707

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high-throughput, batch-scheduling-heavy
systems.

However, the RPC error made no sense! `previous_alloc` in the logs was a
valid 36-character UUID!
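
(For context, the `index error` wording is the kind of message produced when a lookup validates the ID's length before consulting any state. The following is a minimal, hypothetical sketch of that style of check, not Nomad's actual server code; it only shows why an empty ID trips it while a real UUID passes.)

```
package main

import "fmt"

// lookupAlloc is an illustrative stand-in for a server-side alloc lookup
// that rejects malformed IDs before touching the state index.
func lookupAlloc(allocID string) error {
	if len(allocID) != 36 {
		return fmt.Errorf("index error: UUID must be 36 characters")
	}
	// A real lookup would query the state index here.
	return nil
}

func main() {
	fmt.Println(lookupAlloc(""))                                     // fails: empty ID
	fmt.Println(lookupAlloc("11111111-2222-3333-4444-555555555555")) // passes the length check
}
```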

**The fix**

The fix is:

```
-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,
```

The alloc watcher constructor used for preemption improperly referenced
`Alloc.PreviousAllocation` instead of the passed-in `watchedAllocID`. When
multiple allocs are preempted, a watcher is created for each one with
`watchedAllocID` set properly by the caller. In the preemption case
`Alloc.PreviousAllocation=""` -- which is where the `UUID must be 36 characters`
error was coming from! Confusingly, the log line *did* reference
`watchedAllocID` correctly, which is what made the error look so nonsensical!
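
To make the field mix-up concrete, here is a stripped-down, hypothetical sketch of the pattern (the types and function below are illustrative stand-ins, not Nomad's real API): the caller builds one watcher per preempted alloc ID, so the constructor has to use the ID it was handed rather than the alloc's `PreviousAllocation` field, which is empty during preemption.

```
package main

import "fmt"

// Allocation is a pared-down stand-in for Nomad's alloc struct.
type Allocation struct {
	ID                 string
	PreviousAllocation string   // set when an alloc is rescheduled; "" for preemption
	PreemptedAllocs    []string // allocs this placement preempts
}

// prevAllocWatcher stands in for the per-alloc watcher built by the client.
type prevAllocWatcher struct {
	prevAllocID string
}

// newWatcher mirrors the shape of the fix: the watched ID must come from the
// caller, not from the alloc's PreviousAllocation field.
func newWatcher(alloc *Allocation, watchedAllocID string) *prevAllocWatcher {
	return &prevAllocWatcher{
		// prevAllocID: alloc.PreviousAllocation, // the bug: "" during preemption
		prevAllocID: watchedAllocID, // the fix: the ID the caller asked us to watch
	}
}

func main() {
	alloc := &Allocation{
		ID:              "new-alloc",
		PreemptedAllocs: []string{"preempted-alloc-1", "preempted-alloc-2"},
	}
	// One watcher per preempted alloc, each handed its own ID by the caller.
	for _, id := range alloc.PreemptedAllocs {
		w := newWatcher(alloc, id)
		fmt.Println("waiting on", w.prevAllocID)
	}
}
```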

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl) and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl) for ease of repro.

First I started a [low-priority, count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad), then a [high-priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad) that preempts 2 of the low-priority allocs. Everything worked as expected.

However, if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147), it reproduces the bug because the watcher references `PreviousAllocation` instead of `watchedAllocID`.
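
(Why does only the remote path surface the error? A heavily simplified, hypothetical sketch, assuming the remote watcher polls the server by RPC using its stored `prevAllocID` while its log line carries the watched ID -- the combination that produced the confusing report above. None of the names below are Nomad's actual code.)

```
package main

import (
	"errors"
	"fmt"
)

// getAllocRPC stands in for the server-side alloc lookup.
func getAllocRPC(allocID string) error {
	if allocID == "" { // fails the server's 36-character UUID check (see the sketch under "The bug")
		return errors.New(`rpc error: alloc lookup failed: index error: UUID must be 36 characters`)
	}
	return nil
}

// remoteWatcher is an illustrative stand-in: prevAllocID feeds the RPC,
// watchedAllocID feeds the log line.
type remoteWatcher struct {
	prevAllocID    string // populated from Alloc.PreviousAllocation before the fix
	watchedAllocID string // the ID the caller actually wants to wait on
}

func (w *remoteWatcher) pollOnce() {
	if err := getAllocRPC(w.prevAllocID); err != nil {
		// The log prints a valid 36-character UUID even though the RPC was
		// handed an empty one -- the confusing mismatch from the bug report.
		fmt.Printf("[ERROR] error querying previous alloc: previous_alloc=%s error=%q\n",
			w.watchedAllocID, err)
	}
}

func main() {
	w := &remoteWatcher{
		prevAllocID:    "", // Alloc.PreviousAllocation during preemption
		watchedAllocID: "11111111-2222-3333-4444-555555555555",
	}
	w.pollOnce()
}
```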

tgross (Member) left a comment

LGTM!

It looks like this is a very old bug (pre-0.9), but I spent a little time going through lingering open issues and didn't manage to find one for it. #10200 looks like it could be related, but we kind of went off into the weeds of cluster topology in that issue, so I'm not sure.

Don't forget the changelog entry and backport labels!

schmichael (Member, Author) commented Apr 26, 2022

@tgross Thanks for making me take another look at #10200. At first I assumed these were unrelated, but now I'm not so sure! At the very least #10200 exhibits the same `UUID must be 36 characters` log line due to the same bug where the wrong field is used when waiting on a previous/preempted allocation.

So I'm actually going the ambitious route of closing that one with this. If there's another compounding bug in #10200, hopefully someone will reopen it or file a new issue. I just don't see us making progress on #10200 without a repro after this lands, and I didn't want folks to think there was no progress on it. (I'll leave a similar note on that issue when merging this.)

github-actions (bot) commented
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 15, 2022
Labels
backport/1.1.x (backport to 1.1.x release line), backport/1.2.x (backport to 1.2.x release line)
Development

Successfully merging this pull request may close these issues.

Alloc.GetAlloc fails on rebooted client
3 participants