Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of client: fix waiting on preempted alloc into release/1.1.x #12789

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #12779 to be assessed for backporting due to the inclusion of the label backport/1.1.x.

The below text is copied from the body of the original PR.


Fixes #10200

The bug

https://github.com/hashicorp/nomad-enterprise/issues/707

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops after the scheduling decision was made to
preempt it, but before the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high throughput batch scheduling
heavy systems.

However the RPC error made no sense! previous_alloc in the logs was a
valid 36 character UUID!

The fix

The fix is:

-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,

The alloc watcher new func used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the UUID must be 36 characters
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, so it made the error make no sense!

The repro

I was able to reproduce this with a dev agent with preemption enabled and lowered limits for ease of repro.

First I started a low priority count 3 job, then a high priority job that evicts 2 low priority jobs. Everything worked as expected.

However if I force it to use the remotePrevAlloc implementation, it reproduces the bug because the watcher references PreviousAllocation instead of watchedAllocID.

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-preempt-error/miserably-assured-lion branch from 3a7565f to 5af2c70 Compare April 26, 2022 20:15
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 04b23b8 into release/1.1.x Apr 26, 2022
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/b-preempt-error/miserably-assured-lion branch April 26, 2022 20:15
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants