
Don't persist allocs of destroyed alloc runners #6207

Merged · 2 commits · Aug 26, 2019

Conversation

@notnoop (Contributor) commented Aug 25, 2019

This fixes a bug where allocs that have been GCed get re-run after the client
is restarted. A heavily used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner destroyed due to GC remains in the client's
alloc runner set. It continues to be persisted periodically until the alloc is
GCed by the server. During that time, the client DB contains the alloc but not
its individual tasks' status or its completed state. On client restart, the
client assumes the alloc is in the pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
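
A minimal sketch of the guard, assuming a destroyed flag protected by a mutex on the alloc runner (the type and method names here are illustrative stand-ins, not the exact Nomad client code):

```go
package client

import "sync"

// Illustrative stand-ins for Nomad's types; not the real client code.
type Allocation struct{ ID string }

type StateDB interface {
	PutAllocation(alloc *Allocation) error
	DeleteAllocation(allocID string) error
}

type allocRunner struct {
	alloc   *Allocation
	stateDB StateDB

	// destroyed is set once the runner has been GC'd and its state
	// deleted from the client database.
	destroyed     bool
	destroyedLock sync.Mutex
}

// PersistState writes the alloc to the client state DB unless the runner
// has already been destroyed; persisting after destruction would let a
// GC'd alloc be resurrected on the next client restart.
func (ar *allocRunner) PersistState() error {
	ar.destroyedLock.Lock()
	defer ar.destroyedLock.Unlock()

	if ar.destroyed {
		return nil
	}
	return ar.stateDB.PutAllocation(ar.alloc)
}
```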

This is a short-term fix; we should consider revamping client state
management. Storing alloc and task information non-transactionally and
non-atomically, concurrently while the alloc runner is running and potentially
changing state, is a recipe for bugs.

Fixes #5984
Related to #5890

@schmichael (Member) left a comment

> On client restart, the client assumes the alloc is in the pending state and re-runs it.

That line from the description seems like the much bigger bug. Perhaps we should just remove the DeleteTaskBucket calls altogether, so they're only deleted as part of the atomic alloc bucket deletion.

That way, if the agent crashes mid-GC, the alloc state should be terminal when the agent restarts. The partially GC'd alloc would not be restarted and would again be eligible for GC to finish what the previous run started.

Update: Talked with Mahmood and good news! We already atomically delete the alloc+task buckets. The problem is that before this PR we could resurrect just the alloc and get into this state. This PR prevents an alloc from being stored after being GC'd, which prevents agent restarts from thinking it's pending.
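
For context, that kind of atomic deletion can be sketched with bbolt roughly as follows; the bucket names and layout are illustrative, not Nomad's actual state schema:

```go
package client

import (
	bolt "go.etcd.io/bbolt"
)

var allocationsBucketName = []byte("allocations")

// deleteAllocBucket removes the per-alloc bucket, including any nested
// task buckets, in a single transaction: either all state for the alloc
// is deleted or none of it is, so a crash mid-GC cannot leave an alloc
// without its task or terminal state.
func deleteAllocBucket(db *bolt.DB, allocID string) error {
	return db.Update(func(tx *bolt.Tx) error {
		allocations := tx.Bucket(allocationsBucketName)
		if allocations == nil {
			return nil // nothing persisted yet
		}
		err := allocations.DeleteBucket([]byte(allocID))
		if err == bolt.ErrBucketNotFound {
			return nil
		}
		return err
	})
}
```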

client/allocrunner/alloc_runner.go (review thread, outdated, resolved)
Protect against a race between the destroy and persist-state goroutines.

The downside is that the database I/O operation will run while holding the
lock and may run indefinitely. The risk of holding the lock for a long time is
slow destruction, but slow I/O has bigger problems.
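
A minimal sketch of the locking this commit describes, extending the illustrative allocRunner types from the description above (Destroy and DeleteAllocation are stand-in names, not the exact client API):

```go
// Destroy and PersistState take the same destroyedLock, and the state
// deletion happens while the lock is still held. A concurrent PersistState
// therefore blocks until the delete finishes and then sees destroyed=true,
// instead of re-writing the alloc after it was removed.
func (ar *allocRunner) Destroy() error {
	ar.destroyedLock.Lock()
	defer ar.destroyedLock.Unlock()

	ar.destroyed = true
	// DB I/O runs under the lock: slow I/O makes destruction slow, but it
	// closes the window where a persist could resurrect the GC'd alloc.
	return ar.stateDB.DeleteAllocation(ar.alloc.ID)
}
```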
@github-actions bot commented Feb 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Feb 3, 2023
Successfully merging this pull request may close these issues.

Runaway nomad process after Nomad client reboot