Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a panic related to updating alloc taskgroups #5805

Merged
merged 3 commits into from
Jun 11, 2019
Merged

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Jun 10, 2019

This fixes a panic in the client code in 0.9.3 that occur if a name of a job's TaskGroup changed. The cause being that server updated the stopped alloc with updated job so the alloc job task group no longer matched the actual task group the alloc is running.

In a sense, this is a follow up of #5717 but for stopped allocs instead of pre-empted allocs.

The test captures the failing condition in https://travis-ci.org/hashicorp/nomad/jobs/543933168 .

Also, I added some defensive code so that the alloc runner uses the tasks map that it initialized internally when starting the alloc rather than rely on alloc job field not getting modified.

Mahmood Ali added 3 commits June 10, 2019 17:14
When an alloc is stopped, test that we don't update the job found in
alloc with new job  that is no longer relevent for this alloc.
Alloc runner already tracks tasks associated with alloc.  Here, we
become defensive by relying on the alloc runner tracked tasks, rather
than depend on server never updating the job unexpectedly.
@notnoop notnoop requested a review from preetapan June 11, 2019 00:57
@@ -230,18 +230,18 @@ func (s *StateStore) UpsertPlanResults(index uint64, results *structs.ApplyPlanR
return err
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we used to update the Job field on all allocs (including stopped) in 0.9.1. The plan applier used to add both stopped and new allocs here [1] , and the state store would set the job from the plan to all allocs [2].

So something else must have changed in tandem to cause this panic in 0.9.2. Were you able to determine if any other additional changes besides the plan applier changes contributed to this?

[1] https://github.com/hashicorp/nomad/blob/v0.9.1/nomad/plan_apply.go#L168
[2] https://github.com/hashicorp/nomad/blob/v0.9.1/nomad/state/state_store.go#L191

Was there something in allocrunner that changed as well in 0.9.2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 0.9.1, DenormalizeAllocationJobs would set alloc.Job if it's not set already and it's not terminal: https://github.com/hashicorp/nomad/blob/v0.9.1/nomad/structs/funcs.go#L275-L286 .

I confirmed this behavior by adding some logging statements in various places. I can demo it offline.

@notnoop notnoop merged commit 7423b30 into master Jun 11, 2019
@notnoop notnoop deleted the b-tg-rename-panic branch June 11, 2019 12:54
@github-actions
Copy link

github-actions bot commented Feb 9, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants