Normalize plan before sending to increase the plan apply throughput #5407

Closed · wants to merge 7 commits

Conversation

arshjohar (Contributor) commented:

This PR adds normalization of the plan so that only the diff for stopped and preempted allocs is committed to the raft log, enabling better throughput. It also starts using omitempty on some of the structs during msgpack serialization to omit empty fields.
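
To illustrate the omitempty part, here is a minimal, self-contained sketch using the hashicorp/go-msgpack codec package; the cut-down Alloc type and its fields are assumptions, not the real structs.Allocation definition:

package main

import (
    "fmt"

    "github.com/hashicorp/go-msgpack/codec"
)

// Alloc is a cut-down stand-in for structs.Allocation (fields assumed).
type Alloc struct {
    // The codec library reads struct-level options from a blank "_struct"
    // field; ",omitempty" tells the encoder to skip zero-valued fields.
    _struct bool `codec:",omitempty"`

    ID            string
    NodeID        string
    JobID         string
    DesiredStatus string
}

func main() {
    var h codec.MsgpackHandle

    full := Alloc{ID: "a1", NodeID: "n1", JobID: "j1", DesiredStatus: "stop"}
    diff := Alloc{ID: "a1", DesiredStatus: "stop"} // diff-only payload

    var fullBuf, diffBuf []byte
    if err := codec.NewEncoderBytes(&fullBuf, &h).Encode(full); err != nil {
        panic(err)
    }
    if err := codec.NewEncoderBytes(&diffBuf, &h).Encode(diff); err != nil {
        panic(err)
    }

    // The diff encoding omits the empty fields entirely, so a raft log
    // entry for a stopped or preempted alloc carries only what changed.
    fmt.Printf("full: %d bytes, diff: %d bytes\n", len(fullBuf), len(diffBuf))
}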

jrasell and others added 7 commits March 4, 2019 12:01
Currently, when operators need to log onto a machine where an alloc
is running, they must perform an alloc/job status call and then a
second call to discover the node name from the node list.

This updates both the job status and alloc status output to include
the node name, making operator use easier.

Closes #2359
Closes #1180
    },
    Deployment:        plan.Deployment,
    DeploymentUpdates: plan.DeploymentUpdates,
    EvalID:            plan.EvalID,
    NodePreemptions:   preemptedAllocs,
}

if h.optimizePlan {
arshjohar (Contributor, Author) commented:
I didn't find any usages of the SubmitPlan method of Harness, but changed the code to support the newer format of the struct.
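
For context on what the harness now has to model, here is a sketch of the normalization idea; the helper name and the exact set of retained fields are assumptions. The FSM can rehydrate the rest of the alloc from its own state store by ID, so the raft entry only needs the fields that changed.

// Allocation is a cut-down stand-in for structs.Allocation.
type Allocation struct {
    ID                    string
    NodeID                string
    JobID                 string
    DesiredDescription    string
    PreemptedByAllocation string
}

// normalizePreemptedAlloc (name assumed) strips a preempted alloc down
// to the fields the FSM needs to apply the change; everything else is
// already in the state store under the same ID.
func normalizePreemptedAlloc(a *Allocation) *Allocation {
    return &Allocation{
        ID:                    a.ID,
        PreemptedByAllocation: a.PreemptedByAllocation,
    }
}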

@schmichael (Member) left a comment:

Not quite done. Looks good so far and made me think of another optimization that might have a tiny impact: #5452

Will finish up ASAP. Great work @arshjohar!

Resolved review threads:
- nomad/util.go
- nomad/plan_apply.go
- nomad/plan_apply_test.go (2)
- nomad/plan_normalization_test.go
- nomad/state/state_store.go
- nomad/state/state_store_test.go (2)
- nomad/structs/structs.go (2)
@schmichael (Member) left a second comment:

I didn't spot any logic errors. I think it's worth considering new types for the Alloc fragments once more. The stopped/preempted allocs use so few fields that I can't imagine many methods are reused from Alloc. Using independent types gives us extra protection against nil-pointer panics on the servers caused by a developer assuming an Allocation is fully hydrated when it's not.

Otherwise, please comment as many funcs/methods as possible in the form:

// FuncName something something something.
func FuncName() {}

Resolved review thread: nomad/structs/structs.go
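
A minimal sketch of the independent-type suggestion (the AllocationDiff name and its fields are assumptions): with a distinct type, a partially hydrated value can no longer be passed where a full *Allocation is expected, so the mistake surfaces at compile time rather than as a nil-pointer panic on a server.

// AllocationDiff (name assumed) carries only what a stopped or
// preempted alloc writes to the raft log.
type AllocationDiff struct {
    ID                    string
    DesiredDescription    string
    PreemptedByAllocation string
}

// Rehydrate applies the diff to the fully hydrated copy the server
// already holds in its state store, returning a complete Allocation.
// Allocation here is the cut-down stand-in from the earlier sketch.
func (d *AllocationDiff) Rehydrate(existing *Allocation) *Allocation {
    out := *existing
    out.DesiredDescription = d.DesiredDescription
    out.PreemptedByAllocation = d.PreemptedByAllocation
    return &out
}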
@DingoEatingFuzz (Contributor) commented:

@arshjohar, the 0.9.1-dev branch has been merged into master. Please reopen this PR against master.

    return fmt.Errorf("alloc lookup failed: %v", err)
}
if alloc == nil {
    continue
A Contributor commented:

This should return an error that bubbles up so that the plan apply fails and the worker is forced to do an index refresh. Otherwise, if there's a race between a forced garbage collection and the scheduler making an update to the alloc, the alloc could be gone from the state store before it gets here, and this would silently return true even though the update didn't actually make it.
cc @dadgar to double-check the above ^

Another Contributor replied:

I think it is safer to return an error here, as it indicates that the scheduler made an update based on stale information. I don't think it is likely to ever be hit, though: the alloc not being in the state store means it has been GC'd, which is only possible if the user force-GC'd between the scheduler snapshot and the plan apply.
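
A sketch of the fix being discussed, following the shape of the fragment above (snap.AllocByID is the state-snapshot lookup; the error messages and surrounding signature are assumptions):

alloc, err := snap.AllocByID(ws, id)
if err != nil {
    return fmt.Errorf("alloc lookup failed: %v", err)
}
if alloc == nil {
    // Failing the plan apply forces the worker to refresh its state
    // index instead of silently applying an update to a GC'd alloc.
    return fmt.Errorf("alloc %q not found in state store", id)
}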

@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 12, 2023