
nomad: include snapshot index when submitting plans #5791

Merged · schmichael merged 4 commits into master from b-plan-snapshotindex on Jul 17, 2019

Conversation

@schmichael (Member) commented Jun 7, 2019:

tl;dr

Ensure leader's state snapshot is at or after max(previousPlanResultIndex, plan.SnapshotIndex) when evaluating and applying a plan.

Background

Workers receive evals, create plans, and submit them to the leader for evaluation and application. This scheduling pipeline runs concurrently with Raft, over RPC between server agents, to provide optimistic parallelism. The leader is responsible for evaluating workers' plans by ensuring they do not conflict, and then applying them. Since this entire process is concurrent with Raft consensus, certain invariants must be met by the leader to ensure plans are evaluated and applied serially:

  1. The leader's state used to evaluate plans must be equal to or newer than the state used to create the plan to ensure all referenced objects exist.
  2. The leader's state used to evaluate and apply plans must be after previously applied plans (serialization).

Implementation

The first invariant is enforced by ensuring the state snapshot used to evaluate plans is >= the snapshot at which the plan was created. (new in 1d670f2)

The second invariant is enforced through 3 mechanisms:

  • The previous plan is optimistically applied to the leader's state when evaluating a plan. (pre-existing functionality)
  • A Raft barrier is emitted after leader election which among other things ensures the leader's state includes all previously applied plans. (pre-existing functionality)
  • The leader's state is at or after the previous plan result's index. (new in f82f2a6)

If either invariant cannot be met, the plan fails and the worker must reprocess the evaluation.
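To make the combined check concrete, here is a minimal sketch; the helper name and signature are assumptions for illustration, not the PR's exact code:

    // minPlanIndex returns the lowest state store index at which it is
    // safe to evaluate a plan: at or after the worker's snapshot
    // (invariant 1) and at or after the previous plan result (invariant 2).
    func minPlanIndex(prevPlanResultIndex, planSnapshotIndex uint64) uint64 {
        if planSnapshotIndex > prevPlanResultIndex {
            return planSnapshotIndex
        }
        return prevPlanResultIndex
    }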

Previous Attempt

The previous attempt in #5411 tried to enforce the invariants by ensuring the leader's state had caught up to the latest Raft index. It contained some implementation bugs, but the concept was a valid way to enforce the invariants. However, it suffered from a couple of issues:

  • Certain Raft log entries, such as Barriers, do not increment any indexes in the state store, so the plan applier could never make progress if the previous log entry was a Barrier.
  • Waiting for the leader's state to catch up to Raft's LastIndex, even if possible, could cause unnecessary blocking not required to enforce consistency. As long as the state is at a point to provide the invariants above, it may lag behind Raft's LastIndex without affecting correctness.

@notnoop (Contributor) left a comment:

Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans after the
first plan after a leader election.

Curious, how is that guaranteed now? If two workers submitted independent, non-conflicting plans with the same snapshot index N-1, the planner would apply the first plan and commit it at index N; where does it wait until N is committed and applied to state before evaluating the second plan?

Review threads on nomad/plan_apply.go (outdated, resolved)
Plan application should use a state snapshot at or after the Raft index
at which the plan was created; otherwise it risks being rejected based
on stale data.

This commit adds a Plan.SnapshotIndex which is set by workers when
submitting a plan. SnapshotIndex is set to the Raft index of the
snapshot the worker used to generate the plan.

Plan.SnapshotIndex plays a similar role to PlanResult.RefreshIndex.
While RefreshIndex informs workers their StateStore is behind the
leader's, SnapshotIndex is a way to prevent the leader from using a
StateStore behind the worker's.

Plan.SnapshotIndex should be considered the *lower bound* index for
consistently handling plan application.

Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans *after* the
first plan after a leader election.

The Raft barrier on leader election ensures the leader's StateStore has
caught up to the log index at which it was elected. This guarantees its
StateStore is at an index > lastPlanIndex.
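As a sketch of the shape of the change (surrounding fields elided; only the SnapshotIndex field itself comes from this commit):

    // Sketch: the new field on the plan struct submitted by workers.
    type Plan struct {
        // ...existing fields elided...

        // SnapshotIndex is the Raft index of the state snapshot the
        // worker used to generate this plan. The leader must evaluate
        // the plan against state at or after this index.
        SnapshotIndex uint64
    }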
The previous commit prevented evaluating plans against a state snapshot
older than the snapshot at which the plan was created. This is correct
and prevents failures when retrieving referenced objects that may not
exist until the plan's snapshot index. However, it is insufficient to
guarantee consistency if the following events occur:

1. P1, P2, and P3 are enqueued with snapshot @ 100
2. Leader evaluates and applies Plan P1 with snapshot @ 100
3. Leader evaluates Plan P2 with snapshot+P1 @ 100
4. P1 commits @ 101
5. Leader evaluates and applies Plan P3 with snapshot+P2 @ 100

Since only the previous plan is optimistically applied to the state
store, the snapshot used to evaluate a plan may not contain the N-2
plan!

To ensure plans are evaluated and applied serially, we must consider all
previous plans' committed indexes when evaluating further plans.

Therefore, combined with the last PR, the minimum index at which to
evaluate a plan is:

    max(previousPlanResultIndex, plan.SnapshotIndex)

Rename SnapshotAfter to SnapshotMinIndex. The old name was not
technically accurate. SnapshotAtOrAfter would be more accurate, but it
is wordy and still lacks context about what precisely it is at or after
(the index).

SnapshotMinIndex was chosen as it describes the action (snapshot), a
constraint (minimum), and the object of the constraint (index).
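A rough sketch of what a SnapshotMinIndex-style wait could look like. This polling loop is purely illustrative; the real implementation can block on state store change notifications instead, and the latestIndex parameter is a stand-in:

    // waitForIndex blocks until latestIndex() >= minIndex or ctx expires.
    func waitForIndex(ctx context.Context, latestIndex func() uint64, minIndex uint64) error {
        for latestIndex() < minIndex {
            select {
            case <-ctx.Done():
                return ctx.Err() // state never caught up in time
            case <-time.After(10 * time.Millisecond): // simple poll interval
            }
        }
        return nil
    }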
@schmichael marked this pull request as ready for review June 24, 2019 19:21
@notnoop (Contributor) left a comment:

Overall looks good to me, but I would like to do one more round of review if no one else does.

Review threads on nomad/plan_apply.go (resolved)
// Bound how long we wait for the state store to reach minIndex.
const timeout = 5 * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeout)
snap, err := p.fsm.State().SnapshotMinIndex(ctx, minIndex)
cancel()
A Contributor commented:

Should we use a defer for the cancel invocation? It's a little more idiomatic, and it's possible now that you've extracted the function.
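For reference, a sketch of the suggested defer form, assuming the snippet above was extracted into its own function (the function name, receiver, and return type are illustrative, not the PR's exact code):

    func (p *planner) snapshotMinIndex(minIndex uint64) (*state.StateSnapshot, error) {
        const timeout = 5 * time.Second
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel() // cancel runs when the function returns
        return p.fsm.State().SnapshotMinIndex(ctx, minIndex)
    }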

@schmichael (Member, Author) replied:

Plans must also be committed serially, so Plan N+1 should use a state
snapshot containing Plan N. This is guaranteed for plans after the
first plan after a leader election.

Curious, how is that guaranteed now? If two workers submitted independent, non-conflicting plans with the same snapshot index N-1, the planner would apply the first plan and commit it at index N; where does it wait until N is committed and applied to state before evaluating the second plan?

If three plans (P1, P2, and P3) are all submitted with the same priority and snapshot index, the following could occur (note that the order of the plans is arbitrary; concurrent workers could have submitted them in any order):

  1. Leader dequeues P1
  2. Leader waits until state == P1.SnapshotIndex (100)
  3. Leader evaluates and applies P1
  4. Leader optimistically adds P1 to state snapshot @ 100
  5. Leader dequeues P2, state is already at P2.SnapshotIndex (100)
  6. Leader evaluates P2, waits for P1 to commit, waits for state == P1 commit index (101), applies
  7. Leader optimistically adds P2 to state snapshot @ 101
  8. Leader dequeues P3, state (@101) is already past P3 SnapshotIndex (100)

The leader now evaluates P3 against a state containing both P1 (from index 101) and P2 (from the optimistic insert), and the logic proceeds. Even if SnapshotIndex jumps wildly between future and past values (relative to the leader's state index) depending on queue depth, plan priority, and/or worker speed, the optimistic insert and waiting on the plan commit index ensure plans are evaluated against a state snapshot containing all previous plans.
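Putting the walkthrough together, the applier's ordering can be sketched as below. All names here are stand-ins rather than Nomad's real API, and the optimistic layering of the previous result onto the snapshot is elided:

    // applyPlans evaluates each plan against state at or after
    // max(prevResultIndex, plan.SnapshotIndex), per the invariants above.
    func applyPlans(plans []Plan, waitForIndex func(uint64) error, evaluate func(Plan) (resultIndex uint64, err error)) error {
        var prevResultIndex uint64
        for _, plan := range plans {
            minIndex := prevResultIndex
            if plan.SnapshotIndex > minIndex {
                minIndex = plan.SnapshotIndex
            }
            if err := waitForIndex(minIndex); err != nil {
                return err // plan fails; the worker reprocesses the evaluation
            }
            resultIndex, err := evaluate(plan)
            if err != nil {
                return err
            }
            prevResultIndex = resultIndex // Raft index at which the result committed
        }
        return nil
    }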

@notnoop (Contributor) left a comment:
lgtm - thank you for following through on this. The code comments explain the logic here so well, thanks!

@schmichael merged commit bcfb39d into master Jul 17, 2019
@schmichael deleted the b-plan-snapshotindex branch July 17, 2019 16:25
schmichael added a commit that referenced this pull request Jul 19, 2019
schmichael added a commit that referenced this pull request Jun 7, 2021
The old description of `{plan,worker}.wait_for_index` described the
metric in terms of waiting for a snapshot, which has two problems:

1. "Snapshot" is an overloaded term in Nomad and operators can't be
   expected to know which use we're referring to here.
2. The most important thing about the metric is what we're waiting *on*
   before taking a snapshot: the raft index of the object to be
   processed (plan or eval).

The new description tries to cram all of that context into the tiny
space provided.

See #5791 for details about the `wait_for_index` mechanism in general.
schmichael added a commit that referenced this pull request Jun 7, 2021 (same commit message as above)
tgross pushed a commit that referenced this pull request Jun 9, 2021 (same commit message as above)
github-actions bot commented Feb 7, 2023:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023