
storage: ack Raft proposals after Raft log commit, not state machine apply #38954

Merged

merged 1 commit into cockroachdb:master from nvanbenschoten/earlyAck on Aug 13, 2019

Conversation

nvanbenschoten
Member

Informs #17500.

This is a partial revival of #18710 and a culmination of more recent thinking in #17500 (comment).

The change adjusts the Raft processing loop so that it acknowledges the success of raft entries as soon as it learns that they have been durably committed to the raft log instead of after they have been applied to the proposer replica's replicated state machine. This not only pulls the application latency out of the hot path for Raft proposals, but it also pulls the next raft ready iteration's write to its Raft log (with the associated fsync) out of the hot path for Raft proposals.

This is safe because a proposal through raft is known to have succeeded as soon as it is replicated to a quorum of replicas (i.e. has committed in the raft log). The proposal does not need to wait for its effects to be applied in order to know whether its changes will succeed or fail. The raft log is the provider of atomicity and durability for replicated writes, not (ignoring log truncation) the replicated state machine itself, so a client can be confident in the result of a write as soon as the raft log confirms that it has succeeded.

However, there are a few complications in acknowledging the success of a proposal at this stage:

  1. Committing an entry in the raft log and having the command in that entry succeed are similar but not equivalent concepts. Even if the entry succeeds in achieving durability by replicating to a quorum of replicas, its command may still be rejected "beneath raft". This means that a (deterministic) check after replication decides that the command will not be applied to the replicated state machine. In that case, the client waiting on the result of the command should not be informed of its success. Luckily, this check is cheap to perform so we can do it here and when applying the command. See `Replica.shouldApplyCommand`.
  2. Some commands perform non-trivial work such as updating Replica configuration state or performing Range splits. In those cases, it's likely that the client is interested in not only knowing whether it has succeeded in sequencing the change in the raft log, but also in knowing when the change has gone into effect. There's currently no exposed hook to ask for an acknowledgment only after a command has been applied, so for simplicity, the current implementation only ever acks transactional writes before they have gone into effect. All other commands wait until they have been applied to ack their client.
  3. Even though we can determine whether a command has succeeded without applying it, the effect of the command will not be visible to conflicting commands until it is applied. Because of this, the client can be informed of the success of a write at this point, but we cannot release that write's latches until the write has applied. See `ProposalData.signalProposalResult/finishApplication` (a minimal sketch of this early-ack decision follows below).
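For illustration, here is a minimal, self-contained sketch of that early-ack decision. All names are toy stand-ins rather than the actual CockroachDB API; the real checks live in `Replica.shouldApplyCommand` and `canAckBeforeApplication`.

```go
package main

import "fmt"

// proposal is a toy stand-in for the state of a raft proposal.
type proposal struct {
	isTxnWrite      bool // only transactional writes are early-ack candidates (point 2)
	passesRaftCheck bool // deterministic "beneath raft" rejection check (point 1)
	committedInLog  bool // durably replicated to a quorum of replicas
}

// canAckEarly mirrors the conditions above: the entry must be committed in
// the raft log, must survive the below-raft check, and must be a
// transactional write. Even when it returns true, latches are held until
// the command has actually applied (point 3).
func canAckEarly(p proposal) bool {
	return p.committedInLog && p.passesRaftCheck && p.isTxnWrite
}

func main() {
	p := proposal{isTxnWrite: true, passesRaftCheck: true, committedInLog: true}
	if canAckEarly(p) {
		fmt.Println("acked before application; latches released at apply time")
	}
}
```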

### Benchmarks

The change appears to result in an 8-10% improvement to throughput and a 6-10% reduction in p50 latency across the board on kv0. I ran a series of tests with different node sizes and different workload concurrencies and the win seemed pretty stable. This was also true regardless of whether the writes were to a single Raft group or a large number of Raft groups.

```
name                           old ops/sec  new ops/sec  delta
kv0/cores=16/nodes=3/conc=32    24.1k ± 0%   26.1k ± 1%   +8.35%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48    30.4k ± 1%   32.8k ± 1%   +8.02%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64    34.6k ± 1%   37.6k ± 0%   +8.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72    46.6k ± 1%   50.8k ± 0%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108   58.8k ± 1%   64.0k ± 1%   +8.99%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144   68.1k ± 1%   74.5k ± 1%   +9.45%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=144   55.8k ± 1%   59.7k ± 2%   +7.12%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216   64.4k ± 4%   68.1k ± 4%   +5.65%  (p=0.016 n=5+5)
kv0/cores=72/nodes=3/conc=288   68.8k ± 2%   74.5k ± 3%   +8.39%  (p=0.008 n=5+5)

name                           old p50(ms)  new p50(ms)  delta
kv0/cores=16/nodes=3/conc=32     1.30 ± 0%    1.20 ± 0%   -7.69%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=48     1.50 ± 0%    1.40 ± 0%   -6.67%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/conc=64     1.70 ± 0%    1.60 ± 0%   -5.88%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    1.60 ± 0%    1.50 ± 0%   -6.25%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    1.84 ± 3%    1.70 ± 0%   -7.61%  (p=0.000 n=5+4)
kv0/cores=72/nodes=3/conc=144    2.00 ± 0%    1.80 ± 0%  -10.00%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    2.46 ± 2%    2.20 ± 0%  -10.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=288    2.80 ± 0%    2.60 ± 0%   -7.14%  (p=0.079 n=4+5)

name                           old p99(ms)  new p99(ms)  delta
kv0/cores=16/nodes=3/conc=32     3.50 ± 0%    3.50 ± 0%     ~     (all equal)
kv0/cores=16/nodes=3/conc=48     4.70 ± 0%    4.58 ± 3%     ~     (p=0.167 n=5+5)
kv0/cores=16/nodes=3/conc=64     5.50 ± 0%    5.20 ± 0%   -5.45%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=72     5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=108    5.80 ± 0%    5.50 ± 0%   -5.17%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/conc=144    6.48 ± 3%    6.18 ± 3%   -4.63%  (p=0.079 n=5+5)
kv0/cores=72/nodes=3/conc=144    11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.008 n=5+5)
kv0/cores=72/nodes=3/conc=216    13.4 ± 2%    13.2 ± 5%     ~     (p=0.683 n=5+5)
kv0/cores=72/nodes=3/conc=288    18.2 ± 4%    17.2 ± 3%   -5.70%  (p=0.079 n=5+5)
```

@nvanbenschoten nvanbenschoten requested review from ajwerner and a team July 18, 2019 05:38

Member

@tbg tbg left a comment


Exciting results.

The separation between the two apply stages makes the code complicated. I wonder if this could be simplified by extracting a component that combines the entryGen and the batch and accesses the replica only through the necessary abstraction. At that point you could actually unit test it (for example, by generating random batches and asserting that the right decisions are made against a mocked-out replica), and the code would simplify (:crossed_fingers:) to something like

```go
// pre-stage
apper := newApplier(r, entryGen, cmdAppBatch)
apper.decodeAndMaybeAckEarly()
// regular stage (receives the apper from above)

apper.processAll()
```


the first argument to `newApplier` would be some

```go
type appI interface {
  applyEntry(...)
  canApplyEntry(...)
  // probably many more
}
```

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 6 of 11 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/storage/client_replica_test.go, line 1819 at r3 (raw file):

	verifyKeysWithPrefix := func(prefix roachpb.Key, expectedKeys []roachpb.Key) {
		t.Helper()
		// Raft entry application is asynchronous, so we may not see the updates

How did you go about finding these now-flaky tests? Stress ./pkg/storage for a while?


pkg/storage/client_split_test.go, line 686 at r3 (raw file):

	storeCfg.TestingKnobs.DisableSplitQueue = true
	storeCfg.TestingKnobs.DisableMergeQueue = true
	storeCfg.TestingKnobs.DisableRaftAckBeforeApplication = true

Mind adding a comment to each line like this explaining why we need it in the test?


pkg/storage/cmd_app_batch.go, line 26 at r3 (raw file):

// iterating over committed entries to decode and apply.
//
// The entry and next methods should only be called if valid returns true.

s/should/must/


pkg/storage/replica_application.go, line 44 at r3 (raw file):

// ackCommittedEntriesBeforeAppRaftMuLocked attempts to acknowledge the success
// of raft entries that have been durably committed to the raft log but have not
// yet been applied to the proposer replica's replicated state machine.

Add that whichever entries are handled here will be "unbound", that is, the delayed handling will not see them as needing signaling.

Also this method initializes the local proposals for all entries, which is confusing (you could think that this method is a pure optimization and commenting it out would leave things working, but that doesn't seem to be the case). Perhaps that can be rearranged into its own step or at least called out. Would it make sense to call retrieveLocalProposals?


pkg/storage/replica_application.go, line 85 at r3 (raw file):

	ctx context.Context, gen *entryGen, b *cmdAppBatch,
) (errExpl string, err error) {
	errExpl, err = b.decode(ctx, gen)

Add more comments here -- now you will either have nothing, or a chunk of trivial commands, or a nontrivial command. The common case is obviously that we get a chunk of trivial commands and that this is already all commands, but if we get unlucky there's a nontrivial command at the front and then a large chunk of trivial commands, in which case we won't be using the optimization. I think this is all fine, but please add counters for the number of commands that were in principle optimizable but were missed because they did not occur in the first batch. (Ideally this sits on the stats struct that @ajwerner also used in his PR).
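For illustration, counters along these lines might look like the following sketch; the struct and field names here are hypothetical, not taken from this PR or from @ajwerner's.

```go
// applyStats is a hypothetical stats struct for one entry-application pass,
// counting how often the early-ack optimization was used or missed.
type applyStats struct {
	entriesProcessed int // committed entries seen in this pass
	ackedEarly       int // commands acknowledged before application
	missedEarlyAck   int // optimizable commands that were not in the first decoded batch
}
```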


pkg/storage/replica_application.go, line 114 at r3 (raw file):

				// Signal the proposal's response channel with the result.
				// Make a copy of the response to avoid data races.

Who are we racing with? retrieveLocalProposals has removed this from the proposal map already.


pkg/storage/replica_application.go, line 138 at r3 (raw file):

// writes. This could be exposed as an option on the batch header instead.
func canAckBeforeApplication(ba *roachpb.BatchRequest) bool {
	return ba.IsTransactionWrite()

Shouldn't it also be trivial? You're not handling any state stuff in your early acks. The two probably coincide but it should be explicit.
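A sketch of the tightened predicate being suggested here; `isTrivial` is a hypothetical helper standing in for however triviality would be surfaced at the request level (in reality it is a property of the evaluated command, not the batch).

```go
func canAckBeforeApplication(ba *roachpb.BatchRequest) bool {
	// Sketch only: require both a transactional write and a trivial command.
	return ba.IsTransactionWrite() && isTrivial(ba)
}
```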


pkg/storage/replica_application.go, line 217 at r3 (raw file):
Take it or leave it:

// The caller early-acked some proposals that are being committed right now via ackCommittedEntriesBeforeAppRaftMuLocked, which already decoded the first batch and retrieved the local proposals.


pkg/storage/replica_raft.go, line 566 at r3 (raw file):

	// TODO(nvanbenschoten): I'm pretty sure that etcd/raft returns proposals
	// in rd.Entries and rd.CommittedEntries at the same time for single peer
	// Raft groups. In that case, we'll need to disable this optimization, as

This will also be the case whenever a peer is catching up. We should grab the first Index for entries being appended right now and disable the optimization for anything in CommittedEntries with indexes >= Index.
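A sketch of that guard, assuming `rd` is the `raft.Ready` being processed (illustrative, using etcd/raft types; `math` would need importing):

```go
// maxEarlyAckIndex returns the highest committed-entry index that is safe
// to acknowledge early. Entries at or above the first index in rd.Entries
// are being appended in this same ready and are not yet durable locally.
func maxEarlyAckIndex(rd raft.Ready) uint64 {
	if len(rd.Entries) == 0 {
		return math.MaxUint64
	}
	return rd.Entries[0].Index - 1
}
```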


pkg/storage/replica_test.go, line 8840 at r3 (raw file):

}

// TestAckWriteBeforeApplication tests that the success of transactional writes

Would like to see the negative tested for some requests that aren't supposed to be early-acked, i.e. a nontrivial one. I wish we could exercise this whole pipeline in more isolation, but it looks like we're slowly moving there.

@tbg tbg self-requested a review July 18, 2019 08:24
Member

@tbg tbg left a comment


Reviewed 5 of 11 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)

@nvanbenschoten
Member Author

@ajwerner I'm curious to hear your perspective on the applier abstraction @tbg was mentioning above. Where would you expect this abstraction boundary to be drawn? Did you have a similar structure in your replication package work? If we make a change like that here then ideally it will be done in such a way that the boundary doesn't need to change even as we move where entry application is performed (e.g. async-application) or where the implementation lives (e.g. a replication package)

@ajwerner
Contributor

> @ajwerner I'm curious to hear your perspective on the applier abstraction @tbg was mentioning above. Where would you expect this abstraction boundary to be drawn? Did you have a similar structure in your replication package work? If we make a change like that here then ideally it will be done in such a way that the boundary doesn't need to change even as we move where entry application is performed (e.g. async-application) or where the implementation lives (e.g. a replication package)

I think the basic idea makes a lot of sense. I had envisioned application primarily as a client concern of a replication package. In the code I had written, committed entries get handed back to the client for application. I hadn't considered batching and the desire to decode and examine a batch before the rest of the ready gets processed, but given that we want to do that, I'm supportive of encapsulating application logic above replication logic.

@tbg
Member

tbg commented Jul 23, 2019

Where this could also come in handy is the datadriven raft testing I've been working on (not quite ready for prime time yet, but @nvanbenschoten has seen it). I'm introducing this primarily for testing the nitty-gritty of joint config change interactions including simulated crashes, but I was also thinking that it can be used by apps using Raft (e.g. etcd) to do similar testing of the storage and apply loop in a more end-to-end fashion. In doing so, it would be good to be able to mock out the actual command application.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2019
…ation

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430).
If these two entries happened to land in the same application batch on
the leaseholder then the first entry would be rejected and the second
would apply. The replica's LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (which must have hit a proposalIllegalLeaseIndex) called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have as obvious of symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10
`cdc/ledger/rangefeed=true` roachtests after reducing its duration down to
5m. Usually, at least one of these tests would hit the `negative refcount`
assertion. I then incrementally added more and more logging around entry
application until I painted a full picture of which logical ops were being
consumed by the rangefeed processor and why the same raft command was being
applied twice (once it became clear that one was). After a few more rounds
of fine-tuning the logging, the interaction with reproposals with a new max
lease index became clear.

Release note: None
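A heavily simplified sketch of the guard this fix describes; the types and fields are illustrative stand-ins, and the real logic lives in tryReproposeWithNewLeaseIndex.

```go
type proposalData struct{ maxLeaseIndex uint64 }

type replica struct{ leaseAppliedIndex uint64 }

// tryRepropose only reproposes when no copy of the command can still apply
// at a legal lease index; otherwise reproposing could apply it twice.
func (r *replica) tryRepropose(p *proposalData) {
	if p.maxLeaseIndex > r.leaseAppliedIndex {
		// A copy of this command is still in flight with a legal max lease
		// index; swallow the rejection of this stale copy.
		return
	}
	// ... assign a new max lease index and repropose ...
}
```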
craig bot pushed a commit that referenced this pull request Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this pull request Jul 30, 2019
39149: roachtest: skip kv/contention/nodes=4 for release-19.1 r=tbg a=nvanbenschoten

Fixes #39116.

release-19.1 is susceptible to the issues described in #36089, so it won't reliably pass this test.

39160: storage: add DisableRaftLogQueue to StoreTestingKnobs r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test.

Release note: None

39161: storage: address TODO in TestPushTxnHeartbeatTimeout r=tbg a=nvanbenschoten

Pulled from #38954, which I want to keep focused, especially with the PR's new secondary focus on refactoring entry application to be easier to mock and test.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this pull request Jul 30, 2019
39158: storage: clean up below-Raft Split and Merge locking r=bdarnell a=nvanbenschoten

This PR contains three related cleanups that I stumbled into while cleaning up raft entry application in preparation for #38954.

The first cleanup is that we avoid improperly handling errors that may be returned when acquiring the split/merge locks. If this operation fails then we can't simply return an error and not apply the command. We need to be deterministic at this point. The only option we have is to fatal. See #19448 (comment).

The second cleanup is that we then remove stale code that was attempting to recover from failed split and merge application. This error handling only made sense in a pre-proposer evaluated KV world.

The third cleanup addresses a TODO to assert that the RHS range during a split is not initialized. The TODO looks like it went in before proposer-evaluated KV, which protects us against both replays and reproposals.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 2, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they have been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for things
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread from the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The scheduler will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in the next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
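To give a feel for the shape such a package might take, here is a rough sketch of the interfaces involved; these names approximate, but are not guaranteed to match, the package's actual exported API.

```go
package apply

import "context"

// Command is a single replicated command decoded from a committed raft entry.
type Command interface {
	Index() uint64 // position in the raft log
}

// CheckedCommand has passed the deterministic below-raft checks and, if
// eligible, may be acknowledged to its client before application.
type CheckedCommand interface {
	Command
	Rejected() bool
	CanAckBeforeApplication() bool
	AckSuccess(ctx context.Context) error
}

// StateMachine abstracts the replica so that entry application can be
// driven, and unit tested, against a mock implementation.
type StateMachine interface {
	NewBatch() Batch
	ApplySideEffects(CheckedCommand) error
}

// Batch stages commands and applies them atomically to the state machine.
type Batch interface {
	Stage(Command) (CheckedCommand, error)
	Commit(ctx context.Context) error
	Close()
}
```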
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 2, 2019
…pplication

This will be pulled out into cockroachdb#38954 instead.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 7, 2019
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/earlyAck branch 2 times, most recently from 3404a42 to 9ad4965 on August 12, 2019 21:18
Member Author

@nvanbenschoten nvanbenschoten left a comment


Ok, I've rebased this on top of #39254, which allowed a lot of the code here to dissolve away. Most of this has already been reviewed over there, so I'm not sure how much more love this PR needs on its own.

Outside of the rebase, the only big change is that we're now properly handling raft ready structs with overlap between the Entries and CommittedEntries slices. I updated TestAckWriteBeforeApplication to explore this case.

PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/storage/client_replica_test.go, line 1819 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

How did you go about finding these now-flaky tests? Stress ./pkg/storage for a while?

Yes, exactly. I stressed the whole package for a while and kept skipping tests until it ran without issue. Most of this went away once we disabled the optimization for single-replica Ranges.


pkg/storage/replica_application.go, line 114 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Who are we racing with? retrieveLocalProposals has removed this from the proposal map already.

Done.

Contributor

@ajwerner ajwerner left a comment


:lgtm: The refactor with the apply package made this really clean.

Reviewed 20 of 20 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/storage/replica_application_cmd.go, line 121 at r4 (raw file):

// CanAckBeforeApplication implements the apply.CheckedCommand interface.
func (c *replicatedCmd) CanAckBeforeApplication() bool {
	// CanAckBeforeApplication determines whether the request type is compatable

s/compatable/compatible/

how's that one in our dictionary?


pkg/storage/store_test.go, line 2491 at r4 (raw file):

	// Query the range with a single scan, which should cause all intents
	// to be resolved.

Why this here?

…apply

Release note (performance improvement): Raft entries no longer wait to
be applied to the RocksDB storage engine before signaling their success
to clients; they now only wait until they are committed in their Raft log.
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs

bors r=ajwerner

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @tbg)


pkg/storage/replica_application_cmd.go, line 121 at r4 (raw file):

Previously, ajwerner wrote…

s/compatable/compatible/

how's that one in our dictionary?

Done.


pkg/storage/store_test.go, line 2491 at r4 (raw file):

Previously, ajwerner wrote…

Why this here?

It was some cleanup I made in an earlier version of this PR that I figured I'd carry over.

@craig
Contributor

craig bot commented Aug 13, 2019

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request Aug 13, 2019
38954: storage: ack Raft proposals after Raft log commit, not state machine apply r=ajwerner a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Contributor

craig bot commented Aug 13, 2019

Build succeeded

@craig craig bot merged commit 87aaea7 into cockroachdb:master Aug 13, 2019
@tbg
Member

tbg commented Aug 13, 2019

```
==================
Write at 0x00c000b2cdfc by goroutine 178:
  sync/atomic.AddInt32()
      /usr/local/go/src/runtime/race_amd64.s:269 +0xb
  github.com/cockroachdb/cockroach/pkg/util/syncutil.(*RWMutex).RLock()
      /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/mutex_sync_race.go:75 +0x54
  github.com/cockroachdb/cockroach/pkg/config.(*SystemConfig).getZoneEntry()
      /go/src/github.com/cockroachdb/cockroach/pkg/config/system.go:299 +0x5e
  github.com/cockroachdb/cockroach/pkg/config.(*SystemConfig).getZoneConfigForKey()
      /go/src/github.com/cockroachdb/cockroach/pkg/config/system.go:334 +0x5e
  github.com/cockroachdb/cockroach/pkg/config.(*SystemConfig).GetZoneConfigForKey()
      /go/src/github.com/cockroachdb/cockroach/pkg/config/system.go:279 +0x159
  github.com/cockroachdb/cockroach/pkg/storage.(*Store).systemGossipUpdate.func2()
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1698 +0xf6
  github.com/cockroachdb/cockroach/pkg/storage.(*storeReplicaVisitor).Visit()
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:350 +0x2e2
  github.com/cockroachdb/cockroach/pkg/storage.(*Store).systemGossipUpdate()
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1696 +0x23c
  github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start.func2()
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1409 +0x95
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x15f

Previous read at 0x00c000b2cdfc by goroutine 156:
  reflect.typedmemmove()
      /usr/local/go/src/runtime/mbarrier.go:177 +0x0
  reflect.packEface()
      /usr/local/go/src/reflect/value.go:119 +0x104
  reflect.valueInterface()
      /usr/local/go/src/reflect/value.go:1009 +0x170
  reflect.deepValueEqual()
      /usr/local/go/src/reflect/deepequal.go:132 +0xa61
  reflect.deepValueEqual()
      /usr/local/go/src/reflect/deepequal.go:101 +0xc7a
  reflect.deepValueEqual()
      /usr/local/go/src/reflect/deepequal.go:101 +0xc7a
  reflect.deepValueEqual()
      /usr/local/go/src/reflect/deepequal.go:101 +0xc7a
  reflect.deepValueEqual()
      /usr/local/go/src/reflect/deepequal.go:98 +0x1866
  reflect.DeepEqual()
      /usr/local/go/src/reflect/deepequal.go:196 +0x2f1
  github.com/cockroachdb/cockroach/pkg/storage_test.TestStoreGossipSystemData.func4()
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/client_lease_test.go:208 +0x1f5
  github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoonError.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:48 +0x52
  github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration()
      /go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:184 +0xf4
  github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoonError()
      /go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:54 +0x11a
  github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon()
      /go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:34 +0x59
  github.com/cockroachdb/cockroach/pkg/storage_test.TestStoreGossipSystemData()
      /go/src/github.com/cockroachdb/cockroach/pkg/storage/client_lease_test.go:207 +0xbcf
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:865 +0x163
```

@tbg
Member

tbg commented Aug 13, 2019

There isn't an issue filed for this yet. Do you think this might be new, @nvanbenschoten?

@nvanbenschoten
Member Author

No, that was because of #39609. We shouldn't have been using `reflect.DeepEqual` on a struct that embeds a mutex. I fixed that over there.
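For the record, the racy pattern in the report above is that `reflect.DeepEqual` walks every field of a value via reflection, including the unexported internal fields of an embedded mutex, so running it concurrently with lock acquisitions is a data race. A minimal, hypothetical reproduction (not the actual cockroachdb code; the `config` type here is made up; run with `go run -race` to see a similar report):

```go
// Illustrates the racy pattern: reflect.DeepEqual reads the mutex's internal
// state non-atomically while another goroutine mutates it by locking.
package main

import (
	"reflect"
	"sync"
	"time"
)

type config struct {
	mu    sync.RWMutex // DeepEqual will walk this field too
	zones map[string]int
}

func main() {
	c := &config{zones: map[string]int{"default": 1}}

	go func() {
		for i := 0; i < 1000; i++ {
			c.mu.RLock() // writes the mutex's internal counters
			_ = c.zones["default"]
			c.mu.RUnlock()
		}
	}()

	want := &config{zones: map[string]int{"default": 1}}
	for i := 0; i < 1000; i++ {
		// Racy: reads c.mu's unexported fields while the goroutine locks it.
		_ = reflect.DeepEqual(c, want)
	}
	time.Sleep(10 * time.Millisecond)

	// The fix: compare only the data, not the struct holding the mutex.
	c.mu.RLock()
	_ = reflect.DeepEqual(c.zones, want.zones)
	c.mu.RUnlock()
}
```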

@tbg
Member

tbg commented Aug 13, 2019 via email

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/earlyAck branch August 13, 2019 18:15
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Aug 15, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that, with the current
  code structure, adding an optimization to acknowledge clients after their
  raft entries have committed but before they have been applied was moving
  in the opposite direction and making things even harder to understand due
  to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed how difficult it is to test any of the logic involved in raft
  entry application. This has naturally led to testing at a distance using
  tools like testing hooks, which is frustrating and delicate. As a result,
  we're missing tests for things like old migrations that we still need to
  support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The scheduler will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in the next commit, which is
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
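
As a rough illustration of the abstraction that commit message describes, here is a hypothetical Go sketch of a state-machine interface driven by an entry-application loop, together with a mock implementation of the kind that makes unit testing possible without touching raft. The names (`StateMachine`, `ApplyCommittedEntries`, `mockSM`) are illustrative only, not the actual `storage/apply` API:

```go
// Hypothetical sketch of an entry-application abstraction. A committed raft
// entry is decoded into a command, checked for below-raft rejection,
// applied (unless rejected), and then acknowledged to its client, in log
// order.
package main

import "fmt"

// Command is a replicated command decoded from a committed raft entry.
type Command struct {
	Index    uint64
	Rejected bool // set by a deterministic below-raft check
}

// StateMachine is what the applier drives; Replica would implement it.
type StateMachine interface {
	CheckCommand(c *Command)       // deterministic rejection check
	ApplyCommand(c *Command) error // apply effects to the state machine
	AckOutcome(c *Command)         // acknowledge the waiting client
}

// ApplyCommittedEntries is the macro-level loop over committed commands.
func ApplyCommittedEntries(sm StateMachine, cmds []*Command) error {
	for _, c := range cmds {
		sm.CheckCommand(c)
		if !c.Rejected {
			if err := sm.ApplyCommand(c); err != nil {
				return err
			}
		}
		sm.AckOutcome(c)
	}
	return nil
}

// mockSM lets the loop be unit-tested without raft or a real Replica.
type mockSM struct{ applied []uint64 }

func (m *mockSM) CheckCommand(c *Command) { c.Rejected = c.Index%7 == 0 } // arbitrary rule
func (m *mockSM) ApplyCommand(c *Command) error {
	m.applied = append(m.applied, c.Index)
	return nil
}
func (m *mockSM) AckOutcome(c *Command) {
	fmt.Printf("index %d rejected=%t\n", c.Index, c.Rejected)
}

func main() {
	sm := &mockSM{}
	_ = ApplyCommittedEntries(sm, []*Command{{Index: 6}, {Index: 7}, {Index: 8}})
	fmt.Println("applied:", sm.applied)
}
```

The property this sketch mirrors from the description above is that the deterministic rejection check runs before application, so a durably replicated command can still fail to take effect, and the acknowledgment step is an explicit, separately testable stage.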
@tbg tbg mentioned this pull request Oct 28, 2022