
kv: make disk I/O asynchronous with respect to Raft state machine #17500

Closed
nvanbenschoten opened this issue Aug 7, 2017 · 38 comments · Fixed by #94165
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Aug 7, 2017

⚠️ EDIT: The original optimization proposed here was implemented in #38954. See #17500 (comment) for the remainder of this issue.

Raft, along with most consensus protocols in the Paxos family, distinguishes committed entries from executed entries. An entry is committed when the Raft protocol has completed and the entry has been sufficiently replicated and persisted in the Raft log. The entry is then executed when it is applied to a given replica's state machine. This distinction is important because in cases where the execution of an entry onto the state machine does not produce a result, it is not necessary to wait for execution before sending a commit notification to clients.
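To make the distinction concrete, here is a minimal etcd/raft-style handling loop (a sketch with illustrative helper names, not CockroachDB code), annotated with where commit and execution happen:

```go
package raftsketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// runReadyLoop sketches a basic Raft handling loop. Entries surfaced in
// rd.CommittedEntries have already completed consensus (committed); they are
// only "executed" once applied to the state machine below. The proposal in
// this issue is to acknowledge clients at the first point rather than the
// second. ackClient and applyToStateMachine are hypothetical stand-ins.
func runReadyLoop(n raft.Node, st *raft.MemoryStorage,
	send func(raftpb.Message), ackClient, applyToStateMachine func(raftpb.Entry)) {
	for {
		rd := <-n.Ready()
		// Persist new entries and HardState before sending messages
		// (the conservative ordering; see the discussion further down).
		_ = st.Append(rd.Entries)
		if !raft.IsEmptyHardState(rd.HardState) {
			_ = st.SetHardState(rd.HardState)
		}
		for _, m := range rd.Messages {
			send(m)
		}
		for _, ent := range rd.CommittedEntries {
			ackClient(ent)           // committed: safe to respond here
			applyToStateMachine(ent) // executed: today the response waits for this
		}
		n.Advance()
	}
}
```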

Currently in Cockroach, the proposing replica waits until command execution before responding to clients. However, the changes made for proposer-evaluated KV ensured that all MVCC-related logic happens upstream of Raft, so by the time we reach entry execution we're simply applying a WriteBatch to RocksDB. While some of these execution steps can create ReplicaCorruptionErrors, I don't think it's necessary, or possibly even correct, to attach these errors to the proposal response itself. The entry has already been replicated through Raft, so a local ReplicaCorruptionError doesn't mean that all replicas are corrupted or that the command failed. In fact, after looking at the code I don't think that proposalResult needs anything populated at execution time for correctness. Because of this, I suspect that all of the time spent in applyRaftCommand, including writing to RocksDB and performing stats computations, is unnecessary latency.

Preliminary results from a (very) rough draft of the change show a 2-3% improvement on average latency for the KV benchmark with a --batch size of 4:

Without change (batch=4):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s        0         597164          995.3     32.2     32.5     41.9     65.0    939.5

BenchmarkBlocks	  597164	   1004753.2 ns/op

With change (batch=4):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s        0         612652         1021.1     31.4     30.4     41.9     58.7    604.0

BenchmarkBlocks	  612652	    979351.4 ns/op

With a --batch size of 40, the results were even more pronounced:

Without change (batch=40):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.0s        0         616040         2053.4    134.0     54.5    151.0   2415.9  10200.5

BenchmarkBlocks	  616040	    486990.5 ns/op

With change (batch=40):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.0s        0         708040         2360.1    114.7     54.5    130.0   1140.9  10200.5

BenchmarkBlocks	  708040	    423713.8 ns/op

Note that while I think we can respond to clients sooner, I don't think we can pop the command out of the CommandQueue until after it has actually been executed because of how proposer evaluated kv works upstream of Raft.

@tschottdorf @bdarnell

Jira issue: CRDB-6037

Epic CRDB-22644

@nvanbenschoten nvanbenschoten self-assigned this Aug 7, 2017
@andreimatei
Contributor

But there's a lease check in processRaftCommand (which is "downstream" of Raft). The result depends on that check...

@nvanbenschoten
Member Author

nvanbenschoten commented Aug 7, 2017

The lease check happens before applyRaftCommand is called though, which is what I'm trying to avoid waiting on. We can still check the lease before replying like we do now while avoiding the extra latency created by the majority of entry execution. Somewhat accidentally, this is exactly what my PoC branch does.

@petermattis
Collaborator

I can't think of anything off-hand that would break with this change, but @tschottdorf and @bdarnell know this area of code the best.

@bdarnell
Contributor

bdarnell commented Aug 8, 2017

I think it would be fine to respond to the client before applying the write batch. I'm not sure it's OK to do so before we update the local HLC with the command's timestamp, so I'd move it down a little further.
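Roughly, the ordering being suggested looks like this (a hedged sketch with made-up types and helpers, not the actual Replica code):

```go
package raftsketch

import (
	"errors"
	"time"
)

// Hypothetical stand-ins for illustration only; not CockroachDB's real types.
type command struct {
	timestamp  time.Time
	writeBatch []byte
}

type replica struct{}

func (r *replica) leaseValid(c command) bool { return true }
func (r *replica) updateClock(t time.Time)   {}
func (r *replica) applyWriteBatch(wb []byte) {}
func respondToClient(c command, err error)   {}

// processCommitted keeps the lease check and the local HLC update ahead of the
// client response, but moves the write batch application (the disk I/O) after it.
func (r *replica) processCommitted(c command) {
	if !r.leaseValid(c) {
		respondToClient(c, errors.New("lease mismatch"))
		return
	}
	r.updateClock(c.timestamp)      // advance the local HLC with the command's timestamp
	respondToClient(c, nil)         // acknowledge before the disk write
	r.applyWriteBatch(c.writeBatch) // state machine application happens afterwards
}
```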

How does this interact with @irfansharif's #16624? Applying the write batch should be faster when we've decoupled it from the raft log and its synchronous writes, so the benefit of this change may be smaller.

@petermattis
Collaborator

Mentioned this to @nvanbenschoten in person: a further optimization here would be to respond to all of the committed commands before applying any of the associated write batches. I have an abandoned PR that combined the Raft batches. Probably worth taking another look at it: #15648

@nvanbenschoten
Member Author

@petermattis the only thing to be careful with there is that none of the batches' corresponding commands can be removed from the CommandQueue until we apply the merged WriteBatch. I like the idea though and will investigate because there's some serious overlap here.

I'm not sure how this interacts with #16624, but I would also expect it to reduce the benefit we can gain from this change. Still, right now we're just leaving extra latency on the table, which could be especially detrimental to large Raft batches.

@petermattis
Collaborator

Note that #16624 isn't making it into 1.1 due to the modest improvement combined with the stability concerns. We should keep #16624 in mind while making changes, but certainly not use it to block nearer term wins.

@bdarnell
Contributor

bdarnell commented Aug 8, 2017

OK, if we've decided to punt #16624 into 1.2, this seems like a good idea.

@nvanbenschoten nvanbenschoten added this to the 1.1 milestone Aug 9, 2017
@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 17, 2017
@petermattis
Collaborator

Another area for investigation is handling the application of Raft commands differently on followers vs. the leader. Followers need to keep track of the committed Raft log index, but they don't actually need to apply the commands until they become the leader. At the very least, this suggests there are opportunities for batching command application.
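For illustration, a rough sketch of what follower-side batching of command application could look like (the types and the flush policy are assumptions, not a design):

```go
package raftsketch

import "go.etcd.io/raft/v3/raftpb"

// applyBatcher tracks committed entries as they arrive but defers applying
// them until a batch has built up (or an explicit flush, e.g. before the
// replica campaigns). apply is a hypothetical hook that writes one batch to
// the state machine.
type applyBatcher struct {
	pending []raftpb.Entry
	limit   int
	apply   func([]raftpb.Entry)
}

func (b *applyBatcher) add(committed []raftpb.Entry) {
	b.pending = append(b.pending, committed...)
	if len(b.pending) >= b.limit {
		b.flush()
	}
}

func (b *applyBatcher) flush() {
	if len(b.pending) == 0 {
		return
	}
	b.apply(b.pending)
	b.pending = nil
}
```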

@bdarnell
Contributor

they don't actually need to apply the commands until they become the leader

They need to be caught up (at least with respect to all ChangeReplicas commits) to become a candidate, not just the leader. And I think even as a follower, falling too far behind on ChangeReplicas can be a bad thing. But yes, queueing up changes and batching them on followers can be helpful.

@bdarnell
Contributor

Another possibility: currently, in Replica.handleRaftReady, we write all new log entries and the HardState to disk before sending any messages. This is conservative; some messages can be sent concurrently with the write to disk, and doing so would improve latency by allowing followers to start their disk syncs before the leader has completed its own.

Specifically, a MsgApp can be sent before the Entries it contains have been synced to disk. However, MsgApp also contains the Commit index (stored in the HardState), and it cannot be sent until the corresponding HardState has been persisted (I think). In theory, most MsgApps either contain new Entries or an updated Commit index, but not both, so many MsgApp messages should be able to take advantage of this optimization.
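Sketched in code, the reordering might look like the following (illustrative helpers only; it also glosses over the caveat above about MsgApps that carry a new Commit index):

```go
package raftsketch

import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// handleReadyEarlyMsgApp sends MsgApp messages before the local log write so
// followers can start their disk syncs in parallel with the leader's, then
// durably appends Entries/HardState, then sends the remaining messages.
func handleReadyEarlyMsgApp(rd raft.Ready,
	send func(raftpb.Message), appendAndSync func(raft.Ready)) {
	var deferred []raftpb.Message
	for _, m := range rd.Messages {
		if m.Type == raftpb.MsgApp {
			send(m) // the entries it carries need not be locally durable yet
		} else {
			deferred = append(deferred, m) // e.g. MsgAppResp/MsgVoteResp wait for the sync
		}
	}
	appendAndSync(rd) // write rd.Entries and rd.HardState, then fsync
	for _, m := range deferred {
		send(m)
	}
}
```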

@nvanbenschoten
Member Author

Batching the application of Raft commands would also be simpler on followers because followers don't need to deal with the proposal of future commands. Proposer-evaluated KV requires that future proposals look at the MVCC/engine layer to compute their WriteBatches after all prerequisite commands have been applied. This constrains us in the way I said before:

Note that while I think we can respond to clients sooner, I don't think we can pop the command out of the CommandQueue until after it has actually been executed because of how proposer evaluated kv works upstream of Raft.

I foresee this constraint making it more difficult for any batching mechanism on the leaseholder, because batching might delay the proposal of future commands.

@nvanbenschoten
Member Author

nvanbenschoten commented Aug 30, 2017

@bdarnell you're referencing the optimized Raft pipeline from section 10.2.1 in the Raft thesis, right?

[screenshot: optimized Raft pipeline diagram from section 10.2.1 of the Raft thesis]

It certainly seems like a clear win, although I'm not sure the interface exposed by etcd/raft would be suitable for the full extent of the optimization:

The leader may even commit an entry before it has been written to its own disk, if a majority of
followers have written it to their disks; this is still safe

I doubt supporting that case is very important in practice anyway, though.

@bdarnell
Contributor

Yes, more or less, although that diagram doesn't quite work for the architecture of etcd/raft. And I agree that the edge case of a leader committing an entry without having it in its own log is not worth supporting.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 23, 2017
…ecute

This change addresses the first optimization discussed in cockroachdb#17500.

The change seems to work and provides a modest performance boost.
Unfortunately, I don't think we'll want to consider merging it at
the moment. The problem is that while it is technically safe to
respond to clients before performing the Raft command application,
doing so is a nightmare for testing. Pretty much every
test in the `storage` package expects to be able to perform an
operation and then "reach beneath raft" immediately to operate
on the result. This can range from inspecting Raft entries to
working on the most up-to-date `Replica` state.

To support this change, all of these tests would need to be
updated to handle the now asynchronous operations performed
in `handleEvalResultRaftMuLocked`. I addressed this by adding
a testing knob called `DisableRaftRespBeforeApplication` in
this change. The problem is that I don't feel very comfortable
with it because we basically need to use it for all tests
(indirectly through `multiTestContext` and `LocalTestCluster`)
which means that we probably aren't testing this optimization
thoroughly. We could disable the optimization on a finer
granularity but this would become a serious issue for
maintainability and I'm not sure it would be worth it.

Perhaps there's some middle ground: returning to the client after
performing in-memory state updates but before performing
persistent state updates? Something like calling:
1. `handleEvalResultRaftMuLocked`
2. `maybeRespondToClient`
3. `applyRaftCommand`

This would solve a lot of the testing issues present here without
the need to use the `DisableRaftRespBeforeApplication` knob, but
I'm almost certain that wouldn't be safe to do.

I think cockroachdb#15648 will run into a similar issue to this. We'll either
need to block clients while we combine Raft batches or we'll need
to update tests which expect a client response to be an indication
that the command has already been applied in all cases. Things
might not be as bad in that case though because less is being
done asynchronously.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 7, 2017
Referenced in cockroachdb#17500.

This change implements the optimization in the Raft thesis under the
section: 10.2.1 Writing to the leader’s disk in parallel. The optimization
allows the leader to sync new entries to its disk after it has sent the
corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:
1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 12, 2017
Referenced in cockroachdb#17500.

This change implements the optimization in the Raft thesis under the
section: 10.2.1 Writing to the leader’s disk in parallel. The optimization
allows the leader to sync new entries to its disk after it has sent the
corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:
1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

Release note (performance improvement): Raft followers now write to
their disks in parallel with the leader.
@nvanbenschoten
Member Author

Most of this was addressed in #19229. The original idea was tested in #18710, where it did not show a significant speedup. I may revisit that branch in the future.

There are a few other ideas here related to batching of disk writes/syncs beneath Raft. None of these will be addressed in the 2.0 timeframe.

@nvanbenschoten nvanbenschoten removed this from the 2.0 milestone Feb 12, 2018
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Nov 9, 2022
ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues
the Batch to the WAL, adds to the memtable, and waits until the batch is
visible in the memtable, and then returns to the caller. The caller is
responsible for calling Batch.SyncWait to wait until the write to the
WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two
WaitGroups, so waiting for the visibility can happen separately from
waiting for the WAL write. Additionally, the channel used as a semaphore
for reserving space in the two lock-free queues is split into two channels,
since dequeueing from these queues can happen in arbitrary order.
There may be some performance overhead of pushing and popping from two
channels instead of one.

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)
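For illustration, a minimal sketch of how a caller might use this split commit path (the ApplyNoSyncWait/SyncWait signatures are inferred from the commit message and PR above, so treat them as assumptions):

```go
package raftsketch

import "github.com/cockroachdb/pebble"

// writeThenSyncLater enqueues a batch without waiting for the WAL fsync, does
// other latency-sensitive work, and only then waits for durability.
func writeThenSyncLater(db *pebble.DB) error {
	b := db.NewBatch()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		return err
	}
	// Enqueue to the WAL and memtable; returns once the batch is visible,
	// not once it is durable.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		return err
	}
	// ... e.g. send Raft messages here, before the fsync completes ...

	// Possibly on a separate goroutine: block until the WAL write is synced.
	if err := b.SyncWait(); err != nil {
		return err
	}
	return b.Close()
}
```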
tbg pushed a commit to etcd-io/raft that referenced this issue Dec 21, 2022
Fixes #12257.

This change adds opt-in support to raft to perform local storage writes
asynchronously from the raft state machine handling loop.

A new AsyncStorageWrites configuration instructs the raft node to write to its
local storage (raft log and state machine) using a request/response message
passing interface instead of the default `Ready`/`Advance` function call
interface. Local storage messages can be pipelined and processed asynchronously
(with respect to `Ready` iteration), facilitating reduced interference between
Raft proposals and increased batching of log appends and state machine
application. As a result, use of asynchronous storage writes can reduce
end-to-end commit latency and increase maximum throughput.

When AsyncStorageWrites is enabled, the `Ready.Message` slice will include new
`MsgStorageAppend` and `MsgStorageApply` messages. The messages will target a
`LocalAppendThread` and a `LocalApplyThread`, respectively. Messages to the same
target must be reliably processed in order. In other words, they can't be
dropped (like messages over the network) and those targeted at the same thread
can't be reordered. Messages to different targets can be processed in any order.

`MsgStorageAppend` carries Raft log entries to append, election votes to persist,
and snapshots to apply. All writes performed in response to a `MsgStorageAppend`
are expected to be durable. The message assumes the role of the Entries,
HardState, and Snapshot fields in Ready.

`MsgStorageApply` carries committed entries to apply. The message assumes
the role of the CommittedEntries field in Ready.

Local messages each carry one or more response messages which should be
delivered after the corresponding storage write has been completed. These
responses may target the same node or may target other nodes. The storage
threads are not responsible for understanding the response messages, only
for delivering them to the correct target after performing the storage
write.

\## Design Considerations

- There must be no regression for existing users that do not enable `AsyncStorageWrites`.
  For instance, CommittedEntries must not wait on unstable entries to be stabilized in
  cases where a follower is given committed entries in a MsgApp.
- Asynchronous storage work should use a message passing interface, like the
  rest of this library.
- The Raft leader and followers should behave symmetrically. Both should be able
  to use asynchronous storage writes for log appends and entry application.
- The LocalAppendThread on a follower should be able to send MsgAppResp messages
  directly to the leader without passing back through the raft state machine
  handling loop.
- The `unstable` log should remain true to its name. It should hold entries
  until they are stable and should not rely on an intermediate reliable cache.
- Pseudo-targets should be assigned to messages that target the local storage
  systems to denote required ordering guarantees.
- Code should be maximally unified across `AsyncStorageWrites=false` and
  `AsyncStorageWrites=true`. `AsyncStorageWrites=false` should be a special case of
  `AsyncStorageWrites=true` where the library hides the possibility of asynchrony.
- It should be possible to apply snapshots asynchronously, even though a
  snapshot touches both the Raft log state and the state machine. The library
  should make this easy for users to handle by delaying all committed entries
  until after the snapshot has applied, so snapshot application can be handled
  by 1) flushing the apply thread, 2) sending the `MsgStorageAppend` that contains
  a snapshot to the `LocalAppendThread` to be applied.

\## Usage

When asynchronous storage writes is enabled, the responsibility of code using
the library is different from what is presented in raft/doc.go (which has been
updated to include a section about async storage writes). Users still read from
the Node.Ready() channel. However, they process the updates it contains in a
different manner. Users no longer consult the HardState, Entries, and Snapshot
fields (steps 1 and 3 in doc.go). They also no longer call Node.Advance() to
indicate that they have processed all entries in the Ready (step 4 in doc.go).
Instead, all local storage operations are also communicated through messages
present in the Ready.Message slice.

The local storage messages come in two flavors. The first flavor is log append
messages, which target a LocalAppendThread and carry Entries, HardState, and a
Snapshot. The second flavor is entry application messages, which target a
LocalApplyThread and carry CommittedEntries. Messages to the same target must be
reliably processed in order. Messages to different targets can be processed in
any order. Each local storage message carries a slice of response messages that
must be delivered after the corresponding storage write has been completed.

With Asynchronous Storage Writes enabled, the total state machine handling loop
will look something like this:

```go
for {
	select {
	case <-s.Ticker:
		n.Tick()
	case rd := <-s.Node.Ready():
		for _, m := range rd.Messages {
			switch m.To {
			case raft.LocalAppendThread:
				toAppend <- m
			case raft.LocalApplyThread:
				toApply <-m
			default:
				sendOverNetwork(m)
			}
		}
	case <-s.done:
		return
	}
}
```

Usage of Asynchronous Storage Writes will typically also contain a pair of
storage handler threads, one for log writes (append) and one for entry
application to the local state machine (apply). Those will look something like:

```go
// append thread
go func() {
	for {
		select {
		case m := <-toAppend:
			saveToStorage(m.State, m.Entries, m.Snapshot)
			send(m.Responses)
		case <-s.done:
			return
		}
	}
}()

// apply thread
go func() {
	for {
		select {
		case m := <-toApply:
			for _, entry := range m.CommittedEntries {
				process(entry)
				if entry.Type == raftpb.EntryConfChange {
					var cc raftpb.ConfChange
					cc.Unmarshal(entry.Data)
					s.Node.ApplyConfChange(cc)
				}
			}
			send(m.Responses)
		case <-s.done:
			return
		}
	}
}()
```

\## Compatibility

The library remains backwards compatible with existing users and the change does
not introduce any breaking changes. Users that do not set `AsyncStorageWrites`
to true in the `Config` struct will not notice a difference with this change.
This is despite the fact that the existing "synchronous storage writes"
interface was adapted to share a majority of the same code. For instance,
`Node.Advance` has been adapted to transparently acknowledge an asynchronous log
append attempt and an asynchronous state machine application attempt, internally
using the same message passing mechanism introduced in this change.

The change has no cross-version compatibility concerns. All changes are local to
a process and nodes using asynchronous storage writes appear to behave no
differently from the outside. Clusters are free to mix nodes running with and
without asynchronous storage writes.

\## Performance

The bulk of the performance evaluation of this functionality thus far has been
done with [rafttoy](https://github.com/nvanbenschoten/rafttoy), a benchmarking
harness developed to experiment with Raft proposal pipeline optimization. The
harness can be used to run single-node benchmarks or multi-node benchmarks. It
supports pluggable raft logs, storage engines, network transports, and pipeline
implementations.

To evaluate this change, we fixed the raft log (`etcd/wal`), storage engine
(`pebble`), and network transport (`grpc`). We then built (nvanbenschoten/rafttoy#3)
a pipeline implementation on top of the new asynchronous storage writes
functionality and compared it against two other pipeline implementations.

The three pipeline implementations we compared were:
- **basic** (P1): baseline stock raft usage, similar to the code in `doc.go`
- **parallel append + early ack** (P2): CockroachDB's current pipeline, which includes
  two significant variations to the basic pipeline. The first is that it sends
  MsgApp messages to followers before writing to local Raft log (see [commit](cockroachdb/cockroach@b67eb69)
  for explanation), allowing log appends to occur in parallel across replicas.
  The second is that it acknowledges committed log entries before applying them
  (see [commit](cockroachdb/cockroach@87aaea7)
  for explanation).
- **async append + async apply + early ack** (P3): A pipelining using asynchronous storage
  writes with a separate append thread and a separate apply thread. Also uses the same
  early acknowledgement optimization from above to ack committed entries before handing
  them to the apply thread.

All testing was performed on a 3 node AWS cluster of m5.4xlarge instances with
gp3 EBS volumes (16000 IOPS, 1GB/s throughput).

![Throughput vs latency of Raft proposal pipeline implementations](https://user-images.githubusercontent.com/5438456/197925200-11352c09-569b-460c-ae42-effbf407c4e5.svg)

The comparison demonstrates two different benefits of asynchronous storage
writes.

The first is that it reduces end-to-end latency of proposals by 20-25%. For
instance, when serving 16MB/s of write traffic, P1's average latency was 13.2ms,
P2's average latency was 7.3ms, and P3's average latency was 5.24ms. This is a
reduction in average latency of 28% from the optimized pipeline that does not
use asynchronous storage writes. This matches expectations outlined in
cockroachdb/cockroach#17500.

The second is that it increases the maximum throughput at saturation. This is
because asynchronous storage writes can improve batching for both log appends
and log application. In this experiment, we saw the average append batch size
under saturation increase from 928 to 1542, which is a similar ratio to the
increase in peak throughput. We see a similar difference for apply batch sizes.

There is more benchmarking to do. For instance, we'll need to thoroughly verify
that this change does not negatively impact the performance of users of this
library that do not use asynchronous storage writes.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 6, 2023
Fixes cockroachdb#17500.
Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that
we added to Raft in github.com/etcd-io/raft/pull/8.

\## Approach

The commit makes the minimal changes needed to integrate with async
storage writes and pull fsyncs out of the raft state machine loop. It
does not make an effort to extract the non-durable portion of raft log
writes or raft log application onto separate goroutine pools, as was
described in cockroachdb#17500. Those changes will also be impactful, but they're
non trivial and bump into a pipelining vs. batching trade-off, so they
are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new
`DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117.
The `handleRaftReady` state machine loop continues to initiate Raft log
writes, but it uses the Pebble API to offload waiting on durability to a
separate goroutine. This separate goroutine then sends the corresponding
`MsgStorageAppend`'s response messages where they need to go (locally
and/or to the Raft leader) when the fsync completes. The async storage
writes functionality in Raft makes this all safe.

\## Benchmark Results

The result of this change is reduced interference between Raft
proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark
results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average
and tail latency. However, it doesn't provide the throughput
improvements at the top end because log appends and state machine
application have not yet been extracted into separate goroutine pools,
which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline
has been optimized to reduce interference between Raft proposals.
This improves average and tail write latency at high concurrency.
@nvanbenschoten
Member Author

The core of this issue is going to be addressed by #94165. That PR integrates the raft library changes that we made in etcd-io/raft#8 to support asynchronous local storage writes (both raft log appends and state machine applications). The PR does so by pulling log append disk write syncs out of the raft state machine loop using the new DB.ApplyNoSyncWait pebble API. The benchmark results attached to that PR demonstrate the kind of latency improvements we had predicted earlier in this issue.

However, the PR does not make an effort to extract the non-durable portion of raft log writes or state machine application onto separate goroutine pools. Both of these changes could have additional benefits (faster state machine loop iteration => less interference between entries, larger append and apply batches => more efficient raft operations). I've opened #94853 and #94854 to track the remainder of this work.
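A rough sketch of the shape of that integration (not the actual #94165 code; helper names are illustrative and the Pebble signatures are taken from the commit message earlier in this thread):

```go
package raftsketch

import (
	"github.com/cockroachdb/pebble"
	"go.etcd.io/raft/v3/raftpb"
)

// handleStorageAppend starts the raft log append without blocking the state
// machine loop on durability, then hands the fsync wait to a goroutine that
// delivers the MsgStorageAppend response messages once the write is durable.
func handleStorageAppend(db *pebble.DB, batch *pebble.Batch,
	responses []raftpb.Message, deliver func(raftpb.Message)) error {
	// Returns once the batch is visible; durability is waited on below.
	if err := db.ApplyNoSyncWait(batch, pebble.Sync); err != nil {
		return err
	}
	go func() {
		defer batch.Close()
		if err := batch.SyncWait(); err != nil {
			panic(err) // real code would treat this as a fatal storage error
		}
		for _, m := range responses {
			deliver(m) // locally and/or to the Raft leader
		}
	}()
	return nil
}
```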

sumeerbhola added a commit to cockroachdb/pebble that referenced this issue Jan 11, 2023
ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues
the Batch to the WAL, adds to the memtable, and waits until the batch is
visible in the memtable, and then returns to the caller. The caller is
responsible for calling Batch.SyncWait to wait until the write to the
WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two
WaitGroups, so waiting for the visibility can happen separately from
waiting for the WAL write. Additionally, the channel used as a semaphore
for reserving space in the two lock-free queues is split into two channels,
since dequeueing from these queues can happen in arbitrary order.

Benchmarks indicate that the overhead of pushing and popping on an extra
channel is tolerable. Benchmarks were run on a macbook pro -- note these are
not doing an actual sync since they use io.Discard, and are only benchmarking
the commit pipeline.

Sync wait on master (old) vs this branch (new):

name                                               old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.09µs ± 6%    1.15µs ± 9%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.53µs ± 4%    1.54µs ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.54µs ± 1%    1.59µs ± 1%  +2.87%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.52µs ± 1%    1.55µs ± 1%  +2.43%  (p=0.008 n=5+5)

name                                               old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10    14.7MB/s ± 5%  13.9MB/s ±10%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    10.5MB/s ± 4%  10.4MB/s ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    10.4MB/s ± 1%  10.1MB/s ± 1%  -2.78%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    10.5MB/s ± 1%  10.3MB/s ± 1%  -2.35%  (p=0.008 n=5+5)

name                                               old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.37kB ± 0%    1.40kB ± 0%  +2.15%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.37kB ± 0%    1.40kB ± 0%  +2.34%  (p=0.008 n=5+5)

name                                               old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)

Sync wait on this branch (old) vs async wait on this branch (new):

name                            old time/op    new time/op    delta
CommitPipeline/parallel=1-10      1.15µs ± 9%    1.20µs ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10      1.54µs ± 2%    1.59µs ± 2%   +3.50%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.59µs ± 1%    1.58µs ± 1%     ~     (p=0.802 n=5+5)
CommitPipeline/parallel=8-10      1.55µs ± 1%    1.56µs ± 1%     ~     (p=0.452 n=5+5)

name                            old speed      new speed      delta
CommitPipeline/parallel=1-10    13.9MB/s ±10%  13.3MB/s ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10    10.4MB/s ± 2%  10.1MB/s ± 2%   -3.36%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10    10.1MB/s ± 1%  10.1MB/s ± 1%     ~     (p=0.786 n=5+5)
CommitPipeline/parallel=8-10    10.3MB/s ± 1%  10.3MB/s ± 1%     ~     (p=0.452 n=5+5)

name                            old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.651 n=5+5)
CommitPipeline/parallel=2-10      1.40kB ± 0%    1.39kB ± 0%   -0.21%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.706 n=5+5)
CommitPipeline/parallel=8-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.587 n=5+5)

name                            old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)
craig bot pushed a commit that referenced this issue Feb 3, 2023
94165: kv: integrate raft async storage writes r=nvanbenschoten a=nvanbenschoten

Fixes #17500.
Epic: CRDB-22644

This commit integrates with the `AsyncStorageWrites` functionality that we added to Raft in etcd-io/raft/pull/8. 

## Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.

With this change, asynchronous Raft log syncs are enabled by the new `DB.ApplyNoSyncWait` Pebble API introduced in cockroachdb/pebble/pull/2117.  The `handleRaftReady` state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding `MsgStorageAppend`'s response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.

## Benchmark Results

The result of this change is reduced interference between Raft proposals. As a result, it reduces end-to-end commit latency.

etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.

To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.

_NOTE: log scales on x and y axes_

![Throughput vs  average latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210719-bec842f6-1093-48cd-8be7-05a3d79c2a71.svg)

![Throughput vs  tail latency of write-only workload](https://user-images.githubusercontent.com/5438456/209210777-670a4d25-9516-41a2-b7e7-86b402004536.svg)

Async storage writes impacts latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

| Throughput (B/s) | Throughput (qps) | Avg. Latency Δ | p99 Latency Δ |
| ---------------- | ---------------- | -------------- | ------------- |
| 63  KB/s         | 1,000            | -10.5%         | -8.8%         |
| 125 KB/s         | 2,000            | -7.1%          | -10.4%        |
| 250 KB/s         | 4,000            | -20%           | -11.2%        |
| 500 KB/s         | 8,000            | -16.6%         | -25.3%        |
| 1 MB/s           | 16,000           | -30.8%         | -44.0%        |
| 2 MB/s           | 32,000           | -38.2%         | -30.9%        |
| 4 MB/s           | 64,000           | 5.9%           | 9.4%          |

### Other benchmark results
```bash
name                   old ops/s    new ops/s    delta
# 50% read, 50% update
ycsb/A/nodes=3          16.0k ± 5%   16.2k ± 4%     ~     (p=0.353 n=10+10)
ycsb/A/nodes=3/cpu=32   28.7k ± 5%   33.8k ± 2%  +17.57%  (p=0.000 n=9+9)
# 95% read, 5% update
ycsb/B/nodes=3          29.9k ± 3%   30.2k ± 3%     ~     (p=0.278 n=9+10)
ycsb/B/nodes=3/cpu=32    101k ± 1%    100k ± 3%     ~     (p=0.274 n=8+10)
# 100% read
ycsb/C/nodes=3          40.4k ± 3%   40.0k ± 3%     ~     (p=0.190 n=10+10)
ycsb/C/nodes=3/cpu=32    135k ± 1%    137k ± 1%   +0.87%  (p=0.011 n=9+9)
# 95% read, 5% insert
ycsb/D/nodes=3          33.6k ± 3%   33.8k ± 3%     ~     (p=0.315 n=10+10)
ycsb/D/nodes=3/cpu=32    108k ± 1%    106k ± 6%     ~     (p=0.739 n=10+10)
# 95% scan, 5% insert
ycsb/E/nodes=3          3.79k ± 1%   3.73k ± 1%   -1.42%  (p=0.000 n=9+9)
ycsb/E/nodes=3/cpu=32   6.31k ± 5%   6.48k ± 6%     ~     (p=0.123 n=10+10)
# 50% read, 50% read-modify-write
ycsb/F/nodes=3          7.68k ± 2%   7.99k ± 2%   +4.11%  (p=0.000 n=10+10)
ycsb/F/nodes=3/cpu=32   15.6k ± 4%   18.1k ± 3%  +16.14%  (p=0.000 n=8+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.01 ± 5%    5.95 ± 4%     ~     (p=0.460 n=10+10)
ycsb/A/nodes=3/cpu=32    5.01 ± 4%    4.25 ± 4%  -15.19%  (p=0.000 n=9+10)
ycsb/B/nodes=3           4.80 ± 0%    4.77 ± 4%     ~     (p=0.586 n=7+10)
ycsb/B/nodes=3/cpu=32    1.90 ± 0%    1.90 ± 0%     ~     (all equal)
ycsb/C/nodes=3           3.56 ± 2%    3.61 ± 3%     ~     (p=0.180 n=10+10)
ycsb/C/nodes=3/cpu=32    1.40 ± 0%    1.40 ± 0%     ~     (all equal)
ycsb/D/nodes=3           2.87 ± 2%    2.85 ± 2%     ~     (p=0.650 n=10+10)
ycsb/D/nodes=3/cpu=32    1.30 ± 0%    1.34 ± 4%     ~     (p=0.087 n=10+10)
ycsb/E/nodes=3           25.3 ± 0%    25.7 ± 1%   +1.38%  (p=0.000 n=8+8)
ycsb/E/nodes=3/cpu=32    22.9 ± 5%    22.2 ± 6%     ~     (p=0.109 n=10+10)
ycsb/F/nodes=3           12.5 ± 2%    12.0 ± 1%   -3.72%  (p=0.000 n=10+9)
ycsb/F/nodes=3/cpu=32    9.27 ± 4%    7.98 ± 3%  -13.96%  (p=0.000 n=8+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           45.7 ±15%    35.7 ± 6%  -21.90%  (p=0.000 n=10+8)
ycsb/A/nodes=3/cpu=32    67.6 ±13%    55.3 ± 5%  -18.10%  (p=0.000 n=9+10)
ycsb/B/nodes=3           30.5 ±24%    29.4 ± 7%     ~     (p=0.589 n=10+10)
ycsb/B/nodes=3/cpu=32    12.8 ± 2%    13.3 ± 7%     ~     (p=0.052 n=10+8)
ycsb/C/nodes=3           14.0 ± 3%    14.2 ± 0%     ~     (p=0.294 n=10+8)
ycsb/C/nodes=3/cpu=32    5.80 ± 0%    5.70 ± 5%     ~     (p=0.233 n=7+10)
ycsb/D/nodes=3           12.4 ± 2%    11.7 ± 3%   -5.32%  (p=0.001 n=10+10)
ycsb/D/nodes=3/cpu=32    6.30 ± 0%    5.96 ± 6%   -5.40%  (p=0.001 n=10+10)
ycsb/E/nodes=3           81.0 ± 4%    83.9 ± 0%   +3.63%  (p=0.012 n=10+7)
ycsb/E/nodes=3/cpu=32     139 ±19%     119 ±12%  -14.46%  (p=0.021 n=10+10)
ycsb/F/nodes=3            122 ±17%     103 ±10%  -15.48%  (p=0.002 n=10+8)
ycsb/F/nodes=3/cpu=32     146 ±20%     133 ± 7%   -8.89%  (p=0.025 n=10+10)
```

The way to interpret these results is that async raft storage writes reduce latency and, as a result of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU-saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).

----

Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.

96458: sql: fixes statement contention count metric r=j82w a=j82w

Fixes a bug introduced in #94750 where the metric was counting transactions
that hit contention events instead of counting statements.

closes: #96429

Release note: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>
@craig craig bot closed this as completed in 702ff6f Feb 3, 2023