See cockroachdb#17500.

This commit polishes and flushes a prototype I've had lying around that
demonstrates the async Raft log appends component of cockroachdb#17500. I'm not actively
planning to productionize this, but it sounds like we may work on this project
in v23.1, so this prototype might help. It also demonstrates the kind of
performance wins we can expect to see on write-heavy workloads. Until now,
we had only [demonstrated the potential speedup](cockroachdb#17500 (comment))
in a simulated environment with [rafttoy](https://github.com/nvanbenschoten/rafttoy).

Half of the change here is to `etcd/raft` itself, which needs to be adapted to support
asynchronous log writes. These changes are presented in nvanbenschoten/etcd@1d1fa32.

The other half of the change is extracting a Raft log writer component that
handles the process of asynchronously appending to a collection of Raft logs and
notifying individual replicas about the eventual durability of these writes.
This component is pretty basic and should probably be entirely rewritten, but it
gets the job done for the prototype.

The Raft log writer reveals an interesting dynamic: concurrency at this level
actually hurts performance because it leads to concurrent calls to sync Pebble's
WAL. Concurrent callers are less performant than a single caller because Pebble
only exposes a synchronous Sync API and coalesces all Sync requests onto a
single thread. An async Pebble Sync API would be valuable here. See the comment
in `NewWriter` for more details.

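
To make the shape of such a component concrete, here is a minimal sketch of a log writer that funnels appends from many replicas through one goroutine, issues a single sync per batch, and then notifies each replica of its durable index. It is not the prototype's code: every name in it (`rangeID`, `logStore`, `logWriter`, `newLogWriter`, `countingStore`) is hypothetical, and the store is a fake that only counts calls rather than talking to Pebble.

```go
package main

import "fmt"

// All names here (rangeID, logStore, logWriter, ...) are hypothetical and
// exist only for this sketch; they are not the prototype's API.
type rangeID int

type appendReq struct {
	id        rangeID
	lastIndex uint64 // index of the last entry in this append
}

// logStore stands in for the storage engine shared by all Raft logs.
type logStore interface {
	Append(id rangeID, lastIndex uint64) // buffered, not yet durable
	Sync()                               // blocks until prior appends are durable
}

// logWriter funnels appends from many replicas through one goroutine so
// that only a single caller ever syncs the store.
type logWriter struct {
	reqs      chan appendReq
	store     logStore
	onDurable func(id rangeID, index uint64)
	done      chan struct{}
}

func newLogWriter(store logStore, onDurable func(rangeID, uint64)) *logWriter {
	w := &logWriter{
		reqs:      make(chan appendReq, 128),
		store:     store,
		onDurable: onDurable,
		done:      make(chan struct{}),
	}
	go w.run()
	return w
}

// Append enqueues the write and returns without waiting for the disk.
func (w *logWriter) Append(id rangeID, lastIndex uint64) {
	w.reqs <- appendReq{id: id, lastIndex: lastIndex}
}

// Close flushes queued appends and waits for the writer goroutine to exit.
func (w *logWriter) Close() {
	close(w.reqs)
	<-w.done
}

func (w *logWriter) run() {
	defer close(w.done)
	for req := range w.reqs {
		batch := map[rangeID]uint64{}
		w.apply(batch, req)
		// Fold in whatever else is already queued so one sync covers it all.
		for more := true; more; {
			select {
			case next, ok := <-w.reqs:
				if ok {
					w.apply(batch, next)
				}
				more = ok
			default:
				more = false
			}
		}
		w.store.Sync()
		for id, index := range batch {
			w.onDurable(id, index) // tell each replica how far it is durable
		}
	}
}

// apply writes one request to the store and records the highest index seen
// for its range in the current batch.
func (w *logWriter) apply(batch map[rangeID]uint64, req appendReq) {
	w.store.Append(req.id, req.lastIndex)
	if req.lastIndex > batch[req.id] {
		batch[req.id] = req.lastIndex
	}
}

// countingStore is a fake store that only counts calls.
type countingStore struct{ appends, syncs int }

func (s *countingStore) Append(rangeID, uint64) { s.appends++ }
func (s *countingStore) Sync()                  { s.syncs++ }

func main() {
	store := &countingStore{}
	durable := map[rangeID]uint64{}
	w := newLogWriter(store, func(id rangeID, index uint64) { durable[id] = index })
	for i := uint64(1); i <= 3; i++ {
		w.Append(1, i)
		w.Append(2, i)
	}
	w.Close()
	fmt.Println(durable[1], durable[2], store.appends, store.syncs)
}
```

The number of syncs depends on goroutine scheduling, but it never exceeds the number of appends, and each replica is only told about an index after the sync that covers it has returned. A real implementation would also need error handling and backpressure, which this sketch leaves out.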
### Benchmarks
```
name                          old ops/s    new ops/s    delta
kv0/enc=false/nodes=3/cpu=32   36.4k ± 5%   46.5k ± 5%  +27.64%  (p=0.000 n=10+10)

name                          old avg(ms)  new avg(ms)  delta
kv0/enc=false/nodes=3/cpu=32    5.26 ± 3%    4.14 ± 6%  -21.33%  (p=0.000 n=8+10)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    10.9 ± 8%     9.1 ±10%  -15.94%  (p=0.000 n=10+10)
```
These are compelling results. I haven't pushed on this enough to know whether
there's actually a throughput win here, or whether the fixed concurrency and
reduced average latency are just making it look like there is. `kv0bench` should
help answer that question.
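
One rough way to frame that question is Little's law: in a closed-loop workload with a fixed number of workers, throughput equals concurrency divided by average latency, so a latency drop alone produces a proportional throughput gain. The sketch below applies that relation to the figures in the table above. It assumes the load generator is closed-loop, and the function name is illustrative only.

```go
package main

import "fmt"

// impliedConcurrency applies Little's law (in-flight = throughput * latency)
// to estimate the average number of requests in flight.
func impliedConcurrency(opsPerSec, avgLatencyMs float64) float64 {
	return opsPerSec * avgLatencyMs / 1000
}

func main() {
	// Throughput and average latency copied from the benchmark table.
	fmt.Printf("old: %.1f in flight\n", impliedConcurrency(36400, 5.26))
	fmt.Printf("new: %.1f in flight\n", impliedConcurrency(46500, 4.14))
}
```

Both runs imply roughly the same number of in-flight requests (about 191 and 193), which is what a fixed-concurrency workload would show if the extra ops/s came purely from lower latency. That does not rule out a higher peak throughput; it only means these numbers cannot distinguish the two cases.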