
kv: integrate raft async storage writes #94165

Merged

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Dec 22, 2022

Fixes #17500.
Epic: CRDB-22644

This commit integrates with the AsyncStorageWrites functionality that we added to Raft in etcd-io/raft/pull/8.

Approach

The commit makes the minimal changes needed to integrate with async storage writes and pull fsyncs out of the raft state machine loop. It does not make an effort to extract the non-durable portion of raft log writes or raft log application onto separate goroutine pools, as was described in #17500. Those changes will also be impactful, but they're non-trivial and bump into a pipelining vs. batching trade-off, so they are left as future work items. See #94853 and #94854.

With this change, asynchronous Raft log syncs are enabled by the new DB.ApplyNoSyncWait Pebble API introduced in cockroachdb/pebble/pull/2117. The handleRaftReady state machine loop continues to initiate Raft log writes, but it uses the Pebble API to offload waiting on durability to a separate goroutine. This separate goroutine then sends the corresponding MsgStorageAppend's response messages where they need to go (locally and/or to the Raft leader) when the fsync completes. The async storage writes functionality in Raft makes this all safe.
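
As an illustration of the handoff pattern described above, here is a minimal, self-contained sketch. It is not the actual CockroachDB or Pebble API: the `Batch` interface, `appendAndHandOff`, and `onDurable` are invented for the example, and the real integration multiplexes the durability waits onto a single waiter goroutine rather than spawning one per append.

```go
// Minimal sketch of the non-blocking sync handoff, assuming a simplified
// Batch interface; not the actual CockroachDB or Pebble API.
package main

import "fmt"

// Batch models the two-step commit: start the write without waiting for the
// fsync, then wait for durability separately.
type Batch interface {
	CommitNoSyncWait() error // initiate the write and the disk sync
	SyncWait() error         // block until the earlier commit is durable
}

// appendAndHandOff is roughly what the raft ready loop does: it initiates the
// log append and returns immediately, leaving the durability wait (and the
// delivery of the MsgStorageAppend responses) to another goroutine.
func appendAndHandOff(b Batch, onDurable func()) error {
	if err := b.CommitNoSyncWait(); err != nil {
		return err
	}
	go func() {
		if err := b.SyncWait(); err != nil {
			panic(err) // the real code treats a sync failure as fatal
		}
		onDurable() // send MsgStorageAppend responses locally and/or to the leader
	}()
	return nil
}

type fakeBatch struct{}

func (fakeBatch) CommitNoSyncWait() error { return nil }
func (fakeBatch) SyncWait() error         { return nil }

func main() {
	done := make(chan struct{})
	_ = appendAndHandOff(fakeBatch{}, func() {
		fmt.Println("log append durable; responses delivered")
		close(done)
	})
	<-done
}
```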

Benchmark Results

This change reduces interference between Raft proposals and, as a result, reduces end-to-end commit latency.

etcd-io/raft/pull/8 presented a collection of benchmark results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average and tail latency. However, it doesn't provide the throughput improvements at the top end because log appends and state machine application have not yet been extracted into separate goroutine pools, which would facilitate an increased opportunity for batching.

To visualize the impact on latency, consider the following test. The experiment uses a 3-node GCP cluster with n2-standard-32 instances spread across three availability zones. It runs kv0 (write-only) against the cluster with 64-byte values. It then ramps up concurrency to compare throughput vs. average and tail latency.

NOTE: log scales on x and y axes

[Figure: throughput vs. average latency of the write-only workload]

[Figure: throughput vs. tail latency of the write-only workload]

The change affects latency by different amounts at different throughputs, ranging from an improvement of 20% to 40% when the system is "well utilized". However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

| Throughput (B/s) | Throughput (qps) | Avg. Latency Δ | p99 Latency Δ |
|------------------|------------------|----------------|---------------|
| 63 KB/s          | 1,000            | -10.5%         | -8.8%         |
| 125 KB/s         | 2,000            | -7.1%          | -10.4%        |
| 250 KB/s         | 4,000            | -20%           | -11.2%        |
| 500 KB/s         | 8,000            | -16.6%         | -25.3%        |
| 1 MB/s           | 16,000           | -30.8%         | -44.0%        |
| 2 MB/s           | 32,000           | -38.2%         | -30.9%        |
| 4 MB/s           | 64,000           | 5.9%           | 9.4%          |

Other benchmark results

name                   old ops/s    new ops/s    delta
# 50% read, 50% update
ycsb/A/nodes=3          16.0k ± 5%   16.2k ± 4%     ~     (p=0.353 n=10+10)
ycsb/A/nodes=3/cpu=32   28.7k ± 5%   33.8k ± 2%  +17.57%  (p=0.000 n=9+9)
# 95% read, 5% update
ycsb/B/nodes=3          29.9k ± 3%   30.2k ± 3%     ~     (p=0.278 n=9+10)
ycsb/B/nodes=3/cpu=32    101k ± 1%    100k ± 3%     ~     (p=0.274 n=8+10)
# 100% read
ycsb/C/nodes=3          40.4k ± 3%   40.0k ± 3%     ~     (p=0.190 n=10+10)
ycsb/C/nodes=3/cpu=32    135k ± 1%    137k ± 1%   +0.87%  (p=0.011 n=9+9)
# 95% read, 5% insert
ycsb/D/nodes=3          33.6k ± 3%   33.8k ± 3%     ~     (p=0.315 n=10+10)
ycsb/D/nodes=3/cpu=32    108k ± 1%    106k ± 6%     ~     (p=0.739 n=10+10)
# 95% scan, 5% insert
ycsb/E/nodes=3          3.79k ± 1%   3.73k ± 1%   -1.42%  (p=0.000 n=9+9)
ycsb/E/nodes=3/cpu=32   6.31k ± 5%   6.48k ± 6%     ~     (p=0.123 n=10+10)
# 50% read, 50% read-modify-write
ycsb/F/nodes=3          7.68k ± 2%   7.99k ± 2%   +4.11%  (p=0.000 n=10+10)
ycsb/F/nodes=3/cpu=32   15.6k ± 4%   18.1k ± 3%  +16.14%  (p=0.000 n=8+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.01 ± 5%    5.95 ± 4%     ~     (p=0.460 n=10+10)
ycsb/A/nodes=3/cpu=32    5.01 ± 4%    4.25 ± 4%  -15.19%  (p=0.000 n=9+10)
ycsb/B/nodes=3           4.80 ± 0%    4.77 ± 4%     ~     (p=0.586 n=7+10)
ycsb/B/nodes=3/cpu=32    1.90 ± 0%    1.90 ± 0%     ~     (all equal)
ycsb/C/nodes=3           3.56 ± 2%    3.61 ± 3%     ~     (p=0.180 n=10+10)
ycsb/C/nodes=3/cpu=32    1.40 ± 0%    1.40 ± 0%     ~     (all equal)
ycsb/D/nodes=3           2.87 ± 2%    2.85 ± 2%     ~     (p=0.650 n=10+10)
ycsb/D/nodes=3/cpu=32    1.30 ± 0%    1.34 ± 4%     ~     (p=0.087 n=10+10)
ycsb/E/nodes=3           25.3 ± 0%    25.7 ± 1%   +1.38%  (p=0.000 n=8+8)
ycsb/E/nodes=3/cpu=32    22.9 ± 5%    22.2 ± 6%     ~     (p=0.109 n=10+10)
ycsb/F/nodes=3           12.5 ± 2%    12.0 ± 1%   -3.72%  (p=0.000 n=10+9)
ycsb/F/nodes=3/cpu=32    9.27 ± 4%    7.98 ± 3%  -13.96%  (p=0.000 n=8+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           45.7 ±15%    35.7 ± 6%  -21.90%  (p=0.000 n=10+8)
ycsb/A/nodes=3/cpu=32    67.6 ±13%    55.3 ± 5%  -18.10%  (p=0.000 n=9+10)
ycsb/B/nodes=3           30.5 ±24%    29.4 ± 7%     ~     (p=0.589 n=10+10)
ycsb/B/nodes=3/cpu=32    12.8 ± 2%    13.3 ± 7%     ~     (p=0.052 n=10+8)
ycsb/C/nodes=3           14.0 ± 3%    14.2 ± 0%     ~     (p=0.294 n=10+8)
ycsb/C/nodes=3/cpu=32    5.80 ± 0%    5.70 ± 5%     ~     (p=0.233 n=7+10)
ycsb/D/nodes=3           12.4 ± 2%    11.7 ± 3%   -5.32%  (p=0.001 n=10+10)
ycsb/D/nodes=3/cpu=32    6.30 ± 0%    5.96 ± 6%   -5.40%  (p=0.001 n=10+10)
ycsb/E/nodes=3           81.0 ± 4%    83.9 ± 0%   +3.63%  (p=0.012 n=10+7)
ycsb/E/nodes=3/cpu=32     139 ±19%     119 ±12%  -14.46%  (p=0.021 n=10+10)
ycsb/F/nodes=3            122 ±17%     103 ±10%  -15.48%  (p=0.002 n=10+8)
ycsb/F/nodes=3/cpu=32     146 ±20%     133 ± 7%   -8.89%  (p=0.025 n=10+10)

The way to interpret these results is that async raft storage writes reduce latency and, as a result of the closed-loop nature of the workload, also increase throughput for the YCSB variants that perform writes and aren't already CPU-saturated. Variants that are read-only are unaffected. Variants that are CPU-saturated do not benefit from the change because they are already bottlenecked on CPU resources and cannot push any more load (see above).


Release note (performance improvement): The Raft proposal pipeline has been optimized to reduce interference between Raft proposals. This improves average and tail write latency at high concurrency.

@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/asyncRaftLogSimple branch 3 times, most recently from 69c0ef7 to 9faec10 on December 23, 2022 18:13
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/asyncRaftLogSimple branch from 9faec10 to 7f77d1a on January 6, 2023 18:00
@nvanbenschoten nvanbenschoten marked this pull request as ready for review January 7, 2023 02:55
@nvanbenschoten nvanbenschoten requested review from a team as code owners January 7, 2023 02:55
@nvanbenschoten nvanbenschoten requested a review from a team January 7, 2023 02:55
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner January 7, 2023 02:55
@nvanbenschoten
Member Author

@tbg @sumeerbhola this is now ready for review! There's one outstanding issue around sideloaded SST cleanup, but we can address that concurrently with the rest of the review.

@sumeerbhola this PR is going to require cockroachdb/pebble#2117. Were you planning to push that along in the coming weeks?

@tbg
Member

tbg commented Jan 10, 2023

However, it increases latency by 5% to 10% when the system is over-saturated and CPU bound, presumably because of the extra goroutine handoff to the log append fsync callback, which will be impacted by elevated goroutine scheduling latency.

Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead? One thing I hope can fall out of this change is more graceful degradation under I/O overload since we won't be blocking the raft handling loop for extended periods of time.

Resolved review thread: pkg/kv/kvserver/replica.go (outdated)
Member

@tbg tbg left a comment


Looks good! Need to better understand the leading raft entry cache issue and the sideloading truncation.

Resolved review threads:
pkg/kv/kvserver/replica.go
pkg/kv/kvserver/logstore/sync_waiter.go (4 threads, 1 outdated)
pkg/kv/kvserver/replica_raft.go (3 threads, 2 outdated)
pkg/storage/engine.go (outdated)
pkg/storage/pebble_batch.go (outdated)
Collaborator

@sumeerbhola sumeerbhola left a comment


couple of drive-by nits

Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 3 of 10 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/logstore/logstore.go line 130 at r3 (raw file):

// depending on the kv.raft_log.non_blocking_synchronization.enabled cluster
// setting. Either way, the effects of the log append will be immediately
// visible readers of the Engine.

nit: visible to readers ...


pkg/storage/engine.go line 965 at r1 (raw file):

	Commit(sync bool) error
	// CommitNoSyncWait atomically applies any batched updates to the underlying
	// engine and initiates a synchronous disk write, but does not wait for that

nit: initiates a disk write ...
(presumably it doesn't matter how that disk write is implemented -- async or sync)

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/storage/engine.go line 965 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: initiates a disk write ...
(presumably it doesn't matter how that disk write is implemented -- async or sync)

Done.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 25, 2023
Extracted from cockroachdb#94165.

Picks up github.com/cockroachdb/pebble/pull/2117.

Release note: None
Epic: None
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs!

Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead?

Do you have any particular cases or test setups in mind? I'm happy to run more experiments so we can better understand the impact of this change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/replica_raft.go line 818 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

state.LastIndex can now contain non-durable entries? Seems sus

Good point. As of etcd-io/raft#8, we no longer allow application of unstable entries. Instead, we wait for them to stabilize locally before handing them in a Ready to be applied. I removed the handling of this case in a newly added commit.
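
A toy way to express that invariant (illustrative only, not etcd/raft code): the entries handed out for application are capped at the locally durable index.

```go
// Toy model of the invariant above: cap the entries handed out for
// application at the locally durable (stable) index. Not etcd/raft code.
package main

import "fmt"

func entriesToApply(applied, committed, durable uint64) (lo, hi uint64) {
	hi = committed
	if durable < hi {
		hi = durable // wait for local stability before applying further
	}
	return applied + 1, hi
}

func main() {
	lo, hi := entriesToApply(5, 10, 8)
	fmt.Println(lo, hi) // 6 8: entries 9-10 wait until they are durable locally
}
```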


pkg/kv/kvserver/replica_raft.go line 949 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what's the intended purpose of this message? Should we print it unconditionally on l.947?

The purpose is to add a trace event to the context attached to each committed entry, before applying. If we don't have any entries, then the call is a no-op. However, it does RLock the Replica either way, so we might as well make it conditional on there actually being entries to apply.


pkg/kv/kvserver/replica_raft.go line 1459 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

These are the append thread (and maybe later apply) responding to async work, right? A little comment would be helpful.

Done.


pkg/kv/kvserver/replica_raft.go line 1575 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

and?

Done.


pkg/kv/kvserver/logstore/logstore.go line 130 at r3 (raw file):

Previously, sumeerbhola wrote…

nit: visible to readers ...

Done.


pkg/kv/kvserver/logstore/logstore.go line 223 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe this dance between things we must sync and refusing to sync due to the disableSyncRaftLog setting could be a bit clearer if we renamed mustSync -> wantsSync and sync -> willSync. Just a thought, take it or leave it.

Done.


pkg/kv/kvserver/logstore/logstore.go line 256 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

shouldn't this be mustSync? We are only recording to RaftLogCommitLatency if mustSync and it makes sense for those two to be aligned. I think even when the unsafe setting is active, then we should be recording to this histogram and "pretend" we've synced whenever it was requested. I.e. have the code be all the same, except we don't actually sync.

Done. I had come to a similar conclusion but decided not to change the behavior. But happy to switch.


pkg/kv/kvserver/logstore/logstore.go line 262 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe we should detect early in this method whether we're replacing a log tail, and if so disable the pipelining. It's not a common code path so not worth optimizing for.

Thanks, this was a good suggestion.
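
A sketch of the resulting decision, with invented parameter names (the real check lives in logstore.go; the wantsSync/willSync naming follows the earlier review suggestion):

```go
// Illustrative helper, not the real logstore code: decide whether a raft log
// append may take the non-blocking sync path.
package main

import "fmt"

func useNonBlockingSync(firstNewIndex, prevLastIndex uint64, wantsSync, syncDisabled, nonBlockingEnabled bool) bool {
	// Replacing an existing log tail is rare, so don't bother pipelining it.
	overwriting := firstNewIndex <= prevLastIndex
	willSync := wantsSync && !syncDisabled
	return willSync && nonBlockingEnabled && !overwriting
}

func main() {
	// Appending entry 11 after last index 10: eligible for non-blocking sync.
	fmt.Println(useNonBlockingSync(11, 10, true, false, true)) // true
	// Appending entry 9 replaces an existing tail: fall back to a blocking sync.
	fmt.Println(useNonBlockingSync(9, 10, true, false, true)) // false
}
```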


pkg/kv/kvserver/logstore/logstore.go line 285 at r3 (raw file):
I added some words here, on replicaRaftStorage.Entries, and on r.mu.lastIndex. It is ok for the raft entry cache to lead the durable raft log, as long as it does not lead the pebble engine.

etcd/raft maintains its own tracking of entry durability by splitting its log into an unstable portion for entries that are not known to be durable and a stable portion for entries that are known to be durable. Its boundary of "stability" is defined by raftLog.unstable.offset, and it advances this boundary when entries become durable. It then overlays its unstable log over the raft.Storage interface, only ever querying raft.Storage below the unstable offset. See raftLog.firstIndex and raftLog.slice. So returning unstable entries from raft.Storage is allowed.

This is more than just happenstance. It's important that etcd/raft makes decisions by looking at the unstable log overlaid on the raft.Storage. That's because any entry beyond unstable.offset could be uncommitted, and therefore could be replaced by an in-progress append. Before async storage writes this was less common, but the etcd/raft library still allowed Stepping the raft state machine between a Ready() and an Advance() (though crdb didn't use this flexibility). For etcd/raft to properly synchronize log appends with other state transitions (e.g. campaigning), it needed to make sure that the other state transitions always saw the latest version of the local log, unstable or not.
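
As a simplified model of that overlay (illustrative only; the types and method below are not the actual etcd/raft code):

```go
// Simplified model of the unstable-log overlay: the unstable entries sit on
// top of raft.Storage, and Storage is only consulted below unstable.offset.
package main

import "fmt"

type entry struct {
	Index uint64
	Data  string
}

type raftLogModel struct {
	stable   []entry // durably stored entries, indexed from 1 (stands in for raft.Storage)
	unstable []entry // entries not yet known to be durable
	offset   uint64  // index of the first unstable entry
}

// entryAt returns the entry at index i, preferring the unstable overlay for
// i >= offset. Any entry at or beyond offset may still be replaced by an
// in-progress append, so decisions must be made against this overlaid view.
func (l *raftLogModel) entryAt(i uint64) entry {
	if i >= l.offset {
		return l.unstable[i-l.offset]
	}
	return l.stable[i-1]
}

func main() {
	l := &raftLogModel{
		stable:   []entry{{1, "a"}, {2, "b"}},
		unstable: []entry{{3, "c"}}, // may be replaced before it becomes stable
		offset:   3,
	}
	for i := uint64(1); i <= 3; i++ {
		fmt.Println(l.entryAt(i))
	}
}
```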

The raft entry cache backs the raft.Storage implementation and raft assumes everything in it is durable.

Raft shouldn't be assuming that everything in it is durable. Even before this change, that's not the case, because replicaRaftStorage.Entries reads with StandardDurability instead of GuaranteedDurability. So it will read out of LSM memtables even before the corresponding LSM writes are stable. But again, this should be ok.

Couldn't this lead to situations in which a replica campaigns using a log that isn't actually durable yet, and may distribute log entries not actually stable yet believing them to be stable?

It can lead to this situation, but is this a problem? If the candidate is elected and then crashes before the suffix of its log (that it campaigned on) is stable, then it cannot have voted for itself in the election or have voted for any new log entries after getting raft leadership. So if it was able to get any log entries committed to the replicated log, a quorum of other replicas must have durably appended them to their log.


pkg/kv/kvserver/logstore/sync_waiter.go line 34 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mention the invariant that callbacks are notified in completion order, i.e. if a callback completes it is true that all previous operations have also succeeded.

Done, though I did not mention that the one callback implies the completion of all previous operations because users of this loop should not need to think about the relationship between different syncs. I think it's also implied by the invariant you suggested below.


pkg/kv/kvserver/logstore/sync_waiter.go line 35 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Invariant: callbacks are invoked in order and without concurrency.

Done.


pkg/kv/kvserver/logstore/sync_waiter.go line 78 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, to avoid fsyncgate we'd have to hold on to a local var err error and make sure that if one caller gets an error, they all get an error. (i.e. basically poisoning the loop). So maybe fataling is fine; if the disk is broken the disk is broken.

Yeah, I think Fatal-ing is the safest bet to handle errors here. In addition to what you said, errors returned from SyncWait should only be due to invariant violations and never due to unrecoverable Pebble states. See the discussion in https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1674678349690899 and the TODOs added in #95862.


pkg/kv/kvserver/logstore/sync_waiter.go line 90 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// It may never be called in case the stopper stops.

Done.


pkg/kv/kvserver/logstore/sync_waiter.go line 98 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

do you want to pool these?

I don't think there's a need. They shouldn't allocate because the channel is of type chan syncBatch. Notice that the contained type is concrete, so there's no indirection or boxing.

To demonstrate, I added BenchmarkSyncWaiterLoop, which allocates 0 times per iteration:

name               time/op
SyncWaiterLoop-10  375ns ± 4%

name               alloc/op
SyncWaiterLoop-10  0.00B

name               allocs/op
SyncWaiterLoop-10   0.00

pkg/kv/kvserver/logstore/sync_waiter.go line 103 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This needs a log.EveryN. Also, is this really a warning? Actual question, not a rhetorical one.

Done. It is a warning because it's unexpected. We size the enqueue channel with enough capacity to hold as many in-progress sync operations as Pebble allows (pebble/record.SyncConcurrency). I left a comment about this.
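
Pulling the points from this and the preceding sync_waiter.go threads together, here is a minimal sketch of the waiter-loop pattern. It is an illustration under assumptions, not the real pkg/kv/kvserver/logstore/sync_waiter.go, which additionally integrates with the stopper and rate-limits the warning with log.EveryN, among other differences.

```go
// Sketch of the sync waiter loop discussed above. The concrete syncBatch
// element type keeps enqueues allocation-free, the channel capacity mirrors
// the number of in-progress syncs the storage engine allows, and a full
// queue is logged as a warning because it is unexpected.
package main

import "log"

type syncWaiter interface {
	SyncWait() error // block until the associated log append is durable
	Close()
}

type syncBatch struct {
	wg syncWaiter
	cb func() // invoked once the append is durably synced
}

type syncWaiterLoop struct {
	q chan syncBatch
}

func newSyncWaiterLoop(capacity int) *syncWaiterLoop {
	return &syncWaiterLoop{q: make(chan syncBatch, capacity)}
}

// start runs the loop. Callbacks are invoked in order and without concurrency.
func (w *syncWaiterLoop) start() {
	go func() {
		for b := range w.q {
			if err := b.wg.SyncWait(); err != nil {
				log.Fatalf("error waiting for log sync: %v", err)
			}
			b.cb()
			b.wg.Close()
		}
	}()
}

func (w *syncWaiterLoop) enqueue(wg syncWaiter, cb func()) {
	select {
	case w.q <- syncBatch{wg, cb}:
	default:
		// Unexpected: the queue was sized so that it should never fill up.
		log.Printf("warning: sync waiter queue is full; blocking")
		w.q <- syncBatch{wg, cb}
	}
}

type fakeSync struct{ done chan struct{} }

func (f fakeSync) SyncWait() error { <-f.done; return nil }
func (f fakeSync) Close()          {}

func main() {
	w := newSyncWaiterLoop(512) // e.g. pebble/record.SyncConcurrency in the real code
	w.start()
	f := fakeSync{done: make(chan struct{})}
	acked := make(chan struct{})
	w.enqueue(f, func() { close(acked) })
	close(f.done) // simulate the fsync completing
	<-acked
}
```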


pkg/kv/kvserver/logstore/sync_waiter_test.go line 48 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Wouldn't it not block anyway? You have a q channel that presumably has some capacity, so maybe this should be done 2*cap(chan) times?

Done.

@nvanbenschoten
Copy link
Member Author

I added two new commits that introduce randomness into message delivery order in tests to ensure that re-ordering of local raft message responses is tolerated and handled properly.

Member

@tbg tbg left a comment


New commits LGTM, I still need to give parts of the main commit a second read since it's been a while.
Did you make any changes in it other than the rebase?

Reviewed 2 of 10 files at r3, 17 of 17 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica.go line 405 at r4 (raw file):

		// will have to be retrieved from the Raft log entry. Use the
		// invalidLastTerm constant for this case.
		lastIndex, lastTerm uint64

This is a semantic change, right? This was previously durable. You've convinced me (in another comment thread) that it's fine that this is getting picked up by replicaRaftStorage. And when we read pebble, we'll always see the lastIndex entry with corresponding lastTerm as well, so I hope any other uses of this field would also be ok. I sleuthed a bit and there's

  • raft log queue (should be ok since it lowers to committed index anyway)
  • quiescing (ok)

It might still be worth renaming these inflightLastIndex and inflightLastTerm (not in love with inflight, maybe you can think of something better) because I think there is the risk of someone using them like they were durable. I concede that this is likely an imaginary fear but until we encapsulate these variables away from the "general public" I'd like to nurse that fear a little bit.


pkg/kv/kvserver/replica_raft.go line 1617 at r7 (raw file):

		}
		// NB: we can reset messages in the localMsgs.recycled slice without holding
		// the localMsgs mutex because no-one ever writes to localMsgs.recycled abd

if you catch another CI cycle anyway, abd

Fixes cockroachdb#17500.
Waiting on github.com/cockroachdb/pebble/pull/2117.

This commit integrates with the `AsyncStorageWrites` functionality that
we added to Raft in github.com/etcd-io/raft/pull/8.

\## Approach

The commit makes the minimal changes needed to integrate with async
storage writes and pull fsyncs out of the raft state machine loop. It
does not make an effort to extract the non-durable portion of raft log
writes or raft log application onto separate goroutine pools, as was
described in cockroachdb#17500. Those changes will also be impactful, but they're
non trivial and bump into a pipelining vs. batching trade-off, so they
are left as future work items (TODO(nvanbenschoten): open new issues).

With this change, asynchronous Raft log syncs are enabled by the new
`DB.ApplyNoSyncWait` Pebble API introduced in github.com/cockroachdb/pebble/pull/2117.
The `handleRaftReady` state machine loop continues to initiate Raft log
writes, but it uses the Pebble API to offload waiting on durability to a
separate goroutine. This separate goroutine then sends the corresponding
`MsgStorageAppend`'s response messages where they need to go (locally
and/or to the Raft leader) when the fsync completes. The async storage
writes functionality in Raft makes this all safe.

\## Benchmark Results

The result of this change is reduced interference between Raft
proposals. As a result, it reduces end-to-end commit latency.

github.com/etcd-io/raft/pull/8 presented a collection of benchmark
results captured from integrating async storage writes with rafttoy.

When integrated into CockroachDB, we see similar improvements to average
and tail latency. However, it doesn't provide the throughput
improvements at the top end because log appends and state machine
application have not yet been extracted into separate goroutine pools,
which would facilitate increased opportunity for batching.

TODO: add images

----

Release note (performance improvement): The Raft proposal pipeline
has been optimized to reduce interference between Raft proposals.
This improves average and tail write latency at high concurrency.
…previous LastIndex

This commit removes logic in `apply.Task.AckCommittedEntriesBeforeApplication`
which was ensuring that we don't acknowledge committed Raft log entries before
they are durable in a node's local Raft log. This is now ensured inside of the
etcd/raft library when AsyncStorageWrites is enabled, as the comment added here
describes.

This commit adds a testing facility to randomly disable non-blocking log syncs
in tests. This ensures that we properly handle the case where non-blocking log
sync is disabled while some log syncs are still in-progress.

This commit adds a testing facility to shuffle local raft messages before
delivering them to the raft state machine. These are not required to be in
order, so we shuffle in tests to ensure that re-ordering is handled properly.

Simple rename to make the semantics of these fields more clear and
harder to misinterpret.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/asyncRaftLogSimple branch from 1c2135f to ebc33fa on February 2, 2023 22:49
Member Author

@nvanbenschoten nvanbenschoten left a comment


Did you make any changes in it other than the rebase?

Just a rebase and addressing your comments. Nothing substantial beyond your suggested change to skip non-blocking log writes when a MsgStorageAppend is replacing existing log entries (in logstore.go).

I'm running this PR through the roachtest suite here to get a bit more confidence in it before we feel good about merging.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/replica.go line 405 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is a semantic change, right? This was previously durable. You've convinced me (in another comment thread) that it's fine that this is getting picked up by replicaRaftStorage. And when we read pebble, we'll always see the lastIndex entry with corresponding lastTerm as well, so I hope any other uses of this field would also be ok. I sleuthed a bit and there's

  • raft log queue (should be ok since it lowers to committed index anyway)
  • quiescing (ok)

It might still be worth renaming these inflightLastIndex and inflightLastTerm (not in love with inflight, maybe you can think of something better) because I think there is the risk of someone using them like they were durable. I concede that this is likely an imaginary fear but until we encapsulate these variables away from the "general public" I'd like to nurse that fear a little bit.

That's all fair. It would be an issue if a new user of these fields misinterpreted their semantics and assumed that they are talking about durability. I've updated the names to lastIndexNotDurable and lastTermNotDurable.


pkg/kv/kvserver/replica_raft.go line 1617 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

if you catch another CI cycle anyway, abd

Done.

@nvanbenschoten
Member Author

nvanbenschoten commented Feb 3, 2023

I'm running this PR through the roachtest suite here to get a bit more confidence in it before we feel good about merging.

We saw 14 individual roachtest failures, none of which look related and all of which have corresponding github tracking issues on master (4 were fixed yesterday). CI and roachtests are all looking good for this change.

@tbg tbg self-requested a review February 3, 2023 15:51
@nvanbenschoten
Member Author

TFTR!

bors r+

@tbg
Member

tbg commented Feb 3, 2023

Do you have a / would you be able to run a flavor of this experiment that is I/O bound instead?

Do you have any particular cases or test setups in mind? I'm happy to run more experiments so we can better understand the impact of this change.

Essentially I'm hoping that by moving the log writes out of the critical path of the raft ready loop, we'll somehow do "better" under I/O overload - lower handleRaftReady latencies, less build-up in the raft receive queues, and so on.

An apples-to-apples comparison could be helpful here - run a workload where we eventually see LSM inversion under both sync and async log appends, and simply investigate the differences.

It just so happens that @andrewbaptist is looking at such an experiment [1].

[1]: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1675371760534349

@craig
Contributor

craig bot commented Feb 3, 2023

Build failed:

@nvanbenschoten
Member Author

The test failure looks real, but I'm also able to hit it on master and at least as far back as 03a5c40. We seem to be leaking tracing spans of Raft proposals during server shutdown in some cases. I'll continue to investigate on the side.

bors r+

@craig
Contributor

craig bot commented Feb 3, 2023

Build succeeded:
