-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage,kv: faster intent resolution over a key range #66268
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no tests yet -- I'll add after initial opinions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go, line 49 at r1 (raw file):
onlySeparatedIntents := false stats := cArgs.EvalCtx.GetMVCCStats() if stats.ContainsEstimates == 0 && stats.IntentCount == stats.SeparatedIntentCount {
This question is whether we should trust this or not
@tbg @nvanbenschoten
One way to strengthen this would be to additionally gate this on the long -running migration being written by @itsbilal having finished running across the whole cluster. The plan is for that to run soon after the cluster version is finalized at 21.2. It is a correctness requirement for that to not leave any interleaved intents, since we will rip out the code that looks for interleaved intents in 22.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to see this. Clearly we want this, the only question in my mind is whether we can afford to have it wait until later (i.e. one release after the migration), but I don't think so since there isn't much (any, really?) complexity here that would be avoided later. I heavily lean towards "just" using the cluster version to gate this functionality, unless there are arguments for being more aggressive. It's just the boring, safe, option.
I haven't given this a detailed review yet, have focused on the specific question of gating the mechanism, and want to understand the comment below:
The percent-flushed={0,50} sometimes show a regression.
These are atypical cases where the SingleDelete to
remove the intent has not been flushed so the intent
Set and SingleDelete are both alive. This would be
a pathalogical case when the LSM is unhealthy. Even
then the regression is < 100%.
For my education, can you explain what scenario this is? In the benchmark you set up intents, so there aren't any single deletes in the dataset on top of which you repeatedly open the batches. MVCCResolveWriteIntentRange
is the thing writing the SingleDeletes
and once it's deleted something it has handled that intent and won't return in the same batch, right? I'm missing how you even care about where the SingleDelete
s go. Also, and this is unsurprising since I'm already lost at this point, I don't understand why this would be the case in practice in pathological LSMs.
Also, there is a +166% regression in the data you posted.
Reviewed 8 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @nvanbenschoten)
pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go, line 49 at r1 (raw file):
There's reason to be hesitant about using the stats for something that could affect correctness or availability. For a while we had our fair share of stats bugs where the numbers just weren't right. This is a few years back, and our testing has improved, but I don't know that future changes to the stats couldn't have introduced a silent, so far benign, problem. That said, I don't know of anything concrete here. This should just work. We could gather more evidence from CockroachCloud - if the stats were "wrong" in the sense that they don't agree with recomputation (but all of the replica have the same stats), the consistency checker issues a delta that patches them up. I think SREs could look for this log line
cockroach/pkg/kv/kvserver/replica_consistency.go
Lines 250 to 259 in 83fbe72
// We've found that there's something to correct; send an RecomputeStatsRequest. Note that this // code runs only on the lease holder (at the time of initiating the computation), so this work // isn't duplicated except in rare leaseholder change scenarios (and concurrent invocation of // RecomputeStats is allowed because these requests block on one another). Also, we're // essentially paced by the consistency checker so we won't call this too often. log.Infof(ctx, "triggering stats recomputation to resolve delta of %+v", results[0].Response.Delta) req := roachpb.RecomputeStatsRequest{ RequestHeader: roachpb.RequestHeader{Key: startKey}, }
which they shouldn't generally see outside of a few expected cases (we introduced AbortSpanBytes
, etc); none of which touch the IntentCount
field.
Still, there hasn't been much attention paid to the correctness of stats and so I wouldn't want to start relying on it if we don't have to. The cluster version seems like the obvious choice. Using stats would require additional motivation - such as avoiding a performance drop prior to the migration completing. I'd hope that we've gotten the perf overhead "close enough" to vanilla 21.1 to avoid issues here.
As an aside, if the version is active, but IntentCount
is not equal to SeparatedIntentCount
, something is amiss, so we should catch that at least in our internal testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The percent-flushed={0,50} sometimes show a regression.
These are atypical cases where the SingleDelete to
remove the intent has not been flushed so the intent
Set and SingleDelete are both alive. This would be
a pathalogical case when the LSM is unhealthy. Even
then the regression is < 100%.For my education, can you explain what scenario this is? In the benchmark you set up intents, so there aren't any single deletes in the dataset on top of which you repeatedly open the batches. MVCCResolveWriteIntentRange is the thing writing the SingleDeletes and once it's deleted something it has handled that intent and won't return in the same batch, right?
The SingleDeletes are indeed happening during intent resolution, but not for the part that the benchmark is timing, where we just discard the batch and don't measure the time to commit it (since we want to be able to repeatedly resolve the same intents). The setupKeysWithIntent
is writing each version with an intent and resolving all the intents except some of the latest versions that the benchmark iterations will work with. That is where the SingleDeletes are getting written to Pebble.
I'm missing how you even care about where the SingleDeletes go. Also, and this is unsurprising since I'm already lost at this point, I don't understand why this would be the case in practice in pathological LSMs.
The bad case is when all the SingleDeletes, Sets still exist in the LSM, since with the separated locks, which have the txnID in the key suffix, the latest live intent (i.e., without the SingleDelete) is probabilistically somewhere in the middle of all these keys which share the same roachpb.Key
. The iteration code (neither the one on master, or the optimized path here) can avoid iterating over them inside pebble.Iterator
, which is wasteful. The unoptimized path only needs to iterate over 1/2 of them, the ones preceding the still live intent, after which it seeks using nextKey
. The optimized path does a NextEngineKey
(since seeks generally are more expensive) and also encounters the deleted ones that are immediately after the live intent (hence the 166% regression). I have never seen this in practice in our tpcc or kv roachtests, but it occurs often in our unit tests since we have so little data that everything is still sitting in the memtable. setupKeysWithIntent
explicitly compensates for this by flushing (based on the numFlushedVersions
parameter). Another case where it can occur is if the LSM became unhealthy -- say lots of bytes accumulated in L0 (and the Set and SingleDelete were in different flushes to L0, hence in different L0 sstables).
Also, there is a +166% regression in the data you posted.
oops, I misread. Fixed.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, and @tbg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation! I understand this now, though trying to corroborate it in setupKeysWithIntent
was challenging. That method has a ton of params and logic and perhaps some of your prose above can be turned into documentation for it. I have a feeling that this is the kind of method that might both be touched by someone else in the future and be challenging for them to understand the original motivations for.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, and @tbg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 8 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go, line 49 at r1 (raw file):
Clearly we want this, the only question in my mind is whether we can afford to have it wait until later (i.e. one release after the migration), but I don't think so since there isn't much (any, really?) complexity here that would be avoided later. I heavily lean towards "just" using the cluster version to gate this functionality, unless there are arguments for being more aggressive.
I think it's clearer to you two than it is to me what the alternatives for a migration we have are. When we say "use the cluster version" here, do you mean wait until 22.1 to ensure that the long-running migration added in 21.2 will already be complete? Or do you mean wait until that long-running migration has completed in 22.1 and then switch over? The second option is what I would have expected, and I think it allows us to avoid any dependency on the correctness of MVCCStats.
Is there a downside to waiting on this long-running migration to complete? The only one I can think of is that we can't start using this fast-path on some ranges until all ranges have been migrated over. But is that much of a concern?
pkg/storage/mvcc.go, line 2788 at r2 (raw file):
// mvccResolveWriteIntent. The MVCCIterator is positioned lazily, only if // needed -- the fast path for intent resolution when a transaction is // committing does not need to position the MVCCIterator.
Is the iterator positioned if the transaction is committing and needs to change its commit timestamp from the provisional values existing timestamp? What about when a transaction is aborted? I believe the answer is "yes" and "yes", but it would be worth confirming and adding to this comment.
pkg/storage/mvcc.go, line 2884 at r2 (raw file):
} func mvccGetIntent(
Could we give this a comment?
pkg/storage/mvcc.go, line 3154 at r2 (raw file):
var unsafeNextKey MVCCKey var unsafeNextValue []byte if nextKey.IsValue() {
Is this just checking that lastKey's timestamp wasn't 0,1
? If so, it may be worth a comment that this is the common case.
pkg/storage/mvcc.go, line 3309 at r2 (raw file):
var putBuf *putBuffer // Exactly one of sepIter and mvccIter is non-nil.
Some commentary on which cases each is used would be nice.
pkg/storage/mvcc.go, line 3394 at r2 (raw file):
var ok bool if !key.IsValue() { // When we start allowing existence of multiple intents on the same key,
This was optimistic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go, line 49 at r1 (raw file):
Or do you mean wait until that long-running migration has completed in 22.1 and then switch over? The second option is what I would have expected, and I think it allows us to avoid any dependency on the correctness of MVCCStats.
I assume you meant 21.2. Yes, that was the option I was thinking of, and I believe that was what Tobi was saying too. Waiting for 22.1 will not really save us much code -- the changes here are mostly for the optimized path, and there are only small tweaks to the unoptimized legacy path, and the latter is what we would delete in 22.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go, line 49 at r1 (raw file):
Previously, sumeerbhola wrote…
Or do you mean wait until that long-running migration has completed in 22.1 and then switch over? The second option is what I would have expected, and I think it allows us to avoid any dependency on the correctness of MVCCStats.
I assume you meant 21.2. Yes, that was the option I was thinking of, and I believe that was what Tobi was saying too. Waiting for 22.1 will not really save us much code -- the changes here are mostly for the optimized path, and there are only small tweaks to the unoptimized legacy path, and the latter is what we would delete in 22.1.
Yes, if 21.1 is active, there shouldn't (mustn't) be any more inline intents, so we can always take this path. I think that's the canonical choice here and looks like we all agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
I'll hold off on this PR until the separated intents migration is merged since I need to gate this fast path on that migration being complete.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, and @tbg)
pkg/storage/mvcc.go, line 2788 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the iterator positioned if the transaction is committing and needs to change its commit timestamp from the provisional values existing timestamp? What about when a transaction is aborted? I believe the answer is "yes" and "yes", but it would be worth confirming and adding to this comment.
Correct. Done.
pkg/storage/mvcc.go, line 2884 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could we give this a comment?
Done
pkg/storage/mvcc.go, line 3154 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this just checking that lastKey's timestamp wasn't
0,1
? If so, it may be worth a comment that this is the common case.
Done
pkg/storage/mvcc.go, line 3309 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Some commentary on which cases each is used would be nice.
Done
pkg/storage/mvcc.go, line 3394 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This was optimistic.
:)
Removed since it isn't relevant. The separated intents case already looks at each intent.
f465069
to
7a0f169
Compare
0b3ee28
to
14cef4d
Compare
I've updated this with tests, and to use the post migration cluster version. PTAL |
4e9a819
to
97159cc
Compare
@nvanbenschoten @itsbilal gentle ping on this -- I'd like to merge this early in the stability period. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks great to me! minor comments mostly just about comments
Reviewed 13 of 14 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/replica_write.go, line 757 at r5 (raw file):
// The lock spans are insufficient for ranged intent resolution, which // does not declare lock spans and directly calls // spanSetBatch.NewEngineIterator. TODO(sumeer): we can't keep adding
TODO should be on a new line.
pkg/kv/kvserver/store.go, line 237 at r5 (raw file):
// This is before PostSeparatedIntentsMigration, so we may have // interleaved intents. version = clusterversion.ByKey(clusterversion.SeparatedIntentsMigration)
Why not do version = clusterversion.ByKey(clusterversion.SeparatedIntentsMigration - 1)
so it's clear the migration hasn't run? Makes little difference though, but just makes the purpose clearer. I can see a cluster version at clusterversion.SeparatedIntentsMigration
being seen as "all live replicas have been migrated" while in reality we could still be writing interleaved intents with the knob in tests (a criticism that applies more with the other instance of this line below).
pkg/kv/kvserver/batcheval/cmd_resolve_intent_test.go, line 277 at r5 (raw file):
// Before PostSeparatedIntentsMigration, so intent resolution will // not assume only separated intents. version = clusterversion.ByKey(clusterversion.SeparatedIntentsMigration)
Same
pkg/storage/mvcc.go, line 3324 at r5 (raw file):
var putBuf *putBuffer // Exactly one of sepIter and mvccIter is non-nil. sepIter is used when // onlySeparatedIntents=true, and mvccIter when false. The former allows for
sepIter is used when onlySeparatedIntents=true and rw provides consistent iterators
right? maybe the latter is not important if only pebble snapshots/batches will be passed in, and never an engine.
pkg/storage/mvcc.go, line 3336 at r5 (raw file):
// ConsistentIterators() since we want the two iterators to be mutually // consistent (and production code will have consistent iterators). // TODO(sumeer): when removing the slow path, use
Nit: empty line before TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 14 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 463 at r5 (raw file):
// Some tests have st == nil. if st != nil { onlySeparatedIntents = !st.Version.ActiveVersionOrEmpty(ctx).Less(
Are you intentionally avoiding IsActive
? I have the same question elsewhere.
pkg/kv/kvserver/batcheval/cmd_resolve_intent_test.go, line 228 at r5 (raw file):
rir.EndKey = endKey declareKeysResolveIntentRange(&desc, h, &rir, &spans, nil)
I know this isn't all new, but why are we declaring keys in this test? Is that needed?
pkg/storage/mvcc.go, line 3403 at r5 (raw file):
} } else { mvccIter.SeekGE(nextKey)
It's a little strange to me as a reader where the two iterators are seeked. sepIter
is seeked before the loop and then at the end of each iteration. mvccIter
is seeked at the beginning of each iteration. Would it be worth reconciling this difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @itsbilal and @nvanbenschoten)
pkg/kv/kvserver/replica_write.go, line 757 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
TODO should be on a new line.
Done
pkg/kv/kvserver/store.go, line 237 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Why not do
version = clusterversion.ByKey(clusterversion.SeparatedIntentsMigration - 1)
so it's clear the migration hasn't run? Makes little difference though, but just makes the purpose clearer. I can see a cluster version atclusterversion.SeparatedIntentsMigration
being seen as "all live replicas have been migrated" while in reality we could still be writing interleaved intents with the knob in tests (a criticism that applies more with the other instance of this line below).
Done
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 463 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Are you intentionally avoiding
IsActive
? I have the same question elsewhere.
Silliness. Fixed here and elsewhere.
pkg/kv/kvserver/batcheval/cmd_resolve_intent_test.go, line 228 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I know this isn't all new, but why are we declaring keys in this test? Is that needed?
They are being used by spanSetBatch
that we created earlier. We are just initializing the spans later, but before doing reads and writes. I've added a comment.
pkg/kv/kvserver/batcheval/cmd_resolve_intent_test.go, line 277 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Same
Done
pkg/storage/mvcc.go, line 3324 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
sepIter is used when onlySeparatedIntents=true and rw provides consistent iterators
right? maybe the latter is not important if only pebble snapshots/batches will be passed in, and never an engine.
Updated the comment. We almost never use an engine for production paths. But tests often do.
pkg/storage/mvcc.go, line 3336 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Nit: empty line before TODO
Done
pkg/storage/mvcc.go, line 3403 at r5 (raw file):
sepIter is seeked before the loop and then at the end of each iteration. mvccIter is seeked at the beginning of each iteration. Would it be worth reconciling this difference?
I've added some code comments to explain the difference. The main reason is that we step the sepIter
and only seek it once, so we do the seek before the loop starts. The calls to sepIter.nextEngineKey
need to happen at the end of an iteration to prepare for the next iteration. This is important for performance. While mvccIter
is always being seeked, so we can consistently do it at the beginning of the iteration.
0dfdc68
to
f2ee987
Compare
I can't reproduce the |
04763d1
to
7ff0b9e
Compare
I've added a randomized ranged intent resolution test to compare the results using the fast and slow paths. |
When all the intents on a range are known to be separated, as indicated by MCCStats, and a ResolveIntentRangeRequest needs to be processed (for a transaction that did enough writes that we did not track individual intent keys), we avoid iterating over MVCC key-value pairs to find those intents, and instead iterate over the separated lock table key space. A new iterForKeyVersions interface is introduced which provides a narrow iterator interface to be used by mvccResolveWriteIntent that resolves individual intents. This interface is only for iterating over a single key, and is implemented by the usual MVCCIterator and by the new separatedIntentAndVersionIter. The latter serves both as a way to find intents over a range and for individual intent resolution, in an optimized manner. There are new benchmarks which vary the sparseness (sparseness of N means that 1 of every N key has an intent that is for the txn for which intent resolution is being performed) and whether the remaining N-1 keys have an intent from a different transaction (other-txn-intents=true) or not (the former results in the cost of at least parsing the MVCCMetadata to compare txn IDs). Full results are below. Two things to note are: - The percent-flushed={0,50} sometimes show a regression. These are atypical cases where the SingleDelete to remove the intent has not been flushed so the intent Set and SingleDelete are both alive. This would be a pathalogical case when the LSM is unhealthy. Even then the regression is < 170%. - The percent-flushed=100 cases are the typical one. Once sparseness increases to 100 or 1000 we see speed gains close to 100x, as indicated by the following: ``` IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=100-16 235µs ± 2% 58µs ± 3% -75.52% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=100-16 154µs ± 2% 2µs ± 4% -98.72% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=100-16 137µs ± 1% 1µs ± 2% -99.07% (p=0.008 n=5+5) ``` Full results: ``` name old time/op new time/op delta IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=0-16 450µs ± 5% 234µs ± 2% -48.07% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=50-16 294µs ± 3% 101µs ± 4% -65.49% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=100-16 235µs ± 2% 58µs ± 3% -75.52% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=0-16 494µs ± 7% 218µs ± 6% -55.84% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=50-16 251µs ± 2% 57µs ± 3% -77.29% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=100-16 154µs ± 2% 2µs ± 4% -98.72% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=0-16 343µs ± 5% 236µs ± 2% -31.28% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=50-16 240µs ± 1% 74µs ± 2% -69.02% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=100-16 203µs ± 1% 30µs ± 1% -84.97% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=0-16 516µs ± 4% 274µs ±13% -46.91% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=50-16 256µs ± 2% 57µs ± 1% -77.67% (p=0.016 n=4+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=100-16 137µs ± 1% 1µs ± 2% -99.07% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=0-16 325µs ± 2% 237µs ± 1% -26.98% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=50-16 232µs ± 4% 72µs ± 1% -68.89% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=100-16 199µs ± 1% 30µs ± 0% -84.87% (p=0.016 n=5+4) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=0-16 2.08ms ± 1% 2.68ms ± 8% +28.57% (p=0.016 n=4+5) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=50-16 592µs ± 4% 596µs ± 7% ~ (p=1.000 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=100-16 242µs ± 2% 59µs ± 3% -75.73% (p=0.016 n=4+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=0-16 3.04ms ± 5% 2.60ms ± 3% -14.59% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=50-16 840µs ± 4% 539µs ± 3% -35.87% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=100-16 163µs ± 6% 2µs ± 6% -98.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=0-16 936µs ±38% 2497µs ±12% +166.83% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=50-16 281µs ±12% 551µs ± 2% +95.64% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=100-16 209µs ±13% 31µs ± 9% -84.95% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1000/other-txn-intents=false/percent-flushed=0-16 3.04ms ± 6% 2.55ms ± 0% ~ (p=0.333 n=5+1) name old alloc/op new alloc/op delta IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=0-16 12.8kB ± 0% 13.0kB ± 0% +0.92% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=50-16 12.8kB ± 0% 13.0kB ± 0% +0.94% (p=0.016 n=5+4) IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=100-16 12.9kB ± 0% 13.0kB ± 0% +0.59% (p=0.016 n=5+4) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=0-16 1.75kB ± 0% 0.29kB ± 0% ~ (p=0.079 n=4+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=50-16 1.75kB ± 0% 0.29kB ± 0% -83.18% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=100-16 1.75kB ± 0% 0.29kB ± 0% -83.22% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=0-16 12.8kB ± 0% 11.4kB ± 0% -11.37% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=50-16 12.8kB ± 0% 11.4kB ± 0% ~ (p=0.079 n=4+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=100-16 12.8kB ± 0% 11.4kB ± 0% -11.41% (p=0.000 n=5+4) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=0-16 1.66kB ± 0% 0.17kB ± 0% -89.61% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=50-16 1.66kB ± 0% 0.17kB ± 0% -89.61% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=100-16 1.66kB ± 0% 0.17kB ± 0% -89.63% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=0-16 12.8kB ± 0% 11.4kB ± 0% -11.50% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=50-16 12.8kB ± 0% 11.4kB ± 0% ~ (p=0.079 n=4+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=100-16 12.8kB ± 0% 11.4kB ± 0% -11.53% (p=0.029 n=4+4) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=0-16 12.8kB ± 0% 13.0kB ± 0% +0.95% (p=0.016 n=5+4) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=50-16 12.8kB ± 0% 13.0kB ± 0% +0.95% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=100-16 12.9kB ± 0% 13.0kB ± 0% +0.54% (p=0.016 n=5+4) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=0-16 1.75kB ± 0% 0.29kB ± 0% -83.14% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=50-16 1.75kB ± 0% 0.29kB ± 0% -83.19% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=100-16 1.75kB ± 0% 0.29kB ± 0% -83.22% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=0-16 12.8kB ± 0% 11.4kB ± 0% -11.37% (p=0.000 n=5+4) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=50-16 12.8kB ± 0% 11.4kB ± 0% -11.37% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=100-16 12.8kB ± 0% 11.4kB ± 0% -11.41% (p=0.000 n=5+4) IntentRangeResolution/separated=true/versions=100/sparseness=1000/other-txn-intents=false/percent-flushed=0-16 1.66kB ± 0% 0.17kB ± 0% -89.56% (p=0.000 n=5+1) name old allocs/op new allocs/op delta IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=0-16 401 ± 0% 404 ± 0% +0.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=50-16 401 ± 0% 404 ± 0% +0.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=100-16 401 ± 0% 404 ± 0% +0.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=0-16 104 ± 0% 8 ± 0% -92.31% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=50-16 104 ± 0% 8 ± 0% -92.31% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=100-16 104 ± 0% 8 ± 0% -92.31% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=0-16 401 ± 0% 305 ± 0% -23.94% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=50-16 401 ± 0% 305 ± 0% -23.94% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=100-16 401 ± 0% 305 ± 0% -23.94% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=0-16 102 ± 0% 3 ± 0% -97.06% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=50-16 102 ± 0% 3 ± 0% -97.06% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=100-16 102 ± 0% 3 ± 0% -97.06% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=0-16 401 ± 0% 303 ± 0% -24.44% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=50-16 401 ± 0% 303 ± 0% -24.44% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=100-16 401 ± 0% 303 ± 0% -24.44% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=0-16 401 ± 0% 404 ± 0% +0.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=50-16 401 ± 0% 404 ± 0% +0.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=100-16 401 ± 0% 404 ± 0% +0.75% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=0-16 104 ± 0% 8 ± 0% -92.31% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=50-16 104 ± 0% 8 ± 0% -92.31% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=100-16 104 ± 0% 8 ± 0% -92.31% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=0-16 401 ± 0% 305 ± 0% -23.94% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=50-16 401 ± 0% 305 ± 0% -23.94% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=100-16 401 ± 0% 305 ± 0% -23.94% (p=0.008 n=5+5) IntentRangeResolution/separated=true/versions=100/sparseness=1000/other-txn-intents=false/percent-flushed=0-16 102 ± 0% 3 ± 0% -97.06% (p=0.000 n=5+1) ``` Release note (performance improvement): Intent resolution for transactions that write many intents such that we track intent ranges, for the purpose of intent resolution, is much faster (potentially 100x) when using the separated lock table. Release justification: Fix for high-severity poor performance in existing functionality, and is a low risk, high benefit change, that potentially improves ranged intent resolution speed by 100x.
bors r+ |
Build succeeded: |
Closes #66900
When all the intents on a range are known to be separated,
as indicated by MCCStats, and a ResolveIntentRangeRequest
needs to be processed (for a transaction that did enough
writes that we did not track individual intent keys), we
avoid iterating over MVCC key-value pairs to find those
intents, and instead iterate over the separated lock table
key space.
A new iterForKeyVersions interface is introduced which
provides a narrow iterator interface to be used by
mvccResolveWriteIntent that resolves individual intents.
This interface is only for iterating over a single key,
and is implemented by the usual MVCCIterator and by the
new separatedIntentAndVersionIter. The latter serves
both as a way to find intents over a range and for
individual intent resolution, in an optimized manner.
There are new benchmarks which vary the sparseness
(sparseness of N means that 1 in every N keys has an
intent that is for the txn for which intent resolution
is being performed) and whether the remaining N-1 keys
have an intent from a different transaction
(other-txn-intents=true) or not (the former results
in the cost of at least parsing the MVCCMetadata to
compare txn IDs).
Full results are below. Two things to note are:
These are atypical cases where the SingleDelete to
remove the intent has not been flushed so the intent
Set and SingleDelete are both alive. This would be
a pathalogical case when the LSM is unhealthy. Even
then the regression is < 170%.
Once sparseness increases to 100 or 1000 we see speed
gains close to 100x, as indicated by the following:
Full results:
Release note (performance improvement): Intent resolution for transactions
that write many intents such that we track intent ranges, for the purpose
of intent resolution, is much faster (potentially 100x) when using the
separated lock table.
Release justification: Fix for high-severity poor performance in existing
functionality, and is a low risk, high benefit change, that potentially
improves ranged intent resolution speed by 100x.