spanconfig/store: support applying a batch of updates #73150
Conversation
Force-pushed from 8f7c068 to 6bdecb2.
Pretty cool change! Let's extend the testing a bit more to include deletes in addition to sets in batches, like I commented in-line, but other than that this feels close. 💯
Reviewed 15 of 17 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)
pkg/spanconfig/spanconfig.go, line 195 at r1 (raw file):
// StoreWriter is the write-only portion of the Store interface.
type StoreWriter interface {
  // Apply applies the given update[1]. It also returns the existing spans that
Let's update this text to talk about multiple updates. It's also worth calling out on the interface that the set of updates must be non-overlapping.
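For concreteness, a sketch of how the batched interface and its comment could read. The Apply signature is the one from this PR's commit message; the exact comment wording here is only a suggestion, not the author's text:

```go
// StoreWriter is the write-only portion of the Store interface. (Sketch only;
// the final wording is up to the author.)
type StoreWriter interface {
	// Apply applies the given non-overlapping batch of updates. It returns the
	// existing spans that were deleted and the entries that were newly added to
	// make room for the batch. The updates themselves must not overlap with one
	// another.
	Apply(ctx context.Context, dryrun bool, updates ...Update) (
		deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
	)
}
```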
pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go, line 338 at r1 (raw file):
s.mu.Lock()
for _, ev := range events {
  s.mu.internal.Apply(ctx, false /* dryrun */, ev.(*bufferEvent).Update)
Add a TODO here to push all updates down into the Store instead of applying them one by one, so that we can get rid of the locking here (from the conversation we had a couple of weeks ago).
pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):
deleted, added, err := s.applyInternal(dryrun, updates...)
if err != nil {
  log.Fatalf(ctx, "%v", err)
Any reason to fatal here instead of bubbling up the error to the caller?
pkg/spanconfig/spanconfigstore/store_test.go, line 60 at r1 (raw file):
// apply
// set [c,e):C
// delete [c,e)
This shouldn't be valid because the spans are overlapping, right? Let's update this comment with "real" inputs and outputs please.
pkg/spanconfig/spanconfigstore/store_test.go, line 216 at r1 (raw file):
updates[i] = getRandomUpdate()
}
sort.Slice(updates, func(i, j int) bool {
Instead of sorting here and then doing a linear scan to figure out overlaps, you can compare each pair of spans to figure out overlaps, so you don't pass sorted input to Apply.
pkg/spanconfig/spanconfigstore/testdata/batched/fractured, line 1 at r1 (raw file):
# Test semantics of batched updates that partially overlap with what's already
Should this be in the testdata/single folder instead? You're only using batched updates for the test setup.
pkg/spanconfig/spanconfigstore/testdata/batched/overlapping, line 1 at r1 (raw file):
# Test semantics of batched updates that partially overlap with what's already
This is a pretty cool test! We should extend it to include some deletes in the batched updates as well (maybe in a separate file, for readability).
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):
# Test semantics of batched updates that partially overlap with what's already
# present.
How're you viewing "fractured" vs. "straddle"? Looks like they're setting up things differently, but I'm confused by the terminology.
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 32 at r1 (raw file):
# keys  a b c d e f g h i j
# state [--A--) [--B--)
# set   [---X-------)
Let's add another test where we set [c,h]: X to make this span extend beyond on the right.
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 45 at r1 (raw file):
# keys  a b c d e f g h i j
# state [--A--) [--B--)
# set   [---X-------)
Similar to above, let's add a test where we set [a,f]: X to make this span extend beyond on the left.
Force-pushed from 6bdecb2 to d3bd68b.
Thanks for the review! PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
pkg/spanconfig/spanconfig.go, line 195 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Let's update this text to talk about multiple updates. It's also worth calling out on the interface that the set of updates must be non-overlapping.
Done.
pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber.go, line 338 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Add a TODO here to push all updates down into the Store instead of applying them one by one, so that we can get rid of the locking here (from the conversation we had a couple of weeks ago).
Done.
pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Any reason to fatal here instead of bubbling up the error to the caller?
applyInternal is the bubble-up-err version of this method that we're using in tests. Any errors here are data structure bugs; there's nothing the caller can do -- what's stored has either been corrupted or is being used incorrectly. Fatal-ing feels like the right thing to do.
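For illustration, a minimal sketch of the split being described, with signatures simplified from the quoted snippet and the commit message (not the verbatim implementation): the exported Apply fatals on any error from applyInternal, while tests call applyInternal directly and assert on the returned error.

```go
// Apply is the exported entry point; any error out of applyInternal indicates
// corrupted state or incorrect usage, so it's treated as fatal rather than
// returned to the caller.
func (s *Store) Apply(
	ctx context.Context, dryrun bool, updates ...spanconfig.Update,
) (deleted []roachpb.Span, added []roachpb.SpanConfigEntry) {
	deleted, added, err := s.applyInternal(dryrun, updates...)
	if err != nil {
		log.Fatalf(ctx, "%v", err)
	}
	return deleted, added
}
```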
pkg/spanconfig/spanconfigstore/store_test.go, line 60 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
This shouldn't be valid because the spans are overlapping, right? Let's update this comment with "real" inputs and outputs please.
Done.
pkg/spanconfig/spanconfigstore/store_test.go, line 216 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Instead of sorting here and then doing a linear scan to figure out overlaps, you can compare each pair of spans to figure out overlaps, so you don't pass sorted input to Apply.
Added a Shuffle instead.
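A sketch of the shape this ends up taking. getRandomUpdate is the helper from the quoted test; the function name, the rng plumbing, and the Span field/Overlaps call are assumptions based on the pseudo-code and roachpb conventions, not the exact test code:

```go
// genNonOverlappingUpdates builds a batch of random updates, rejecting any
// candidate that overlaps an already-accepted span (pairwise comparison, no
// sorting), then shuffles so Apply never sees sorted input.
func genNonOverlappingUpdates(rng *rand.Rand, n int) []spanconfig.Update {
	updates := make([]spanconfig.Update, 0, n)
	for len(updates) < n {
		candidate := getRandomUpdate()
		overlapping := false
		for _, u := range updates {
			if candidate.Span.Overlaps(u.Span) {
				overlapping = true
				break
			}
		}
		if !overlapping {
			updates = append(updates, candidate)
		}
	}
	rng.Shuffle(len(updates), func(i, j int) {
		updates[i], updates[j] = updates[j], updates[i]
	})
	return updates
}
```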
pkg/spanconfig/spanconfigstore/testdata/batched/overlapping, line 1 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
This is a pretty cool test! We should extend it to include some deletes in the batched updates as well (maybe in a separate file, for readability).
Done (added here for contrast with non-deletes).
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
How're you viewing "fractured" vs. "straddle"? Looks like they're setting up things differently, but I'm confused by the terminology.
Renamed to "encompass". I was distinguishing between updates straddling two spans and updates that encompassed multiple spans wholly. It's a flimsy distinction, but presented as edge cases we needed to consider.
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 32 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Let's add another test where we set [c,h]: X to make this span extend beyond on the right.
Done.
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 45 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Similar to above, let's add a test where we set [a,f]: X to make this span extend beyond on the left.
Done.
pkg/spanconfig/spanconfigstore/testdata/batched/fractured, line 1 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should this be in the testdata/single folder instead? You're only using batched updates for the test setup.
Added some more batched ops instead.
Reviewed 5 of 6 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)
pkg/spanconfig/spanconfig.go, line 262 at r2 (raw file):
// [1]: Unless dryrun is true. We'll still generate the same {deleted,added}
// lists.
// [2]: We could alternatively expose a GetAllOverlapping() API to make
I was re-reading your comment about "provided updates aren't no-ops" again, and that feels like an unfortunate implementation detail leaking into the interface. It also doesn't feel great to include the same span in both the added and deleted lists. These are just musings for the future, but maybe we should consider a GetAllOverlapping API at some point.
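Purely as a sketch of the musing: the method name comes from the [2] footnote quoted above, but its placement (StoreReader vs. StoreWriter) and exact signature are assumptions, not part of this PR.

```go
// overlapReader is one possible home for the mused-about API: given a span,
// return the existing entries overlapping it, leaving callers to compute
// their own deltas instead of relying on Apply's {deleted, added} lists.
type overlapReader interface {
	GetAllOverlapping(ctx context.Context, sp roachpb.Span) []roachpb.SpanConfigEntry
}
```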
pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
applyInternal is the bubble-up-err version of this method that we're using in tests. Any errors here are data structure bugs; there's nothing the caller can do -- what's stored has either been corrupted or is being used incorrectly. Fatal-ing feels like the right thing to do.
Yeah, but you would also fatal if the caller tries to Apply overlapping spans. No strong feelings though.
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Renamed to "encompass". I was distinguishing between updates straddling two spans and updates that encompassed multiple spans wholly. It's a flimsy distinction, but presented as edge cases we needed to consider.
The distinction makes sense. Let's update the comment at the top of the file to be a bit more specific about this?
We first introduced spanconfig.StoreWriter in cockroachdb#70287. Here we extend the interface to accept a batch of updates instead of just one.

    type StoreWriter interface {
        Apply(ctx context.Context, dryrun bool, updates ...Update) (
            deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
        )
    }

The implementation is subtle -- we're not processing one update at a time. The semantics we're after is applying a batch of updates atomically on a consistent snapshot of the underlying store. This comes up in the upcoming spanconfig.Reconciler (cockroachdb#71994) -- there, following a zone config/descriptor change, we want to update KV state in a single request/RTT instead of an RTT per descendant table. The intended usage is better captured in the aforementioned PR; let's now talk about what this entails for the data structure.

To apply a single update, we want to find all overlapping spans and clear out just the intersections. If the update is adding a new span config, we'll also want to add the corresponding store entry after. We do this by deleting all overlapping spans in their entirety and re-adding the non-overlapping segments. Pseudo-code:

    for entry in store.overlapping(update.span):
        union, intersection = union(update.span, entry), intersection(update.span, entry)
        pre  = span{union.start_key, intersection.start_key}
        post = span{intersection.end_key, union.end_key}

        delete {span=entry.span, conf=entry.conf}
        if entry.contains(update.span.start_key):
            # First entry overlapping with update.
            add {span=pre, conf=entry.conf} if non-empty
        if entry.contains(update.span.end_key):
            # Last entry overlapping with update.
            add {span=post, conf=entry.conf} if non-empty

    if adding:
        add {span=update.span, conf=update.conf} # add ourselves

When extending to a set of updates, things are more involved. Let's assume that the updates are non-overlapping and sorted by start key. As before, we want to delete overlapping entries in their entirety and re-add the non-overlapping segments. With multiple updates, it's possible that a segment being re-added will overlap another update. If processing one update at a time in sorted order, we want to only re-add the gap between the consecutive updates.

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)           [--B--)

When processing [a,c):A, after deleting [b,h):X, it would be incorrect to re-add [c,h):X since we're also looking to apply [g,i):B. Instead of re-adding the trailing segment right away, we carry it forward and process it when iterating over the second, possibly overlapping update. In our example, when iterating over [g,i):B we can subtract the overlap from [c,h):X and only re-add [c,g):X.

It's also possible for the segment to extend past the second update. In the example below, when processing [d,f):B and having [b,h):X carried over, we want to re-add [c,d):X and carry forward [f,h):X to the update after (i.e. [g,i):C)).

    keyspace         a  b  c  d  e  f  g  h  i  j
    existing state      [--------X--------)
    updates          [--A--)  [--B--)  [--C--)

One final note: we're iterating through the updates without actually applying any mutations. Going back to our first example, when processing [g,i):B, retrieving the set of overlapping spans would (again) retrieve [b,h):X -- an entry we've already encountered when processing [a,c):A. Re-adding non-overlapping segments naively would re-add [b,g):X -- an entry that overlaps with our last update [a,c):A. When retrieving overlapping entries, we need to exclude any that overlap with the segment that was carried over. Pseudo-code:

    carry-over = <empty>
    for update in updates:
        carried-over, carry-over = carry-over, <empty>
        if update.overlap(carried-over):
            # Fill in the gap between consecutive updates.
            add {span=span{carried-over.start_key, update.start_key}, conf=carried-over.conf}
            # Consider the trailing span after update; carry it forward if non-empty.
            carry-over = {span=span{update.end_key, carried-over.end_key}, conf=carried-over.conf}
        else:
            add {span=carried-over.span, conf=carried-over.conf} if non-empty

        for entry in store.overlapping(update.span):
            if entry.overlap(processed):
                continue # already processed

            union, intersection = union(update.span, entry), intersection(update.span, entry)
            pre  = span{union.start_key, intersection.start_key}
            post = span{intersection.end_key, union.end_key}

            delete {span=entry.span, conf=entry.conf}
            if entry.contains(update.span.start_key):
                # First entry overlapping with update.
                add {span=pre, conf=entry.conf} if non-empty
            if entry.contains(update.span.end_key):
                # Last entry overlapping with update.
                carry-over = {span=post, conf=entry.conf}

        if adding:
            add {span=update.span, conf=update.conf} # add ourselves

    add {span=carry-over.span, conf=carry-over.conf} if non-empty

We've extended the randomized testing suite to generate batches of updates at a time. We've also added a few illustrated datadriven tests.

Release note: None
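To make the single-update pseudo-code above concrete, here's a minimal, runnable Go sketch over a plain slice of entries. The real store uses an interval btree and the roachpb/spanconfig types, so Span, Entry, Update, and applyOne below are illustrative stand-ins only.

```go
package main

import "fmt"

// Span is a half-open key range [Start, End); Entry pairs a span with a
// config; Update either sets a config over a span or, with an empty Conf,
// deletes it. Simplified stand-ins for the real types.
type Span struct{ Start, End string }
type Entry struct {
	Span Span
	Conf string
}
type Update struct {
	Span Span
	Conf string // empty means delete
}

func overlaps(a, b Span) bool { return a.Start < b.End && b.Start < a.End }

// applyOne mirrors the single-update pseudo-code: every overlapping entry is
// deleted in its entirety, its non-overlapping prefix/suffix segments are
// re-added, and the update itself is added if it's a set.
func applyOne(entries []Entry, up Update) []Entry {
	var out []Entry
	for _, e := range entries {
		if !overlaps(e.Span, up.Span) {
			out = append(out, e)
			continue
		}
		if e.Span.Start < up.Span.Start { // leading segment ("pre")
			out = append(out, Entry{Span{e.Span.Start, up.Span.Start}, e.Conf})
		}
		if up.Span.End < e.Span.End { // trailing segment ("post")
			out = append(out, Entry{Span{up.Span.End, e.Span.End}, e.Conf})
		}
	}
	if up.Conf != "" {
		out = append(out, Entry{up.Span, up.Conf}) // add ourselves
	}
	return out
}

func main() {
	// keyspace       a  b  c  d  e  f  g  h
	// existing state    [--------X-----)
	// set            [--A--)
	state := []Entry{{Span{"b", "h"}, "X"}}
	state = applyOne(state, Update{Span{"a", "c"}, "A"})
	fmt.Println(state) // the trailing [c,h):X is re-added alongside the new [a,c):A
}
```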
Force-pushed from d3bd68b to ed14f74.
Thanks!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
pkg/spanconfig/spanconfig.go, line 262 at r2 (raw file):
Removed the no-op clause.
maybe we should consider a GetAllOverlapping API
There's that exact suggestion as a TODO below. But I don't think it ends up looking that bad even without it (#71994); I've actually come to prefer it. One peculiarity is that this read-only interface would intuitively sit on the StoreReader portion of the interface, but it has no parallel in the sys-cfg span implementation. It feels awkward to include it in the StoreWriter? I dunno, bikeshed territory.
It also doesn't feel great to include the same span in both the added and deleted list either.
I'm not sure what you mean. The "added" list also includes fragments that were "re-added" because the updates only partially overlapped with existing spans -- the remainders need to be re-added. We actually need this property to interface with the KVAccessor.
pkg/spanconfig/spanconfigstore/store.go, line 139 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Yeah, but you would also fatal if the caller tries to Apply overlapping spans. No strong feelings though.
Yea, that would mean the caller's using it incorrectly, aka a bug. I'm not sure that error's "retryable" either, so the fatal feels more appropriate.
pkg/spanconfig/spanconfigstore/testdata/batched/straddle, line 2 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
The distinction makes sense. Let's update the comment at the top of the file to be a bit more specific about this?
Done.
Build succeeded.
77975: sql,cli: implement `\password` cli command r=ZhouXing19 a=ZhouXing19

This commit is to add support for the `\password` command that securely changes the password for a user.

fixes #48543

Release note (cli change): add support for the `\password` cli command that enables secure alteration of the password for a user. The given password will always be pre-hashed with the password hash method obtained via the session variable `password-encryption`, i.e. `scram-sha-256` as the default hashing algorithm.

79700: spanconfig,kv: merge adjacent ranges with identical configs r=irfansharif a=irfansharif

**spanconfig,kv: merge adjacent ranges with identical configs**

Fixes #72389. Fixes #66063 (gated behind a cluster setting). This should drastically reduce the total number of ranges in the system, especially when running with a large number of tables and/or tenants. To understand what the new set of split points is, consider the following test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).
    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                     database system (tenant)
     /Tenant/11                     database system (tenant)
    +/Tenant/11/Table/106           range default
    +/Tenant/11/Table/111           num_replicas=42
    +/Tenant/11/Table/112           range default

This PR introduces two cluster settings to selectively opt into this optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled (defaulting to true and false respectively). We also don't coalesce system table ranges on the host tenant.

We had a few implementation choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory state when updating entries. On every update, we'd check if the span we're writing to has the same config as the preceding and/or succeeding one, and if so, write out a larger span instead. We previously prototyped a form of this in #68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still possible -- we'd effectively have to restart the KVSubscriber and populate in-memory span config state using a store that does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have #73150 which is important for reducing reconciliation round trips

(b) Same as (a) but coalesce configs up in the spanconfig.Store maintained in the reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a) but through another API on the spanconfig.Store interface that accepts only a single update at a time and does not generate deltas (not something we need down in KV). Removes the implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory state of spanconfig.Stores as they are today, uncoalesced. When determining split points, iterate through adjacent configs within the provided key bounds and see whether we could ignore certain split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, for ex. through a cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive if iterating through adjacent configs

This PR implements option (d). For a benchmark on how slow (d) is going to be in practice with varying numbers of entries to be scanning over (10k, 100k, 1m):

```
$ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
BenchmarkStoreComputeSplitKey
BenchmarkStoreComputeSplitKey/num-entries=10000
BenchmarkStoreComputeSplitKey/num-entries=10000-10        1166842 ns/op
BenchmarkStoreComputeSplitKey/num-entries=100000
BenchmarkStoreComputeSplitKey/num-entries=100000-10      12273375 ns/op
BenchmarkStoreComputeSplitKey/num-entries=1000000
BenchmarkStoreComputeSplitKey/num-entries=1000000-10    140591766 ns/op
PASS
```

It's feasible that in the future we re-work this in favor of (c).

---

**spanconfig/store: use templatized btree impl instead**

```
$ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
BenchmarkStoreComputeSplitKey
BenchmarkStoreComputeSplitKey/num-entries=10000
BenchmarkStoreComputeSplitKey/num-entries=10000-10         431093 ns/op
BenchmarkStoreComputeSplitKey/num-entries=100000
BenchmarkStoreComputeSplitKey/num-entries=100000-10       4308200 ns/op
BenchmarkStoreComputeSplitKey/num-entries=1000000
BenchmarkStoreComputeSplitKey/num-entries=1000000-10     43827373 ns/op
PASS

$ benchstat old.txt new.txt # from previous commit
name                                         old time/op  new time/op  delta
StoreComputeSplitKey/num-entries=10000-10    1.17ms ± 0%  0.43ms ± 0%  ~  (p=1.000 n=1+1)
StoreComputeSplitKey/num-entries=100000-10   12.4ms ± 0%   4.3ms ± 0%  ~  (p=1.000 n=1+1)
StoreComputeSplitKey/num-entries=1000000-10   136ms ± 0%    44ms ± 0%  ~  (p=1.000 n=1+1)
```

---

**spanconfig/store: intern span configs**

- Avoid the expensive proto.Equal() when computing split keys;
- Reduce memory overhead of the data structure

```
$ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
BenchmarkStoreComputeSplitKey
BenchmarkStoreComputeSplitKey/num-entries=10000
BenchmarkStoreComputeSplitKey/num-entries=10000-10          90323 ns/op
BenchmarkStoreComputeSplitKey/num-entries=100000
BenchmarkStoreComputeSplitKey/num-entries=100000-10        915936 ns/op
BenchmarkStoreComputeSplitKey/num-entries=1000000
BenchmarkStoreComputeSplitKey/num-entries=1000000-10      9575781 ns/op

$ benchstat old.txt new.txt # from previous commit
name                                         old time/op  new time/op  delta
StoreComputeSplitKey/num-entries=10000-10     431µs ± 0%    90µs ± 0%  ~  (p=1.000 n=1+1)
StoreComputeSplitKey/num-entries=100000-10   4.31ms ± 0%  0.92ms ± 0%  ~  (p=1.000 n=1+1)
StoreComputeSplitKey/num-entries=1000000-10  43.8ms ± 0%   9.6ms ± 0%  ~  (p=1.000 n=1+1)
```

Release note: None

Release justification: high benefit change to existing functionality (affecting only multi-tenant clusters).

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>