kv: collapse adjacent span configs to reduce number of necessary splits #72389
We should just do this for 22.1. It's not that difficult and would alleviate many of the concerns around thundering-splits-on-upgrades for multi-tenant clusters. I was originally thinking to coalesce adjacent ranges for tenant tables, leaving system tenant ranges split along table boundaries, but now that we can have >100k tables easily, perhaps gating it behind a cluster setting is better.
Prototyped this a while ago in #68491.
Fixes cockroachdb#72389. Fixes cockroachdb#66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system, especially when running with a large number of tables and/or tenants. To understand what the new set of split points is, consider the following test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).
    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10            database system (tenant)
     /Tenant/11            database system (tenant)
    +/Tenant/11/Table/106  range default
    +/Tenant/11/Table/111  num_replicas=42
    +/Tenant/11/Table/112  range default

This PR introduces two cluster settings to selectively opt into this optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled (defaulting to true and false respectively). We also don't coalesce system table ranges on the host tenant.

We had a few implementation choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory state when updating entries. On every update, we'd check whether the span we're writing to has the same config as the preceding and/or succeeding one, and if so, write out a larger span instead. We previously prototyped a form of this in cockroachdb#68491.

Pros:
- reduced memory footprint of spanconfig.Store
- read path (computing splits) stays fast -- just have to look up the right span from the backing interval tree

Cons:
- uses more persisted storage than necessary
- difficult to disable the optimization dynamically (though still possible -- we'd effectively have to restart the KVSubscriber and populate the in-memory span config state using a store that does/doesn't coalesce configs)
- difficult to implement; our original prototype did not have cockroachdb#73150, which is important for reducing reconciliation round trips

(b) Same as (a), but coalesce configs up in the spanconfig.Store maintained in the reconciler itself.

Pros:
- reduced storage use (both persisted and in-memory)
- read path (computing splits) stays fast -- just have to look up the right span from the backing interval tree

Cons:
- very difficult to disable the optimization dynamically (each tenant process would need to be involved)
- most difficult to implement

(c) Same as (a), but through another API on the spanconfig.Store interface that accepts only a single update at a time and does not generate deltas (not something we need down in KV). This removes the implementation complexity.

(d) Keep the contents of `system.span_configurations` and the in-memory state of spanconfig.Stores as they are today, uncoalesced. When determining split points, iterate through adjacent configs within the provided key bounds and see whether we can ignore certain split keys.

Pros:
- easiest to implement
- easy to disable the optimization dynamically, e.g. through a cluster setting

Cons:
- uses more storage (persisted and in-memory) than necessary
- read path (computing splits) is more expensive when iterating through adjacent configs

This PR implements option (d).
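To make option (d) concrete, here's a minimal sketch of the idea, not the actual spanconfig.Store code: the `entry` type, string keys, and the `coalesce` toggle are simplified stand-ins. The split-key computation walks the uncoalesced, sorted entries and skips any boundary whose neighboring configs match:

    package main

    import "fmt"

    // entry is a simplified stand-in for a span config entry: a half-open
    // key span [start, end) mapped to some comparable config.
    type entry struct {
        start, end string
        config     string
    }

    // computeSplitKey returns the first boundary strictly after `start`,
    // skipping boundaries where the two adjacent entries carry the same
    // config (the coalescing described in option (d)). Entries are assumed
    // sorted and contiguous. Returns "" if no split key is needed.
    func computeSplitKey(entries []entry, start string, coalesce bool) string {
        for i := 1; i < len(entries); i++ {
            boundary := entries[i].start
            if boundary <= start {
                continue
            }
            if coalesce && entries[i-1].config == entries[i].config {
                continue // adjacent entries agree; no need to split here
            }
            return boundary
        }
        return ""
    }

    func main() {
        entries := []entry{
            {"/Table/106", "/Table/111", "range default"},
            {"/Table/111", "/Table/112", "num_replicas=42"},
            {"/Table/112", "/Table/116", "range default"},
            {"/Table/116", "/Table/117", "range default"},
        }
        // With coalescing, /Table/116 is not a split point since both
        // sides carry "range default"; without it, every boundary splits.
        fmt.Println(computeSplitKey(entries, "/Table/112", true))  // ""
        fmt.Println(computeSplitKey(entries, "/Table/112", false)) // "/Table/116"
    }

Note how this matches the cons listed above: the stored state stays uncoalesced, and the read path may have to scan over many adjacent, identical entries before finding (or failing to find) a real split point.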
For a benchmark of how slow (d) is going to be in practice with varying numbers of entries to scan over (10k, 100k, 1m):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10        1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10      12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10    140591766 ns/op
    PASS

It's feasible that in the future we rework this in favor of (c).

Release note: None

Release justification: high-benefit change to existing functionality (affecting only multi-tenant clusters).
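As a rough illustration of the benchmark structure (not the real BenchmarkStoreComputeSplitKey), one could drive the toy computeSplitKey from the sketch above as sub-benchmarks keyed on entry count, placed in a `_test.go` file alongside it:

    package main

    import (
        "fmt"
        "testing"
    )

    // BenchmarkComputeSplitKey exercises the toy computeSplitKey over stores
    // of varying sizes, mirroring the num-entries sub-benchmarks quoted above.
    // It reuses the entry type and computeSplitKey from the earlier sketch.
    func BenchmarkComputeSplitKey(b *testing.B) {
        for _, numEntries := range []int{10_000, 100_000, 1_000_000} {
            b.Run(fmt.Sprintf("num-entries=%d", numEntries), func(b *testing.B) {
                entries := make([]entry, numEntries)
                for i := range entries {
                    entries[i] = entry{
                        start:  fmt.Sprintf("/Table/%08d", i),
                        end:    fmt.Sprintf("/Table/%08d", i+1),
                        config: "range default", // worst case: every boundary coalesces
                    }
                }
                b.ResetTimer()
                for n := 0; n < b.N; n++ {
                    _ = computeSplitKey(entries, entries[0].start, true)
                }
            })
        }
    }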
After #67679 and enabling the infrastructure by default, we'll end up splitting on every schema object boundary for all tenants (including, as before, the host tenant). That's mighty expensive, comes with a lot of overhead, and is a scalability bottleneck for how many tenants we can pack into the same cluster. #70555 proposes some rudimentary guardrails, but on the KV side we can do better.

The spanconfig.Store introduced in #70287 can be augmented to recognize that certain span config entries are adjacent to one another (S1.EndKey == S2.Key) and have the same configs. We expect the majority of span configs to qualify, given almost everything inherits from RANGE DEFAULT. When we recognize such an adjacency, we can avoid unconditionally splitting on that boundary. This should let us claw back a low range count with secondary tenants while still enabling multi-tenant zone configs.
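The adjacency check itself is the simple part; a minimal sketch of the predicate, using stand-in types rather than the real roachpb.Span and span config structs:

    package main

    import (
        "bytes"
        "fmt"
    )

    // span and config are simplified stand-ins for the real span and span
    // config types, which carry considerably more state.
    type span struct{ key, endKey []byte }
    type config struct{ numReplicas int }

    // canCoalesce reports whether two span config entries are adjacent
    // (s1.endKey == s2.key) and carry the same config, in which case there
    // is no need to unconditionally split at the shared boundary.
    func canCoalesce(s1, s2 span, c1, c2 config) bool {
        return bytes.Equal(s1.endKey, s2.key) && c1 == c2
    }

    func main() {
        t1 := span{[]byte("/Tenant/11/Table/106"), []byte("/Tenant/11/Table/107")}
        t2 := span{[]byte("/Tenant/11/Table/107"), []byte("/Tenant/11/Table/108")}
        def := config{numReplicas: 3}
        fmt.Println(canCoalesce(t1, t2, def, def)) // true: skip the split at /Table/107
    }

The interesting design questions are where to apply this check (in KV's store, in the reconciler, or at split-key computation time) and how to toggle it dynamically, which is what the option discussion above works through.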
Some neighboring issues:
Zendesk Ticket IDs
12099
Jira issue: CRDB-11130