
spanconfig: revisit 3-replica default for keyspace not referenced by SQL zone configs #91238

Closed
tbg opened this issue Nov 3, 2022 · 9 comments
Labels: A-zone-configs; C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception); T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments


tbg commented Nov 3, 2022

Is your feature request related to a problem? Please describe.

In https://github.com/cockroachlabs/support/issues/1858 we have a customer who set all of their zone configs to 5x replication. However, spans no longer referenced by any zone config (for example ranges that still haven't merged away, and may not for some time, perhaps weeks) fall back to the default span config, which has 3x replication regardless of the default zone config.

This is awkward, since even though these ranges are not referenced by SQL schema, losing availability of them still causes issues. Replication changes on these ranges can leave blocking intents on meta2; metrics will pick up on the unavailability and set off operator alerts, etc.

Ideally there would be a way for the system tenant to specify, in some way, the replication factor for anything not referenced by the SQL zone configs.
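
For concreteness, this is roughly the shape of the mismatch (a sketch only; the 5x replication factor mirrors the escalation above, and any spans not picked up by SQL zone configs are what fall through to the fallback):

```sql
-- Bump the default zone config (and, via inheritance, all SQL-owned spans) to 5x.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

-- Individual databases/tables can be bumped explicitly as well.
ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 5;

-- A range whose span is no longer referenced by any table or zone config
-- (e.g. left over from a dropped table that hasn't merged away) still gets
-- the hardcoded fallback span config with num_replicas = 3.
```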

Jira issue: CRDB-21168

tbg added the C-enhancement label Nov 3, 2022
blathers-crl bot added the T-kv (KV Team) label Nov 3, 2022

tbg commented Nov 3, 2022

> (for example ranges that still haven't merged away, and may not for some time, perhaps weeks)

Actually they're probably around for good. We have a situation like this:

  • r1000 /Table/15-/Table/16 (part of a SQL table, i.e. not guaranteed small)
  • r1001 /Table/16-/Table/27 (empty; used to hold tables that have since been dropped, removed from the schema, and are no longer referenced in zone configs)
  • r1002 /Table/27-/Table/28 (part of a SQL table, i.e. not guaranteed small)

r1001 won't ever be merged away, as that would require r1000 to be under the size threshold.
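
For illustration, one way to see which config such an orphaned range is actually carrying (a sketch only; it assumes the crdb_internal.ranges columns available in the 22.x era, and the range ID is the hypothetical r1001 from the example above):

```sql
-- Inspect the orphaned range; with RANGE DEFAULT at 5x we'd still expect to
-- see only three store IDs in `replicas`, since this span is covered by the
-- hardcoded fallback span config rather than any SQL zone config.
SELECT range_id, start_pretty, end_pretty, replicas
FROM crdb_internal.ranges
WHERE range_id = 1001;
```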

irfansharif commented:

I'm going to put this on the SQL schema board. When we divvied up with @ajwerner + friends which parts of the spanconfig package sit where, the parts implicated here landed above KV. Please punt it back if I've gotten it wrong.

blathers-crl bot added the T-sql-schema-deprecated (Use T-sql-foundations instead) label Nov 4, 2022
irfansharif removed the T-kv (KV Team) label Nov 4, 2022
irfansharif commented:

> r1001 won't ever be merged away, as that would require r1000 to be under the size threshold.

It could also be that if the RHS range being merged is "empty", as it would be in this example, we ignore the RHS size threshold, since we're not making anything worse. Perhaps that's what you were getting at, Tobi?


tbg commented Nov 7, 2022

Ah, so you're saying it should get merged away after all? Yeah, if we have that heuristic it should do it, in theory, but the RHS isn't exactly empty (there are always some system keys in it that won't go away until GC triggers, but they don't bump the score, so that will never happen).

```go
rhsDesc, rhsStats, rhsQPS, rhsQPSOK, err := mq.requestRangeStats(ctx, lhsDesc.EndKey.AsRawKey())
if err != nil {
    return false, err
}
if rhsStats.Total() >= minBytes {
    log.VEventf(ctx, 2, "skipping merge: RHS meets minimum size threshold %d with %d bytes",
        minBytes, rhsStats.Total())
    return false, nil
}
```

is what fired above, I think. But even then, if the LHS is just under the split threshold, there is this check we'd hit instead:

```go
conservativeLoadBasedSplitThreshold := 0.5 * lhsRepl.SplitByLoadQPSThreshold()
shouldSplit, _ := shouldSplitRange(ctx, mergedDesc, mergedStats,
    lhsRepl.GetMaxBytes(), lhsRepl.shouldBackpressureWrites(), confReader)
if shouldSplit || mergedQPS >= conservativeLoadBasedSplitThreshold {
    log.VEventf(ctx, 2,
        "skipping merge to avoid thrashing: merged range %s may split "+
            "(estimated size, estimated QPS: %d, %v)",
        mergedDesc, mergedStats.Total(), mergedQPS)
    return false, nil
}
```


ajwerner commented Nov 8, 2022

@tbg can you come back and communicate any sort of priority on this? Nominally we own this, but we don't have a lot of context as a team and we don't have strong opinions. If a speedy outcome is needed, we'll help in any way we can.


tbg commented Nov 13, 2022

No speedy outcome needed, this just doesn't seem quite right. I will assume the customer in https://github.com/cockroachlabs/support/issues/1858 is going to be satisfied once the range has merged away.


knz commented Nov 24, 2022

Here are holes that also need to be considered:

  • keyspace of INACTIVE tenant, e.g. during c2c replication
  • keyspace of a newly created tenant (create_tenant), before the SQL service starts and begins generating span configs
  • keyspace of newly ingested/imported/restored data

irfansharif added a commit to irfansharif/cockroach that referenced this issue Nov 25, 2022
For this setting type:
- the protoutil.Message is held in memory,
- the byte representation is stored in system.settings, and
- the json representation is used when accepting input and rendering
  state (through SHOW CLUSTER SETTING <setting-name>, the raw form is
  visible when looking directly at system.settings)

We also use this setting type to power a
spanconfig.store.fallback_config_override, which overrides the fallback
config used for ranges with no explicit span configs set. Previously we
used a hardcoded value -- this makes it a bit more configurable. This is
a partial and backportable workaround (read: hack) for cockroachdb#91238 and
cockroachdb#91239. In an internal escalation we observed "orphaned" ranges from
dropped tables that were not being referenced by span configs (by
virtue of them originating from now-dropped tables/configs). Typically
ranges of this sort are short-lived, they're emptied out through GC and
merged into LHS neighbors. But if the neighboring ranges are large
enough, or load just high enough, the merge does not occur. For such
orphaned ranges we were using a hardcoded "fallback config", with a
replication factor of three. This made for confusing semantics where if
RANGE DEFAULT was configured to have a replication factor of five, our
replication reports indicated there were under-replicated ranges. This
is partly because replication reports today are powered by zone configs
(thus looking at RANGE DEFAULT) -- this will change shortly as part of
cockroachdb#89987 where we'll instead consider span config data.
In any case, we were warning users of under-replicated ranges but within
KV we were not taking any action to upreplicate them -- KV was
respecting the hard-coded fallback config. The issues above describe
that we should apply each tenant's RANGE DEFAULT config to all such
orphaned ranges, which is probably the right fix. This was alluded to in
an earlier TODO but is still left for future work.

  // TODO(irfansharif): We're using a static[1] fallback span config
  // here, we could instead have this directly track the host tenant's
  // RANGE DEFAULT config, or go a step further and use the tenant's own
  // RANGE DEFAULT instead if the key is within the tenant's keyspace.
  // We'd have to thread that through the KVAccessor interface by
  // reserving special keys for these default configs.
  //
  // [1]: Modulo the private spanconfig.store.fallback_config_override, which
  //      applies globally.

So this PR instead takes a shortcut -- it makes the static config
configurable through a cluster setting. We can now do the following,
which alters what fallback config is applied to orphaned ranges and, in
our example above, forces such ranges to also have a replication factor
of five.

  SET CLUSTER SETTING spanconfig.store.fallback_config_override = '
    {
      "gcPolicy": {"ttlSeconds": 3600},
      "numReplicas": 5,
      "rangeMaxBytes": "536870912",
      "rangeMinBytes": "134217728"
    }';

Release note: None
craig bot pushed a commit that referenced this issue Nov 28, 2022
92466: settings,spanconfig: introduce a protobuf setting type r=irfansharif a=irfansharif

For this setting type:
- the `protoutil.Message` is held in memory,
- the byte representation is stored in `system.settings`, and
- the json representation is used when accepting input and rendering state (through `SHOW CLUSTER SETTING <setting-name>`, the raw form is visible when looking directly at `system.settings`)

We also use this setting type to power a `spanconfig.store.fallback_config_override`, which overrides the fallback config used for ranges with no explicit span configs set. Previously we used a hardcoded value -- this makes it a bit more configurable. This is a partial and backportable workaround (read: hack) for #91238 and #91239.

In an internal escalation we observed "orphaned" ranges from dropped tables that were not being referenced by span configs (by virtue of them originating from now-dropped tables/configs). Typically ranges of this sort are short-lived, they're emptied out through GC and merged into LHS neighbors. But if the neighboring ranges are large enough, or load just high enough, the merge does not occur. For such orphaned ranges we were using a hardcoded "fallback config", with a replication factor of three. This made for confusing semantics where if `RANGE DEFAULT` was configured to have a replication factor of five, our replication reports indicated there were under-replicated ranges. This is partly because replication reports today are powered by zone configs (thus looking at `RANGE DEFAULT`) -- this will change shortly as part of #89987 where we'll instead consider span config data.

In any case, we were warning users of under-replicated ranges but within KV we were not taking any action to upreplicate them -- KV was respecting the hard-coded fallback config. The issues above describe that we should apply each tenant's `RANGE DEFAULT` config to all such orphaned ranges, which is probably the right fix. This was alluded to in an earlier TODO but is still left for future work.

```go
  // TODO(irfansharif): We're using a static[1] fallback span config
  // here, we could instead have this directly track the host tenant's
  // RANGE DEFAULT config, or go a step further and use the tenant's own
  // RANGE DEFAULT instead if the key is within the tenant's keyspace.
  // We'd have to thread that through the KVAccessor interface by
  // reserving special keys for these default configs.
  //
  // [1]: Modulo the private spanconfig.store.fallback_config_override, which
  //      applies globally.
```

So this PR instead takes a shortcut -- it makes the static config configurable through a cluster setting. We can now do the following, which alters what fallback config is applied to orphaned ranges and, in our example above, forces such ranges to also have a replication factor of five.

```sql
  SET CLUSTER SETTING spanconfig.store.fallback_config_override = '
    {
      "gcPolicy": {"ttlSeconds": 3600},
      "numReplicas": 5,
      "rangeMaxBytes": "536870912",
      "rangeMinBytes": "134217728"
    }';
```

Release note: None


92585: roachtest: reduce frequency of index-overload test r=irfansharif a=irfansharif

Release note: None

92586: jobs: refresh job details before removing protected timestamps r=fqazi a=fqazi

Fixes: #92493

Previously, we added the protected timestamps manager into the jobs framework, which made it easier to automatically add and remove protected timestamps for jobs. Unfortunately, the Unprotect API, when invoked from a cleanup function, never properly refreshed the job, which could lead to a race condition when trying to remove protected timestamps for schema changes. To address this, the Unprotect function now takes advantage of the job update function to confirm that a refreshed copy does have protected timestamps set.

Release note: None

92593: logictest: avoid a lint failure r=andreimatei a=knz

Fixes #92592

The Go compiler treats calls to `bazel.BuiltWithBazel()` as a
compile-time constant. Therefore,

```go
if !bazel.BuiltWithBazel() {
   skip.XXX()
}
```

is considered to always terminate execution (because `skip` does its
job by raising a panic), and any code coming after that is treated as
dead/unreachable.

92599: multitenant: merge serverpb.RegionsServer into serverpb.TenantStatusServer r=knz a=ecwall

Fixes #92598

Merge these 2 interfaces in anticipation of future work where more tenant functionality will be added to TenantStatusServer instead of creating an interface per function.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Evan Wall <wall@cockroachlabs.com>

tbg commented Mar 20, 2023

@irfansharif is this issue still current? I think you did something to address it.

irfansharif commented:

> Ideally there would be a way for the system tenant to specify, in some way, the replication factor for anything not referenced by the SQL zone configs.

In 23.1+ this is now possible using:

```sql
SET CLUSTER SETTING spanconfig.store.fallback_config_override = '
  {
    "gcPolicy": {"ttlSeconds": 3600},
    "numReplicas": 5,
    "rangeMaxBytes": "536870912",
    "rangeMinBytes": "134217728"
  }';
```
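
Since the JSON rendering of this proto setting is what SHOW CLUSTER SETTING returns (per the PR description earlier in this thread), the override can be verified directly, for example:

```sql
-- Confirm the fallback override is in place (returns the JSON form).
SHOW CLUSTER SETTING spanconfig.store.fallback_config_override;
```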

In 22.2 and 22.1 it's possible to control num_replicas using the more targeted COCKROACH_FALLBACK_SPANCONFIG_NUM_REPLICAS_OVERRIDE from #93093 + #93092.

Finally, we introduced new "conformance report" code that uses the span config state as authoritative: #90016, addressing #91239. So these kinds of discrepancies, for state no longer referenced by SQL schema, are unlikely to come up anywhere. Closing.

exalate-issue-sync bot added the T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions) label and removed the T-sql-schema-deprecated label May 10, 2023