Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multitenant: tenant creation can unintentionally overwrite another tenant's span configs #95882

Closed
irfansharif opened this issue Jan 25, 2023 · 1 comment · Fixed by #96014
Closed
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@irfansharif
Copy link
Contributor

irfansharif commented Jan 25, 2023

Describe the problem

#95100 introduced a bug in our handling of tenant span configs. It's possible now that the creation of a tenant ends up overwriting another tenant's span config records, and thus affecting zone configs that the original tenant had applied. #95881 is a branch that contains the reproduction through an annotated test case. Specifically:

# Peek near the start of the span_configurations table where tenant=11's records
# are stored. The first one is from the start of its keyspace to start of
# table with ID=4: /Tenant/11{-/Table/4}.
state offset=56 limit=5
----
...
/Table/5{7-8}                              ttl_seconds=3600 ignore_strict_gc=true num_replicas=5 rangefeed_enabled=true
/Tenant/11{-/Table/4}                      database system (tenant)
/Tenant/11/Table/{4-5}                     database system (tenant)
/Tenant/11/Table/{5-6}                     database system (tenant)
/Tenant/11/Table/{6-7}                     database system (tenant)
...

# Now for the bug. We'll create tenant=10, but see that it actually ends up
# overwriting span config state for the already-reconciling tenant=11. That
# tenant's reconciler is not expecting this! It operates with the invariant that
# it's the only writer of span configs for its keyspan, which we're simply
# breaking.
initialize tenant=10
----

# We re-inserted a record at /Tenant/11{-\x00}, and split up tenant=11's own
# span config records in the process. Previously tenant=11's first record was
/Tenant/11{-/Table/4}. Now it's /Tenant/11{-\x00}. If tenant=11 had set some
span config for that keyspan, tenant=10's creation just nuked it.
----
...
/Tenant/10{-\x00}                          database system (tenant)
/Tenant/11{-\x00}                          database system (tenant)
/Tenant/11/Table/{4-5}                     database system (tenant)
/Tenant/11/Table/{5-6}                     database system (tenant)
/Tenant/11/Table/{6-7}                     database system (tenant)
...

To Reproduce

Using #95881:

$ dev test pkg/ccl/spanconfigccl/spanconfigreconcilerccl -f=TestDataDriven/multitenant/bug -v --timeout=10s --rewrite

Expected behavior

Tenant creation should not overwrite another tenant's span configs.

Jira issue: CRDB-23792
Epic: CRDB-23344

Epic CRDB-23344

@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 25, 2023
@irfansharif irfansharif added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Jan 25, 2023
@arulajmani
Copy link
Collaborator

Good catch! @irfansharif and I spoke about this a bit, and I think what we want when creating a tenant is to check if any span configs exist for the next tenant already. We should only create a span config entry at the start of the next tenant's key space if no span config entries exist for it (the patch linked above creates these blindly). This should work, and give us what we want, considering we're using a transactional KVAccessor in the tenant creation path.

@ecwall ecwall added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jan 27, 2023
craig bot pushed a commit that referenced this issue Feb 7, 2023
95370: logictest: add a skip_on_retry directive to the sequences test r=ajwerner a=ajwerner

See [this bors run](https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/8344646?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true)

In the long run, this directive is not great, but it's better than flakes.

Epic: none

Release note: None

96014: multitenant: prevent tenant from overriding next tenant's span config r=arulajmani a=ecwall

Fixes #95882

We want to ensure we have a split at the non-inclusive end of the tenant's
keyspace, which also happens to be the inclusive start of the next
tenant's (with ID=ours+1). If we didn't do anything here, and we were the
tenant with the highest ID thus far, our last range would extend to /Max.
If a tenant with a higher ID was created, when installing its initial span
config record, it would carve itself off (and our last range would only
extend to that next tenant's boundary), but this is a bit awkward for code
that wants to rely on the invariant that ranges for a tenant only extend
to the tenant's well-defined end key.

So how do we ensure this split at this tenant's non-inclusive end key?
Hard splits are controlled by the start keys on span config records[^1],
so we'll try insert one accordingly. But we cannot do this blindly. We
cannot assume that tenants are created in ID order, so it's possible that
the tenant with the next ID is already present + running. If so, it may
already have span config records that start at the key at which we want
to write a span config record for. Over-writing it blindly would be a
mistake -- there's no telling what config that next tenant has associated
for that span. So we'll do something simple -- we'll check transactionally
whether there's anything written already, and if so, do nothing. We
already have the split we need.

[^1]: See ComputeSplitKey in spanconfig.StoreReader.

Release note: None

96111: logictest: extend temp_table test to exercise discard + drop r=ajwerner a=ajwerner

We had bugs with this in earlier releases. This is a forward-port of a test introduced in #96102.

It never failed master, but no reason to ever let this regress.

Epic: none

Release note: None

96125: logictest: frontport test from #96124 r=ajwerner a=ajwerner

This testing is worth having. It does pass.

Epic: None

Release note: None

96206: streamingccl,db-console: add Cross-Cluster Replication metrics page r=adityamaru a=adityamaru

This change adds a new option to the metric dropdown dashboard
for Cross-Cluster Replication. This page will show all the user-facing
metrics relevant to C2C replication.

Informs: #95995

Release note: None

server,db-console: add feature flag for Cross-Cluster Replication dashboard

This change adds a feature flag whose value is determined by the cluster
setting `cross_cluster_replication.enabled`. This feature flag is then used
when rendering the metrics dashboard dropdown options to decide whether to
display the cross-cluster replication dashboard.

This change also deletes the session variable and associated cluster setting
`sql.defaults.experimental_stream_replication.enabled` so as to unify all
interactions with cross-cluster replication under a single cluster setting.

Fixes: #95995

Release note: None

96499: sql: deflake TestParseClientProvidedSessionParameters r=knz a=rafiss

fixes #96492

pgx starts an internal goroutine if a context is cancelled. We need to wait for it to be done in order to avoid a leaked goroutine in the test.

Release note: None

96681: *: Properly support partial UNIQUE WITHOUT INDEX referencing type descs r=Xiang-Gu a=Xiang-Gu

Previously if a partial UNIQUE WITHOUT INDEX references a type descriptor in its predicate, we didn't add back-references in the type descriptor, both in the legacy and declarative schema changer. This commit fixes this.

Fixes #96678
Release note: None


96689: roachtest: return error in StopCockroachGracefullyOnNode r=herkolategan,srosenberg a=renatolabs

That function misleadingly returned an (always nil) error, calling `t.Fatal()` functions in it. The calls to `Stop` have been replaced with calls to `StopE`.

Epic: none

Release note: None

Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
@craig craig bot closed this as completed in 9d4e27a Feb 7, 2023
@exalate-issue-sync exalate-issue-sync bot reopened this Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants