Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: sql: ensure leases aren't used when validing fk in txn #96124

Conversation

ajwerner
Copy link
Contributor

When validating a foreign key in a transaction and the referenced table has been modified in the same transaction, we'd fail to notice this fact and use the regular descriptor leasing protocol to fetch the descriptor. Due to implementation details of the leasing protocol, the behavior is that we'd use a high priority transaction to attempt to lease the descriptor, fail, and then fall back to resolving the correct value using the appropriate transaction. Thus, we did not suffer a correctness problem in the case where the table is newly created. Furthermore, we didn't really have a problem generally when the descriptor is not newly created but the constraint is, because we'd defer the validation to an async job. The hazard is that the transaction would be pushed into the future by the attempt to lease the descriptor, which can lead to restarts, or, worse, if the user transaction used PRIORITY HIGH, we'd end up with a deadlock.

Epic: none

Release justification: Fixes a bug

Release note (bug fix): Fixed a bug where spurious transaction restarts could occur when validating a FOREIGN KEY in the same transaction where the referenced table is modified. If the transaction was running at PRIORITY HIGH, deadlocks could occur.

@ajwerner ajwerner requested a review from a team January 27, 2023 20:45
@blathers-crl
Copy link

blathers-crl bot commented Jan 27, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajwerner Good find, that is an ugly one. :lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jan 27, 2023
This testing is worth having. It does pass.

Epic: None

Release note: None
When validating a foreign key in a transaction and the referenced table
has been modified in the same transaction, we'd fail to notice this fact
and use the regular descriptor leasing protocol to fetch the descriptor.
Due to implementation details of the leasing protocol, the behavior is that
we'd use a high priority transaction to attempt to lease the descriptor,
fail, and then fall back to resolving the correct value using the
appropriate transaction. Thus, we did not suffer a correctness problem
in the case where the table is newly created. Furthermore, we didn't
really have a problem generally when the descriptor is not newly created
but the constraint is, because we'd defer the validation to an async
job. The hazard is that the transaction would be pushed into the future
by the attempt to lease the descriptor, which can lead to restarts, or,
worse, if the user transaction used PRIORITY HIGH, we'd end up with a
deadlock.

Epic: none

Release note (bug fix): Fixed a bug where spurious transaction
restarts could occur when validating a FOREIGN KEY in the same
transaction where the referenced table is modified. If the
transaction was running at PRIORITY HIGH, deadlocks could occur.
@ajwerner ajwerner force-pushed the ajwerner/fix-deadlock-with-fk-validation branch from f1da510 to dfdd1ec Compare January 30, 2023 01:13
@ajwerner ajwerner merged commit 706993a into cockroachdb:release-22.1 Jan 30, 2023
craig bot pushed a commit that referenced this pull request Feb 7, 2023
95370: logictest: add a skip_on_retry directive to the sequences test r=ajwerner a=ajwerner

See [this bors run](https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/8344646?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true)

In the long run, this directive is not great, but it's better than flakes.

Epic: none

Release note: None

96014: multitenant: prevent tenant from overriding next tenant's span config r=arulajmani a=ecwall

Fixes #95882

We want to ensure we have a split at the non-inclusive end of the tenant's
keyspace, which also happens to be the inclusive start of the next
tenant's (with ID=ours+1). If we didn't do anything here, and we were the
tenant with the highest ID thus far, our last range would extend to /Max.
If a tenant with a higher ID was created, when installing its initial span
config record, it would carve itself off (and our last range would only
extend to that next tenant's boundary), but this is a bit awkward for code
that wants to rely on the invariant that ranges for a tenant only extend
to the tenant's well-defined end key.

So how do we ensure this split at this tenant's non-inclusive end key?
Hard splits are controlled by the start keys on span config records[^1],
so we'll try insert one accordingly. But we cannot do this blindly. We
cannot assume that tenants are created in ID order, so it's possible that
the tenant with the next ID is already present + running. If so, it may
already have span config records that start at the key at which we want
to write a span config record for. Over-writing it blindly would be a
mistake -- there's no telling what config that next tenant has associated
for that span. So we'll do something simple -- we'll check transactionally
whether there's anything written already, and if so, do nothing. We
already have the split we need.

[^1]: See ComputeSplitKey in spanconfig.StoreReader.

Release note: None

96111: logictest: extend temp_table test to exercise discard + drop r=ajwerner a=ajwerner

We had bugs with this in earlier releases. This is a forward-port of a test introduced in #96102.

It never failed master, but no reason to ever let this regress.

Epic: none

Release note: None

96125: logictest: frontport test from #96124 r=ajwerner a=ajwerner

This testing is worth having. It does pass.

Epic: None

Release note: None

96206: streamingccl,db-console: add Cross-Cluster Replication metrics page r=adityamaru a=adityamaru

This change adds a new option to the metric dropdown dashboard
for Cross-Cluster Replication. This page will show all the user-facing
metrics relevant to C2C replication.

Informs: #95995

Release note: None

server,db-console: add feature flag for Cross-Cluster Replication dashboard

This change adds a feature flag whose value is determined by the cluster
setting `cross_cluster_replication.enabled`. This feature flag is then used
when rendering the metrics dashboard dropdown options to decide whether to
display the cross-cluster replication dashboard.

This change also deletes the session variable and associated cluster setting
`sql.defaults.experimental_stream_replication.enabled` so as to unify all
interactions with cross-cluster replication under a single cluster setting.

Fixes: #95995

Release note: None

96499: sql: deflake TestParseClientProvidedSessionParameters r=knz a=rafiss

fixes #96492

pgx starts an internal goroutine if a context is cancelled. We need to wait for it to be done in order to avoid a leaked goroutine in the test.

Release note: None

96681: *: Properly support partial UNIQUE WITHOUT INDEX referencing type descs r=Xiang-Gu a=Xiang-Gu

Previously if a partial UNIQUE WITHOUT INDEX references a type descriptor in its predicate, we didn't add back-references in the type descriptor, both in the legacy and declarative schema changer. This commit fixes this.

Fixes #96678
Release note: None


96689: roachtest: return error in StopCockroachGracefullyOnNode r=herkolategan,srosenberg a=renatolabs

That function misleadingly returned an (always nil) error, calling `t.Fatal()` functions in it. The calls to `Stop` have been replaced with calls to `StopE`.

Epic: none

Release note: None

Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Feb 7, 2023
This testing is worth having. It does pass.

Epic: None

Release note: None
rafiss added a commit that referenced this pull request Apr 26, 2023
…22.2-96125

release-22.2: logictest: frontport test from #96124
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants