Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/schemachanger: address bugs with column families #99953

Merged
merged 3 commits into from
Mar 30, 2023

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Mar 29, 2023

This PR addresses the following bugs with column families:

  1. On master and 23.1 after the removal of oprules, we have scenarios where not implemented assertions can be hit for column families when rollbacks occur. These changes add a more concrete assertion that just ensures that the column family is cleaned up, and rules to ensure appropriate sequencing for removal.
  2. When UPDATE/INSERTs were executed concurrently while adding a new column family, we could end up writing to the old primary key with the new column family. In happy path cases where everything was successful, this didn't matter, but if a rollback occurred we would have values left in the old primary index that runtime couldn't handle.
  3. We had no way of validating DML with concurrent schema changes in cases with rollbacks, these modifications add tests and the framework required this case.

Release note (bug fix): Concurrent DML while adding
a new column with a new column family can lead to
corruption in the existing primary index. If a rollback
occurs the table may no longer be accessible.

@fqazi fqazi requested a review from a team March 29, 2023 17:36
@fqazi fqazi requested a review from a team as a code owner March 29, 2023 17:36
@fqazi fqazi requested a review from yuzefovich March 29, 2023 17:36
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

On master and 23.1 after the removal of oprules, we have scenarios where not implemented assertions can be hit for column families when rollbacks occur. These changes add a more concrete assertion that just ensures that the column family is cleaned up, and rules to ensure appropriate sequencing for removal.

Can you file an issue for this and mark it as a GA-blocker?

When UPDATE/INSERTs were executed concurrently while adding a new column family, we could end up writing to the old primary key with the new column family. In happy path cases where everything was successful, this didn't matter, but if a rollback occurred we would have values left in the old primary index that runtime couldn't handle.

This we need to backport, yeah?

@@ -429,6 +429,13 @@ type AddColumnFamily struct {
Name string
}

// AssertColumnFamilyIsRemoved asserts that a column family is removed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. Why are we doing this here and not during descriptor validation? I do see the reason to do it rather than the notImplemented, but let's add more commentary that this is an unusual operation added for defense in depth to cover over the fact that other operations should be doing the work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let me add the justification. My fear here is the disconnect between the element model since we want the transition to be official once it's actually removed.

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 29, 2023

Good find!

On master and 23.1 after the removal of oprules, we have scenarios where not implemented assertions can be hit for column families when rollbacks occur. These changes add a more concrete assertion that just ensures that the column family is cleaned up, and rules to ensure appropriate sequencing for removal.

Can you file an issue for this and mark it as a GA-blocker?

#99796

When UPDATE/INSERTs were executed concurrently while adding a new column family, we could end up writing to the old primary key with the new column family. In happy path cases where everything was successful, this didn't matter, but if a rollback occurred we would have values left in the old primary index that runtime couldn't handle.

This we need to backport, yeah?

Yes this one will need a backport

@yuzefovich yuzefovich requested review from cucaroach and removed request for yuzefovich March 29, 2023 18:14
Previously, dropping column families was not implemented,
when we eliminated oprules, we replaced the ops with
NotImplementedForDrop, which wasn't sufficient for
dropped columns. The column families are cleaned up
when the column type itself is dropped, so we don't
really need to do much here. To address this, this
patch will add code to only assert that either the
table is being dropped or the column family has been
removed earlier.

Fixes: cockroachdb#99796
Release note: None
@fqazi fqazi requested a review from a team as a code owner March 29, 2023 18:42
@fqazi fqazi added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels Mar 29, 2023
Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, 2 of 2 files at r2, 12 of 66 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)


pkg/sql/row/writer.go line 143 at r2 (raw file):

			}
			// Skip any values with a default ID not stored in the primary index,
			// which can happen if we are adding new columns,

nit: sentence ends in comma


pkg/sql/rowenc/index_encoding.go line 1086 at r2 (raw file):

			if !storedColumns.Contains(family.DefaultColumnID) {
				return nil
			}

This needs to be added here too:

// Storage optimization to store DefaultColumnID directly as a value. Also

Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @cucaroach)


pkg/sql/rowenc/index_encoding.go line 1086 at r2 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

This needs to be added here too:

// Storage optimization to store DefaultColumnID directly as a value. Also

Done.

@fqazi fqazi requested a review from cucaroach March 29, 2023 19:08
Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 54 of 66 files at r4, 2 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/rowenc/index_encoding.go line 1086 at r2 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Done.

I don't see it?

Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @cucaroach)


pkg/sql/rowenc/index_encoding.go line 1086 at r2 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

I don't see it?

@cucaroach Sorry about that forgot to push, should be there now

…y index

Previously, if a new column family was added during an add
column and an update/insert occurred concurrently, we could
end up writing to this new column family in any primary index.
This was inadequate because if the primary index did not store
the column, then runtime will have trouble reading data from this
table, since after a rollback the added column / column family
will get cleaned up from the table descriptor. To address this,
this patch avoids writing any columns not stored within an index
descriptor.

Fixes: cockroachdb#99950

Release note (bug fix): Concurrent DML while adding
a new column with a new column family can lead to
corruption in the existing primary index. If a rollback
occurs the table may no longer be accessible.
Copy link
Contributor

@cucaroach cucaroach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Note that I only reviewed the middle commit.

@ajwerner
Copy link
Contributor

This needs release notes

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 30, 2023

@ajwerner Only the UPDATE/INSERT one needs a release note, the column family cleanup is not in the field. Only on master and 23.1

@fqazi
Copy link
Collaborator Author

fqazi commented Mar 30, 2023

@ajwerner @cucaroach TFTR!

Previously, our DML injection tests only covered happy paths
for schema changes. We recently ran into a bug where the old
primary index was corrupted after DML and left in a bad state
visible only after a rollback. To address this, this patch
adds tests for this specific scenario and adds the ability
to test rollbacks with DML injection.

Epic: none

Release note: None
@blathers-crl
Copy link

blathers-crl bot commented Mar 30, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Mar 30, 2023
@fqazi
Copy link
Collaborator Author

fqazi commented Mar 30, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 30, 2023

Build succeeded:

@craig craig bot merged commit 96e8593 into cockroachdb:master Mar 30, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 30, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6ec1cd7 to blathers/backport-release-22.2-99953: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@fqazi fqazi removed the O-community Originated from the community label Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants