-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/schemachanger: address bugs with column families #99953
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find!
On master and 23.1 after the removal of oprules, we have scenarios where not implemented assertions can be hit for column families when rollbacks occur. These changes add a more concrete assertion that just ensures that the column family is cleaned up, and rules to ensure appropriate sequencing for removal.
Can you file an issue for this and mark it as a GA-blocker?
When UPDATE/INSERTs were executed concurrently while adding a new column family, we could end up writing to the old primary key with the new column family. In happy path cases where everything was successful, this didn't matter, but if a rollback occurred we would have values left in the old primary index that runtime couldn't handle.
This we need to backport, yeah?
@@ -429,6 +429,13 @@ type AddColumnFamily struct { | |||
Name string | |||
} | |||
|
|||
// AssertColumnFamilyIsRemoved asserts that a column family is removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird. Why are we doing this here and not during descriptor validation? I do see the reason to do it rather than the notImplemented
, but let's add more commentary that this is an unusual operation added for defense in depth to cover over the fact that other operations should be doing the work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let me add the justification. My fear here is the disconnect between the element model since we want the transition to be official once it's actually removed.
Yes this one will need a backport |
Previously, dropping column families was not implemented, when we eliminated oprules, we replaced the ops with NotImplementedForDrop, which wasn't sufficient for dropped columns. The column families are cleaned up when the column type itself is dropped, so we don't really need to do much here. To address this, this patch will add code to only assert that either the table is being dropped or the column family has been removed earlier. Fixes: cockroachdb#99796 Release note: None
999d485
to
6c4208d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r1, 2 of 2 files at r2, 12 of 66 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
pkg/sql/row/writer.go
line 143 at r2 (raw file):
} // Skip any values with a default ID not stored in the primary index, // which can happen if we are adding new columns,
nit: sentence ends in comma
pkg/sql/rowenc/index_encoding.go
line 1086 at r2 (raw file):
if !storedColumns.Contains(family.DefaultColumnID) { return nil }
This needs to be added here too:
cockroach/pkg/sql/colenc/encode.go
Line 295 in 3f113fc
// Storage optimization to store DefaultColumnID directly as a value. Also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @cucaroach)
pkg/sql/rowenc/index_encoding.go
line 1086 at r2 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
This needs to be added here too:
cockroach/pkg/sql/colenc/encode.go
Line 295 in 3f113fc
// Storage optimization to store DefaultColumnID directly as a value. Also
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 54 of 66 files at r4, 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/sql/rowenc/index_encoding.go
line 1086 at r2 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Done.
I don't see it?
6c4208d
to
8dc6b16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @cucaroach)
pkg/sql/rowenc/index_encoding.go
line 1086 at r2 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
I don't see it?
@cucaroach Sorry about that forgot to push, should be there now
8dc6b16
to
73a1573
Compare
…y index Previously, if a new column family was added during an add column and an update/insert occurred concurrently, we could end up writing to this new column family in any primary index. This was inadequate because if the primary index did not store the column, then runtime will have trouble reading data from this table, since after a rollback the added column / column family will get cleaned up from the table descriptor. To address this, this patch avoids writing any columns not stored within an index descriptor. Fixes: cockroachdb#99950 Release note (bug fix): Concurrent DML while adding a new column with a new column family can lead to corruption in the existing primary index. If a rollback occurs the table may no longer be accessible.
73a1573
to
b644d3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs release notes |
@ajwerner Only the UPDATE/INSERT one needs a release note, the column family cleanup is not in the field. Only on master and 23.1 |
@ajwerner @cucaroach TFTR! |
Previously, our DML injection tests only covered happy paths for schema changes. We recently ran into a bug where the old primary index was corrupted after DML and left in a bad state visible only after a rollback. To address this, this patch adds tests for this specific scenario and adds the ability to test rollbacks with DML injection. Epic: none Release note: None
b644d3b
to
594ad87
Compare
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 6ec1cd7 to blathers/backport-release-22.2-99953: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This PR addresses the following bugs with column families:
Release note (bug fix): Concurrent DML while adding
a new column with a new column family can lead to
corruption in the existing primary index. If a rollback
occurs the table may no longer be accessible.