-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adapter: Refactor RelationDesc
#29231
adapter: Refactor RelationDesc
#29231
Conversation
c112553
to
9e20857
Compare
MitigationsCompleting required mitigations increases Resilience Coverage.
Risk Summary:The pull request has a high risk score of 81, primarily driven by the predictors "Sum Bug Reports Of Files" and "Delta of Executable Lines," along with modifications in one file hotspot. Historically, PRs with these predictors are 109% more likely to cause a bug than the repository baseline. Notably, the observed bug trend in the repository is decreasing. Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity. Bug Hotspots:
|
/// Index into a [`RelationType`] for this column. | ||
typ_idx: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to put ColumnType
here instead of an index, but RelationType
is passed around so much in the compute code that this would be a large refactor
/// Diffs that can be generated proptest and applied to a [`RelationDesc`] to | ||
/// exercise schema migrations. | ||
#[derive(Debug)] | ||
pub enum PropRelationDescDiff { | ||
AddColumn { name: ColumnName, typ: ColumnType }, | ||
DropColumn { name: ColumnName }, | ||
MovePosition { name: ColumnName, new_pos: usize }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only test code, after this change we can't really move the position of a column
src/catalog/src/builtin.rs
Outdated
.with_column("region", ScalarType::String.nullable(true)) | ||
.with_column("access_key_id", ScalarType::String.nullable(true)) | ||
.with_column("access_key_id_secret_id", ScalarType::String.nullable(true)) | ||
.with_column( | ||
"secret_access_key_secret_id", | ||
ScalarType::String.nullable(true), | ||
) | ||
.with_column("session_token", ScalarType::String.nullable(true)) | ||
.with_column("session_token_secret_id", ScalarType::String.nullable(true)) | ||
.with_column("assume_role_arn", ScalarType::String.nullable(true)) | ||
.with_column( | ||
"assume_role_session_name", | ||
ScalarType::String.nullable(true), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: I didn't mean to delete these columns, fixed this in a later commit
src/repr/src/row/encoding2.rs
Outdated
(ColumnName::from("foo"), ScalarType::Int64.nullable(true)), | ||
(ColumnName::from("bar"), ScalarType::String.nullable(false)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to swap the nullability of the columns here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh I did not, sorry about the miss, fixed!
src/repr/src/relation.rs
Outdated
/// Drops the column `name` from this [`RelationDesc`]. If there are multiple columns with | ||
/// `name` drops the left-most one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are multiple columns with
///name
drops the left-most one.
Is it possible to have multiple columns with the same name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is something we want to support, e.g.
CREATE TABLE t1 (a int);
ALTER TABLE t1 DROP COLUMN a;
ALTER TABLE t1 ADD COLUMN a text;
We would keep the original a int
as ColumnIndex(0)
and have a text
as ColumnIndex(1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case if I tried to do something like
CREATE TABLE t1 (a int);
ALTER TABLE t1 DROP COLUMN a;
ALTER TABLE t1 ADD COLUMN a text;
ALTER TABLE ts DROP COLUMN a;
then the second drop would panic on line 988?
Maybe the description should read something like
If there are multiple columns with
name
drops the left-most non-dropped one.
and line 984 should be something like .find(|meta| meta.name == name && meta.dropped.is_none())
// N.B. At this point we need to be careful because col_idx might not | ||
// equal typ_idx. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why's that? I don't think I understood how col_idx and typ_idx could diverge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh because you're not retaining dropped columns in column_metas
but they are still influencing col_idx
and not typ_idx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh because you're not retaining dropped columns in column_metas but they are still influencing col_idx and not typ_idx?
Exactly! Ideally we would track ColumnType
in the BTreeMap<...>
but RelationType
is used so widely in compute that it's a big refactor. I added an example in the comment and this is exercised in the snapshot tests below, let me know if I should add more!
src/repr/src/relation.rs
Outdated
|
||
let relation_type = RelationType { | ||
column_types, | ||
keys: self.inner.typ.keys.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if keys
reference a dropped column or a column that was added after version
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good callout! I don't think we'll allow dropping a column that is a key, at least not v0, so I made that panic. I didn't consider the case of keys getting re-ordered though so thank you for calling that out, I fixed this and added a test case
/// | ||
/// Panics if a constraint is not satisfied. | ||
fn validate(&self) { | ||
fn validate_inner(desc: &RelationDesc) -> Result<(), anyhow::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any validations to add on desc.typ.keys
? Maybe that each index is valid and that a single key doesn't have duplicate indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, added some validation
|
||
/// Metadata (other than type) for a column in a [`RelationDesc`]. | ||
#[derive(Clone, Debug, Eq, PartialEq, Serialize, Deserialize, Hash, MzReflect)] | ||
struct ColumnMetadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions about the data model:
- When do we set
ColumnMetadata::dropped
, and when do we remove the entry from the map entirely? - At what point do we expect the caller to call
version_at
? Does that always happen by the time it's passed to Persist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry if these are redundant; I may be missing context!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great questions!
When do we set ColumnMetadata::dropped
We set ColumnMetadata::dropped
when a user runs ALTER TABLE ... DROP COLUMN ...
. We aren't planning to support this immediately but I figured while I was here I would add support.
...when do we remove the entry from the map entirely?
We won't ever drop the column entirely, but if a user runs at_version(...)
with a version where the column doesn't exist, it won't be included in the returned RelationDesc
s map.
At what point do we expect the caller to call version_at? Does that always happen by the time it's passed to Persist?
This should get called during planning! Persist should never have to worry about calling at_version(...)
. In fact, I put at_version(...)
on a newtype so it shouldn't ever be possible to call at_version(...)
outside of planning
N: Into<ColumnName>, | ||
T: Into<ColumnType>, | ||
{ | ||
let latest_version = self.latest_version(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the cost linear in the number of existing fields. (So adding N fields is quadratic.) Not sure if we care?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we could cache the latest version, but this should only get called when a user runs ALTER TABLE ...
so I'm not concerned about the runtime here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/repr/src/relation.rs
Outdated
/// Drops the column `name` from this [`RelationDesc`]. If there are multiple columns with | ||
/// `name` drops the left-most one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case if I tried to do something like
CREATE TABLE t1 (a int);
ALTER TABLE t1 DROP COLUMN a;
ALTER TABLE t1 ADD COLUMN a text;
ALTER TABLE ts DROP COLUMN a;
then the second drop would panic on line 988?
Maybe the description should read something like
If there are multiple columns with
name
drops the left-most non-dropped one.
and line 984 should be something like .find(|meta| meta.name == name && meta.dropped.is_none())
fef33bb
to
c8a5976
Compare
@jkosh44 good call with the duplicate columns, it was pretty easy to support that. Updated the comments and added tests! |
* refactors RelationDesc to internally have a BTreeMap<ColumnIndex, ColumnMetadata> * track added and dropped versions in ColumnMetadata
…lationDesc * Remove the RelationDesc::with_column method
…with version from before we added the metadata field
* Update comments around handling of columns with the same name * Update handling of keys, make sure we remap keys to handle dropped columns
287fbb5
to
92d8e6a
Compare
This PR refactors
RelationDesc
to support adding columns to tables, it does a few things:RelationDescBuilder
and moves a few methods, namelywith_column(...)
onto the new builder. The goal here was to disambiguate adding a column to a new version of theRelationDesc
and the initial construction of aRelationDesc
.RelationDesc
from aVec<ColumnName>
to aBTreeMap<ColumnIndex, ColumnMetadata>
. There are two primary things this enables:ColumnIndex
ColumnMetadata
tracks the version at which a column was added and optionally dropped.Motivation
Progress towards #28082
Tips for reviewer
These changes are split into two separate commits:
RelationDescBuilder
, this is more or less code movement.RelationDesc
. This shouldn't cause any behavior changes but is a decent amount of changes.Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.