Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adapter: Refactor RelationDesc #29231

Merged

Conversation

ParkMyCar
Copy link
Member

@ParkMyCar ParkMyCar commented Aug 26, 2024

This PR refactors RelationDesc to support adding columns to tables, it does a few things:

  1. Creates a RelationDescBuilder and moves a few methods, namely with_column(...) onto the new builder. The goal here was to disambiguate adding a column to a new version of the RelationDesc and the initial construction of a RelationDesc.
  2. Refactors the internals of RelationDesc from a Vec<ColumnName> to a BTreeMap<ColumnIndex, ColumnMetadata>. There are two primary things this enables:
    1. Stable tracking of column positions with ColumnIndex
    2. ColumnMetadata tracks the version at which a column was added and optionally dropped.

Motivation

Progress towards #28082

Tips for reviewer

These changes are split into two separate commits:

  1. Adding RelationDescBuilder, this is more or less code movement.
  2. Refactoring the internals of RelationDesc. This shouldn't cause any behavior changes but is a decent amount of changes.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ParkMyCar ParkMyCar force-pushed the adapter/refactor-relation-desc branch from c112553 to 9e20857 Compare August 26, 2024 21:28
@ParkMyCar ParkMyCar marked this pull request as ready for review August 26, 2024 21:37
@ParkMyCar ParkMyCar requested review from a team as code owners August 26, 2024 21:37
@ParkMyCar ParkMyCar requested a review from jkosh44 August 26, 2024 21:37
Copy link

shepherdlybot bot commented Aug 26, 2024

Risk Score:81 / 100 Bug Hotspots:1 Resilience Coverage:33%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test 🔍 Detected
Risk Summary:

The pull request has a high risk score of 81, primarily driven by the predictors "Sum Bug Reports Of Files" and "Delta of Executable Lines," along with modifications in one file hotspot. Historically, PRs with these predictors are 109% more likely to cause a bug than the repository baseline. Notably, the observed bug trend in the repository is decreasing.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:
What's This?

File Percentile
../durable/persist.rs 98

Comment on lines +413 to +408
/// Index into a [`RelationType`] for this column.
typ_idx: usize,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to put ColumnType here instead of an index, but RelationType is passed around so much in the compute code that this would be a large refactor

/// Diffs that can be generated proptest and applied to a [`RelationDesc`] to
/// exercise schema migrations.
#[derive(Debug)]
pub enum PropRelationDescDiff {
AddColumn { name: ColumnName, typ: ColumnType },
DropColumn { name: ColumnName },
MovePosition { name: ColumnName, new_pos: usize },
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only test code, after this change we can't really move the position of a column

Comment on lines 3146 to 3159
.with_column("region", ScalarType::String.nullable(true))
.with_column("access_key_id", ScalarType::String.nullable(true))
.with_column("access_key_id_secret_id", ScalarType::String.nullable(true))
.with_column(
"secret_access_key_secret_id",
ScalarType::String.nullable(true),
)
.with_column("session_token", ScalarType::String.nullable(true))
.with_column("session_token_secret_id", ScalarType::String.nullable(true))
.with_column("assume_role_arn", ScalarType::String.nullable(true))
.with_column(
"assume_role_session_name",
ScalarType::String.nullable(true),
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I didn't mean to delete these columns, fixed this in a later commit

Comment on lines 1956 to 1957
(ColumnName::from("foo"), ScalarType::Int64.nullable(true)),
(ColumnName::from("bar"), ScalarType::String.nullable(false)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to swap the nullability of the columns here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh I did not, sorry about the miss, fixed!

Comment on lines 949 to 950
/// Drops the column `name` from this [`RelationDesc`]. If there are multiple columns with
/// `name` drops the left-most one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple columns with
/// name drops the left-most one.

Is it possible to have multiple columns with the same name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is something we want to support, e.g.

CREATE TABLE t1 (a int);
ALTER TABLE t1 DROP COLUMN a;
ALTER TABLE t1 ADD COLUMN a text;

We would keep the original a int as ColumnIndex(0) and have a text as ColumnIndex(1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case if I tried to do something like

CREATE TABLE t1 (a int);
ALTER TABLE t1 DROP COLUMN a;
ALTER TABLE t1 ADD COLUMN a text;
ALTER TABLE ts DROP COLUMN a;

then the second drop would panic on line 988?

Maybe the description should read something like

If there are multiple columns with name drops the left-most non-dropped one.

and line 984 should be something like .find(|meta| meta.name == name && meta.dropped.is_none())

src/repr/src/relation.rs Outdated Show resolved Hide resolved
Comment on lines +1006 to +1045
// N.B. At this point we need to be careful because col_idx might not
// equal typ_idx.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why's that? I don't think I understood how col_idx and typ_idx could diverge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh because you're not retaining dropped columns in column_metas but they are still influencing col_idx and not typ_idx?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh because you're not retaining dropped columns in column_metas but they are still influencing col_idx and not typ_idx?

Exactly! Ideally we would track ColumnType in the BTreeMap<...> but RelationType is used so widely in compute that it's a big refactor. I added an example in the comment and this is exercised in the snapshot tests below, let me know if I should add more!


let relation_type = RelationType {
column_types,
keys: self.inner.typ.keys.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if keys reference a dropped column or a column that was added after version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout! I don't think we'll allow dropping a column that is a key, at least not v0, so I made that panic. I didn't consider the case of keys getting re-ordered though so thank you for calling that out, I fixed this and added a test case

///
/// Panics if a constraint is not satisfied.
fn validate(&self) {
fn validate_inner(desc: &RelationDesc) -> Result<(), anyhow::Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any validations to add on desc.typ.keys? Maybe that each index is valid and that a single key doesn't have duplicate indexes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, added some validation


/// Metadata (other than type) for a column in a [`RelationDesc`].
#[derive(Clone, Debug, Eq, PartialEq, Serialize, Deserialize, Hash, MzReflect)]
struct ColumnMetadata {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions about the data model:

  • When do we set ColumnMetadata::dropped, and when do we remove the entry from the map entirely?
  • At what point do we expect the caller to call version_at? Does that always happen by the time it's passed to Persist?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry if these are redundant; I may be missing context!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great questions!

When do we set ColumnMetadata::dropped

We set ColumnMetadata::dropped when a user runs ALTER TABLE ... DROP COLUMN .... We aren't planning to support this immediately but I figured while I was here I would add support.

...when do we remove the entry from the map entirely?

We won't ever drop the column entirely, but if a user runs at_version(...) with a version where the column doesn't exist, it won't be included in the returned RelationDescs map.

At what point do we expect the caller to call version_at? Does that always happen by the time it's passed to Persist?

This should get called during planning! Persist should never have to worry about calling at_version(...). In fact, I put at_version(...) on a newtype so it shouldn't ever be possible to call at_version(...) outside of planning

N: Into<ColumnName>,
T: Into<ColumnType>,
{
let latest_version = self.latest_version();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the cost linear in the number of existing fields. (So adding N fields is quadratic.) Not sure if we care?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we could cache the latest version, but this should only get called when a user runs ALTER TABLE ... so I'm not concerned about the runtime here

Copy link
Contributor

@jkosh44 jkosh44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 949 to 950
/// Drops the column `name` from this [`RelationDesc`]. If there are multiple columns with
/// `name` drops the left-most one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case if I tried to do something like

CREATE TABLE t1 (a int);
ALTER TABLE t1 DROP COLUMN a;
ALTER TABLE t1 ADD COLUMN a text;
ALTER TABLE ts DROP COLUMN a;

then the second drop would panic on line 988?

Maybe the description should read something like

If there are multiple columns with name drops the left-most non-dropped one.

and line 984 should be something like .find(|meta| meta.name == name && meta.dropped.is_none())

* Add RelationDescBuilder that is used to construct a RelationDesc
* Remove the RelationDesc::with_column method
@ParkMyCar ParkMyCar force-pushed the adapter/refactor-relation-desc branch from fef33bb to c8a5976 Compare August 29, 2024 13:42
@ParkMyCar ParkMyCar enabled auto-merge (squash) August 29, 2024 13:43
@ParkMyCar
Copy link
Member Author

@jkosh44 good call with the duplicate columns, it was pretty easy to support that. Updated the comments and added tests!

* refactors RelationDesc to internally have a BTreeMap<ColumnIndex, ColumnMetadata>
* track added and dropped versions in ColumnMetadata
…lationDesc * Remove the RelationDesc::with_column method
…with version from before we added the metadata field
* Update comments around handling of columns with the same name
* Update handling of keys, make sure we remap keys to handle dropped columns
@ParkMyCar ParkMyCar force-pushed the adapter/refactor-relation-desc branch from 287fbb5 to 92d8e6a Compare August 29, 2024 14:23
@ParkMyCar ParkMyCar merged commit ee73b77 into MaterializeInc:main Aug 29, 2024
83 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants