Type Widening in ALTER TABLE CHANGE COLUMN #2645

Closed

Conversation

johanl-db
Collaborator

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

This change introduces the typeWidening Delta table feature, making it possible to widen the type of existing columns and fields in a Delta table using the ALTER TABLE CHANGE COLUMN TYPE or ALTER TABLE REPLACE COLUMNS commands.

The table feature is introduced as typeWidening-dev during implementation and is available in testing only.

For now, only the byte -> short -> int widening chain is supported. Other changes will require support in the Spark Parquet reader that will be introduced in Spark 4.0.

Type widening feature request: #2622
Type Widening protocol RFC: #2624

How was this patch tested?

A new test suite DeltaTypeWideningSuite is created, containing:

  • DeltaTypeWideningAlterTableTests: Covers applying supported and unsupported type changes on partitioned columns, non-partitioned columns, and nested fields (a sketch of one such case follows after this list)
  • DeltaTypeWideningTableFeatureTests: Covers adding the typeWidening table feature
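
For illustration, here is a hypothetical test of roughly this shape. The table name, values, and the exact error type are illustrative, and sql, withTable, and intercept are assumed to come from the usual QueryTest-style Spark test harness; this is not the actual suite code.

test("unsupported type change int -> long is rejected") {
  withTable("t") {
    sql("CREATE TABLE t (a INT) USING delta " +
      "TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")
    // Only byte -> short -> int is supported by this PR, so int -> long must fail;
    // the exact error class is deliberately not asserted in this sketch.
    intercept[Exception] {
      sql("ALTER TABLE t CHANGE COLUMN a TYPE LONG")
    }
  }
}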

Does this PR introduce any user-facing changes?

The table feature is available in testing only; there are no user-facing changes as of now.

The type widening table feature will introduce the following changes:

  • Enabling type widening via a table property:
ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = true)
  • Applying a widening type change:
ALTER TABLE t CHANGE COLUMN int_col TYPE long

Note: both ALTER TABLE commands reuse the existing syntax for setting a table property and applying a type change; no new SQL syntax is introduced by this feature.
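
As an end-to-end illustration on a Spark session with the Delta extensions configured (the table and column names are illustrative, and the feature is currently available in testing only):

// Create a Delta table with a narrow column type.
spark.sql("CREATE TABLE t (col BYTE) USING delta")
spark.sql("INSERT INTO t VALUES (1)")

// Enable the type widening table feature via the table property.
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = true)")

// Apply a widening type change (byte -> int is within the chain supported by this PR).
// Existing Parquet files keep the narrow physical type and are read back as int.
spark.sql("ALTER TABLE t CHANGE COLUMN col TYPE INT")
spark.sql("INSERT INTO t VALUES (2147483647)")
spark.sql("SELECT col FROM t").show()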

Contributor

@sabir-akhadov sabir-akhadov left a comment

lgtm

Collaborator

@bart-samwel bart-samwel left a comment

Nice! I left a bunch of test feedback.

val isEnabled = DeltaConfigs.ENABLE_TYPE_WIDENING.fromMetaData(metadata)
if (isEnabled && !isSupported(protocol)) {
  throw new IllegalStateException(
    s"Table property '${DeltaConfigs.ENABLE_TYPE_WIDENING.key}' is " +
Collaborator

I guess this should be using the error framework?

Collaborator Author

This should never happen unless there's a bug in the implementation, so I wouldn't give it an error class. We typically wouldn't want to document that error as a user-facing error.
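
As a rough illustration of the distinction being made here (the message text and the DeltaErrors helper below are hypothetical, not code from this PR):

// Internal invariant: only reachable if the implementation itself is buggy,
// so a plain exception without an error class is acceptable.
if (isEnabled && !isSupported(protocol)) {
  throw new IllegalStateException(
    s"Table property '${DeltaConfigs.ENABLE_TYPE_WIDENING.key}' is enabled " +
      "but the protocol does not support the typeWidening table feature.")
}

// A user-facing condition would instead go through the error framework,
// e.g. a hypothetical DeltaErrors helper carrying a documented error class:
// throw DeltaErrors.typeWideningNotSupported(fromType, toType)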


test("row group skipping Short -> Int") {
withSQLConf(
SQLConf.FILES_MAX_PARTITION_BYTES.key -> 1024.toString) {
Collaborator

How do we verify that this actually leads to row group skipping? FWIW, it looks like this should lead to regular min/max data skipping on the Delta log. The max partition bytes setting only affects how we read chunks, i.e., it'll split a Parquet file of >1024 bytes over multiple tasks (1024 bytes each), but if the file is still a single row group, then only one of the tasks will actually read it (the task that gets the "middle" byte of the row group), and the other tasks will simply ignore the row group. And with the current way of writing, each file will have only one row group, so the row group skipping will be equal to the Delta-level file skipping, so it should never skip, really.

Collaborator Author

This test doesn't make much sense here anymore. I added it back when I found the overflow issue in Parquet row group skipping and kept it around, but it's redundant with the test I added when I fixed the issue in Spark:
https://github.com/databricks/runtime/blob/a9d6a9d6191264ced1f0658ff6675b0f30e8e77f/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala#L1216

I'm removing it; I don't think it's worth the effort to get it to accurately cover an issue that the Spark test already covers much better.
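
For reference, a sketch of roughly what a dedicated row-group-skipping setup would need, i.e. multiple row groups per Parquet file so that skipping happens below the Delta file-skipping level. The path, sizes, and the assumption that the Hadoop settings below are honored by the Parquet writer are all illustrative.

// Write with a very small Parquet row group size so a single file ends up with
// several row groups; the exact split still depends on Parquet's flush heuristics.
spark.sparkContext.hadoopConfiguration.set("parquet.block.size", "4096")
spark.sparkContext.hadoopConfiguration.set("parquet.page.size", "1024")
spark.range(0, 30000).coalesce(1)
  .selectExpr("CAST(id AS SHORT) AS a")
  .write.format("delta").save("/tmp/type_widening_rg")

// Enable the feature, widen the column, and run a selective filter; ideally only the
// row groups whose min/max statistics can contain the value are actually read.
spark.sql("ALTER TABLE delta.`/tmp/type_widening_rg` SET TBLPROPERTIES ('delta.enableTypeWidening' = true)")
spark.sql("ALTER TABLE delta.`/tmp/type_widening_rg` CHANGE COLUMN a TYPE INT")
spark.read.format("delta").load("/tmp/type_widening_rg").where("a = 5").collect()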

@@ -625,6 +626,18 @@ object ManagedCommitTableFeature
}
}

object TypeWideningTableFeature extends ReaderWriterFeature(name = "typeWidening-dev")
Contributor

Since we are using -dev right now and it's not ready for users, is it behind the isTesting flag?

Contributor

@tdas tdas left a comment

LGTM.
