Schema Evolution RecordBatch processor #602

Open
wants to merge 6 commits into main

Conversation

sdd
Contributor

@sdd sdd commented Sep 4, 2024

Addresses parts 2 and 3 of #405.

Add support for type promotion and default values to the read pipeline.

  • When the scan includes fields that have undergone type promotion since some of the underlying Parquet files were written, those fields are promoted to the newer type before being returned to the user. For example, suppose a table contained a field "a" of type Float and some rows were written, and the table's schema was then changed so that "a" is now a Double. When a table scan is performed, record batches coming from files written before the type promotion are dynamically converted so that field "a" is of type Double, matching the record batches returned from files written after the schema change.
  • When the scan includes fields that have been added since some of the files were written, record batches from those files are dynamically converted in the same way so that they contain the selected fields that were not present when the file was written. If the column is not required and has no default value, the new column is null for those rows. If the table schema specifies an initial-default-value for the field, all of those rows have that value for the new column instead.
  • If any fields have been renamed, the record batch schemas for rows written before the rename occurred will be rewritten to contain the new field name.
  • If projected_field_ids is provided, the columns in the response will be re-ordered to match the order in the projection.
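The four behaviours above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `Column`, `Type`, `Field`, `FileBatch`, `evolve` and `example` are hypothetical stand-ins for the Arrow and Iceberg types, columns are matched by field id, and returning an error for a missing required field with no default is an assumption of the sketch.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Arrow array (hypothetical, not arrow-rs).
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Float(Vec<Option<f32>>),
    Double(Vec<Option<f64>>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Type {
    Float,
    Double,
}

/// One field of the current table schema (hypothetical shape).
struct Field {
    id: i32,
    name: &'static str,
    ty: Type,
    required: bool,
    initial_default: Option<f64>,
}

/// A batch as read from one data file, with columns keyed by field id.
struct FileBatch {
    num_rows: usize,
    columns: HashMap<i32, Column>,
}

/// Rewrites a file's batch to match the current schema and projection order.
fn evolve(
    batch: &FileBatch,
    schema: &[Field],
    projected_field_ids: &[i32],
) -> Result<Vec<(String, Column)>, String> {
    let mut out = Vec::with_capacity(projected_field_ids.len());
    // Iterating the projection gives the output its column order and
    // drops file columns that were not selected.
    for id in projected_field_ids {
        let field = schema
            .iter()
            .find(|f| f.id == *id)
            .ok_or_else(|| format!("unknown field id {id}"))?;
        let column = match (batch.columns.get(id), field.ty) {
            // Type unchanged since the file was written: pass through.
            (Some(Column::Float(v)), Type::Float) => Column::Float(v.clone()),
            (Some(Column::Double(v)), Type::Double) => Column::Double(v.clone()),
            // Type promotion: widen Float to Double, keeping nulls.
            (Some(Column::Float(v)), Type::Double) => {
                Column::Double(v.iter().map(|x| x.map(f64::from)).collect())
            }
            (Some(Column::Double(_)), Type::Float) => {
                return Err(format!("field {id}: Double to Float is not a promotion"));
            }
            // Field added after the file was written: fill with the
            // default if there is one, otherwise with nulls.
            (None, ty) => {
                if field.required && field.initial_default.is_none() {
                    // Assumption of this sketch, not stated in the PR.
                    return Err(format!("field {id} is required but has no default"));
                }
                match ty {
                    Type::Double => Column::Double(vec![field.initial_default; batch.num_rows]),
                    Type::Float => Column::Float(vec![
                        field.initial_default.map(|d| d as f32);
                        batch.num_rows
                    ]),
                }
            }
        };
        // Taking the name from the current schema covers renamed fields.
        out.push((field.name.to_string(), column));
    }
    Ok(out)
}

/// Old file holds "a" as Float plus an unprojected "c"; "b" was added later.
fn example() -> Vec<(String, Column)> {
    let schema = [
        Field { id: 1, name: "a", ty: Type::Double, required: false, initial_default: None },
        Field { id: 2, name: "b", ty: Type::Double, required: false, initial_default: Some(1.5) },
        Field { id: 3, name: "c", ty: Type::Float, required: false, initial_default: None },
    ];
    let mut columns = HashMap::new();
    columns.insert(1, Column::Float(vec![Some(1.0), None]));
    columns.insert(3, Column::Float(vec![Some(7.0), Some(8.0)]));
    let old_file_batch = FileBatch { num_rows: 2, columns };
    evolve(&old_file_batch, &schema, &[2, 1]).expect("evolution should succeed")
}
```

With the projection `[2, 1]`, `example()` returns "b" first, filled with its default, followed by "a" widened to Double. Column "c" is present in the file but is left out because it was not projected.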

@sdd
Contributor Author

sdd commented Sep 6, 2024

I need to go back to the drawing board on this. The current implementation breaks when not all columns in the file are in the list of projected fields.

@sdd
Contributor Author

sdd commented Sep 9, 2024

OK, I've addressed the problem with projections other than the equivalent of SELECT *. All existing tests are passing, and the new tests have been extended to cover these cases.

@sdd sdd marked this pull request as ready for review September 9, 2024 18:03
@sdd sdd force-pushed the record-batch-evolution-processor branch from 73c6724 to bf6d2ef Compare September 9, 2024 18:08
@sdd
Contributor Author

sdd commented Sep 9, 2024

@liurenjie1024 and @Xuanwo - ready for review when you get a chance

@sdd sdd mentioned this pull request Sep 12, 2024