-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Expose Parquet Schema Adapter #10515
Conversation
Allow users of the Datafusion Parquet physical executor to define how to map parquet schema to the table schema. This can be useful as there can be layers on top of parquet like Delta or Iceberg which may also define the schema and how the schema should evolve.
Thank you @HawaiianSpork -- I triggered the CI checks on this PR and plan to review it carefully in the next day or two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this wonderful contribution @HawaiianSpork -- a very impressive first PR 👏
I had some minor comment suggestions, but I also think this PR could be merged as is
I have some suggestions about code organization that maybe we can do as a follow on PR.
projection, | ||
)) | ||
} | ||
} | ||
|
||
/// The SchemaMapping struct holds a mapping from the file schema to the table schema | ||
/// and any necessary type conversions that need to be applied. | ||
#[cfg(feature = "parquet")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unfortunate that we have to add the
#[cfg(feature = "parquet")]
all over the place
Maybe as a follow on PR we can pull this code into its own module (e.g. datasource/schema_adaptor.rs
for example)
use std::fmt::Debug; | ||
use std::sync::Arc; | ||
|
||
/// Factory of schema adapters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that currently the only user of SchemaAdapter is parquet, but I don't think there is anything parquet specific about the logic here.
What do you think about moving the code (and default impl) somewhere like
datafusion/core/src/datasource/schema_adapter.rs
?
Perhaps we could do that as a follow on PR as the way you have done this PR makes it easy to see what you have changed / not changed 👍
// Create several parquet files in same directoty / table with | ||
// same schema but different metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment valid? This seems like it really makes a two files with column id
and then uses the schema adapter to add a separate column
"+----+--------------+", | ||
"| id | extra_column |", | ||
"+----+--------------+", | ||
"| 1 | foo |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help future readers if you left some comments explaining that the point of the test was to inject a column that doesn't appear in any of the files
@@ -93,6 +95,8 @@ pub struct ParquetExec { | |||
cache: PlanProperties, | |||
/// Options for reading Parquet files | |||
table_parquet_options: TableParquetOptions, | |||
/// Optional user defined schema adapter | |||
schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is about time (not this PR) to create a ParquetExecBuilder
so that new optional arguments can be aded without having to change all callsites.
are we waiting feedbacks from @HawaiianSpork or we good to merge the PR? |
I was waiting to see if they wanted to make any of the suggestions. If not, let's merge it in |
@HawaiianSpork please address the suggestions, you dont need to change the code, just make a response |
Sorry, to keep you hanging and thank you for the quick review. I agree with the suggestions but haven't circled back around to completing. If you want to wait, I will get to the suggestions early next week. Otherwise, I am also ok with this being merged and I can open a follow on PR to address the suggestions. |
This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.
This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.
This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.
* refactor: Move SchemaAdapter from parquet module to data source This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in #10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources. * fix comments surrounding tests to be accurate.
* feat: Expose Parquet Schema Adapter
…he#10680) * refactor: Move SchemaAdapter from parquet module to data source This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources. * fix comments surrounding tests to be accurate.
Allow users of the Datafusion Parquet physical executor to define how to map parquet schema to the table schema.
This can be useful as there can be layers on top of parquet like Delta or Iceberg which may also define the schema and how the schema should evolve.
Which issue does this PR close?
Closes #10398
Rationale for this change
By exposing
SchemaAdapter
downstream consumers can reuseParquetExec
but allow for different interpretations of the data from the parquet.For example, delta-rs keeps the schema separate from the parquet so that schema evolution can be well controlled. The external schema can enrich the data inside the parquet files with missing nested columns or timezone information.
What changes are included in this PR?
Changes
SchemaAdapter
to a public trait and addsSchemaAdapterFactory
to be passed into the constructor ofParquetExec
.Are there any user-facing changes?
This change does expose a new field for
ParquetExec
that can be specified. I was able to reuse the existing documentation for the now public trait.This change adds the optional
schema_adaptor_factory
to theParquetExec
struct. If a client was creating this struct directly they would have to change their code to now specifyNone
for theschema_adaptor_factory
. If a consumer was using the builder they would be unaffected.