Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Expose Parquet Schema Adapter #10515

Merged
merged 3 commits into from
May 17, 2024

Conversation

HawaiianSpork
Copy link
Contributor

Allow users of the Datafusion Parquet physical executor to define how to map parquet schema to the table schema.

This can be useful as there can be layers on top of parquet like Delta or Iceberg which may also define the schema and how the schema should evolve.

Which issue does this PR close?

Closes #10398

Rationale for this change

By exposing SchemaAdapter downstream consumers can reuse ParquetExec but allow for different interpretations of the data from the parquet.

For example, delta-rs keeps the schema separate from the parquet so that schema evolution can be well controlled. The external schema can enrich the data inside the parquet files with missing nested columns or timezone information.

What changes are included in this PR?

Changes SchemaAdapter to a public trait and adds SchemaAdapterFactory to be passed into the constructor of ParquetExec.

Are there any user-facing changes?

This change does expose a new field for ParquetExec that can be specified. I was able to reuse the existing documentation for the now public trait.

This change adds the optional schema_adaptor_factory to the ParquetExec struct. If a client was creating this struct directly they would have to change their code to now specify None for the schema_adaptor_factory. If a consumer was using the builder they would be unaffected.

Allow users of the Datafusion Parquet physical executor to define how to map parquet schema to the table schema.

This can be useful as there can be layers on top of parquet like Delta or Iceberg which may also define the schema
and how the schema should evolve.
@github-actions github-actions bot added the core Core DataFusion crate label May 15, 2024
@alamb
Copy link
Contributor

alamb commented May 15, 2024

Thank you @HawaiianSpork -- I triggered the CI checks on this PR and plan to review it carefully in the next day or two

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this wonderful contribution @HawaiianSpork -- a very impressive first PR 👏

I had some minor comment suggestions, but I also think this PR could be merged as is

I have some suggestions about code organization that maybe we can do as a follow on PR.

projection,
))
}
}

/// The SchemaMapping struct holds a mapping from the file schema to the table schema
/// and any necessary type conversions that need to be applied.
#[cfg(feature = "parquet")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate that we have to add the

#[cfg(feature = "parquet")]

all over the place

Maybe as a follow on PR we can pull this code into its own module (e.g. datasource/schema_adaptor.rs for example)

use std::fmt::Debug;
use std::sync::Arc;

/// Factory of schema adapters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that currently the only user of SchemaAdapter is parquet, but I don't think there is anything parquet specific about the logic here.

What do you think about moving the code (and default impl) somewhere like

datafusion/core/src/datasource/schema_adapter.rs

?

Perhaps we could do that as a follow on PR as the way you have done this PR makes it easy to see what you have changed / not changed 👍

Comment on lines +42 to +43
// Create several parquet files in same directoty / table with
// same schema but different metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment valid? This seems like it really makes a two files with column id and then uses the schema adapter to add a separate column

"+----+--------------+",
"| id | extra_column |",
"+----+--------------+",
"| 1 | foo |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might help future readers if you left some comments explaining that the point of the test was to inject a column that doesn't appear in any of the files

@@ -93,6 +95,8 @@ pub struct ParquetExec {
cache: PlanProperties,
/// Options for reading Parquet files
table_parquet_options: TableParquetOptions,
/// Optional user defined schema adapter
schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is about time (not this PR) to create a ParquetExecBuilder so that new optional arguments can be aded without having to change all callsites.

@comphead
Copy link
Contributor

are we waiting feedbacks from @HawaiianSpork or we good to merge the PR?

@alamb
Copy link
Contributor

alamb commented May 17, 2024

are we waiting feedbacks from @HawaiianSpork or we good to merge the PR?

I was waiting to see if they wanted to make any of the suggestions. If not, let's merge it in

@comphead
Copy link
Contributor

@HawaiianSpork please address the suggestions, you dont need to change the code, just make a response

@HawaiianSpork
Copy link
Contributor Author

Sorry, to keep you hanging and thank you for the quick review. I agree with the suggestions but haven't circled back around to completing. If you want to wait, I will get to the suggestions early next week. Otherwise, I am also ok with this being merged and I can open a follow on PR to address the suggestions.

@comphead comphead merged commit 4e55768 into apache:main May 17, 2024
23 checks passed
@HawaiianSpork HawaiianSpork deleted the expose-schema-adapter branch May 17, 2024 20:27
HawaiianSpork added a commit to relativityone/datafusion that referenced this pull request May 26, 2024
This is not a change in behavior except moving the public location of SchemaAdapter.  SchemaAdapter was exposed
in apache#10515 to allow callers to define their own implementation.  This PR then changes the location so that it could be used in other
data sources.
HawaiianSpork added a commit to relativityone/datafusion that referenced this pull request May 26, 2024
This is not a change in behavior except moving the public location of SchemaAdapter.  SchemaAdapter was exposed
in apache#10515 to allow callers to define their own implementation.  This PR then changes the location so that it could be used in other
data sources.
HawaiianSpork added a commit to relativityone/datafusion that referenced this pull request May 26, 2024
This is not a change in behavior except moving the public location of SchemaAdapter.  SchemaAdapter was exposed
in apache#10515 to allow callers to define their own implementation.  This PR then changes the location so that it could be used in other
data sources.
alamb pushed a commit that referenced this pull request May 27, 2024
* refactor: Move SchemaAdapter from parquet module to data source

This is not a change in behavior except moving the public location of SchemaAdapter.  SchemaAdapter was exposed
in #10515 to allow callers to define their own implementation.  This PR then changes the location so that it could be used in other
data sources.

* fix comments surrounding tests to be accurate.
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: Expose Parquet Schema Adapter
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…he#10680)

* refactor: Move SchemaAdapter from parquet module to data source

This is not a change in behavior except moving the public location of SchemaAdapter.  SchemaAdapter was exposed
in apache#10515 to allow callers to define their own implementation.  This PR then changes the location so that it could be used in other
data sources.

* fix comments surrounding tests to be accurate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support custom SchemaAdapter on ParquetExec
3 participants