feat: Expose Parquet Schema Adapter #10515

HawaiianSpork · 2024-05-15T03:26:13Z

Allow users of the DataFusion Parquet physical executor to define how the Parquet file schema is mapped to the table schema.

This is useful because layers on top of Parquet, such as Delta or Iceberg, may also define the schema and how it should evolve.

Which issue does this PR close?

Closes #10398

Rationale for this change

By exposing SchemaAdapter, downstream consumers can reuse ParquetExec while allowing different interpretations of the data in the Parquet files.

For example, delta-rs keeps the schema separate from the parquet so that schema evolution can be well controlled. The external schema can enrich the data inside the parquet files with missing nested columns or timezone information.
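As a rough illustration of that kind of enrichment, the sketch below reconciles a file schema against an externally defined table schema. Everything in it (`DataType`, `Field`, `Action`, `plan_mapping`) is a hypothetical, std-only stand-in rather than the Arrow, DataFusion, or delta-rs API, and it is only one possible way to model the decision per column.

```rust
// Hypothetical, simplified types: NOT the Arrow or DataFusion definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Timestamp { tz: Option<String> },
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// What to do for each column of the table schema when reading one file.
#[derive(Debug, PartialEq)]
enum Action {
    /// The file column already has the table's type.
    Keep { file_index: usize },
    /// The file column exists but must be converted (e.g. to add a timezone).
    Cast { file_index: usize, to: DataType },
    /// The file has no such column, so it is filled with nulls.
    FillNull,
}

/// Produces one action per table column, in table-schema order.
fn plan_mapping(file: &[Field], table: &[Field]) -> Vec<Action> {
    table
        .iter()
        .map(|t| match file.iter().position(|f| f.name == t.name) {
            Some(i) if file[i].data_type == t.data_type => Action::Keep { file_index: i },
            Some(i) => Action::Cast { file_index: i, to: t.data_type.clone() },
            None => Action::FillNull,
        })
        .collect()
}
```

In this model the external schema is the source of truth: a file written before a column existed still reads cleanly, and a timestamp stored without a timezone can be reinterpreted with the one the table declares.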

What changes are included in this PR?

Changes SchemaAdapter to a public trait and adds SchemaAdapterFactory to be passed into the constructor of ParquetExec.
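The shape of this change can be sketched roughly as follows. This is a std-only approximation: the trait names mirror the PR, but the method names (`create`, `map_batch`), the `Schema` and `Batch` aliases, the `Scan` struct, and the null-filling default are all assumptions made for illustration, not the actual DataFusion signatures.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

/// Simplified stand-in for an Arrow schema: an ordered list of column names.
type Schema = Vec<String>;
/// Simplified stand-in for a record batch: column name -> nullable values.
type Batch = HashMap<String, Vec<Option<String>>>;

/// Hypothetical adapter: rewrites a batch read from one file so that it
/// matches the table schema.
trait SchemaAdapter: Send + Sync {
    fn map_batch(&self, file_batch: &Batch, num_rows: usize) -> Batch;
}

/// Hypothetical factory: builds an adapter for a given table schema.
trait SchemaAdapterFactory: Debug + Send + Sync {
    fn create(&self, table_schema: Schema) -> Box<dyn SchemaAdapter>;
}

/// Assumed default behavior for this sketch: missing columns become nulls.
#[derive(Debug)]
struct NullFillingFactory;

struct NullFillingAdapter {
    table_schema: Schema,
}

impl SchemaAdapterFactory for NullFillingFactory {
    fn create(&self, table_schema: Schema) -> Box<dyn SchemaAdapter> {
        Box::new(NullFillingAdapter { table_schema })
    }
}

impl SchemaAdapter for NullFillingAdapter {
    fn map_batch(&self, file_batch: &Batch, num_rows: usize) -> Batch {
        self.table_schema
            .iter()
            .map(|name| {
                // Reuse the file's column if present, otherwise fill with nulls.
                let column = file_batch
                    .get(name)
                    .cloned()
                    .unwrap_or_else(|| vec![None; num_rows]);
                (name.clone(), column)
            })
            .collect()
    }
}

/// Simplified scan operator holding an optional user-supplied factory.
struct Scan {
    table_schema: Schema,
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl Scan {
    fn read(&self, file_batch: &Batch, num_rows: usize) -> Batch {
        // Fall back to the default factory when the caller passed `None`.
        let factory: Arc<dyn SchemaAdapterFactory> = match &self.schema_adapter_factory {
            Some(f) => Arc::clone(f),
            None => Arc::new(NullFillingFactory),
        };
        factory
            .create(self.table_schema.clone())
            .map_batch(file_batch, num_rows)
    }
}
```

Passing a factory rather than an adapter lets the operator build a fresh adapter wherever it needs one, while callers that pass `None` keep the built-in behavior.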

Are there any user-facing changes?

This change does expose a new field for ParquetExec that can be specified. I was able to reuse the existing documentation for the now public trait.

This change adds the optional schema_adapter_factory field to the ParquetExec struct. A client that creates this struct directly will have to change its code to specify None for schema_adapter_factory. A consumer that uses the builder is unaffected.

alamb · 2024-05-15T19:12:39Z

Thank you @HawaiianSpork -- I triggered the CI checks on this PR and plan to review it carefully in the next day or two

alamb

Thank you for this wonderful contribution @HawaiianSpork -- a very impressive first PR 👏

I had some minor comment suggestions, but I also think this PR could be merged as is

I have some suggestions about code organization that maybe we can do as a follow on PR.

alamb · 2024-05-15T19:46:01Z

datafusion/core/src/datasource/physical_plan/mod.rs

            projection,
        ))
    }
 }

 /// The SchemaMapping struct holds a mapping from the file schema to the table schema
 /// and any necessary type conversions that need to be applied.
+#[cfg(feature = "parquet")]


It is unfortunate that we have to add #[cfg(feature = "parquet")] all over the place

Maybe as a follow on PR we can pull this code into its own module (e.g. datasource/schema_adaptor.rs)

alamb · 2024-05-15T19:48:31Z

datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs

+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// Factory of schema adapters.


I know that currently the only user of SchemaAdapter is parquet, but I don't think there is anything parquet specific about the logic here.

What do you think about moving the code (and default impl) somewhere like datafusion/core/src/datasource/schema_adapter.rs?

Perhaps we could do that as a follow on PR as the way you have done this PR makes it easy to see what you have changed / not changed 👍

alamb · 2024-05-15T19:50:02Z

datafusion/core/tests/parquet/schema_adapter.rs

+    // Create several parquet files in same directoty / table with
+    // same schema but different metadata


Is this comment valid? It seems like the test really makes two files with a column id and then uses the schema adapter to add a separate column

alamb · 2024-05-15T19:51:44Z

datafusion/core/tests/parquet/schema_adapter.rs

+        "+----+--------------+",
+        "| id | extra_column |",
+        "+----+--------------+",
+        "| 1  | foo          |",


It might help future readers if you left some comments explaining that the point of the test was to inject a column that doesn't appear in any of the files
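To make the intent of the test concrete, here is a small std-only sketch of the idea: the file contains only `id`, and a custom mapping step appends `extra_column` with a constant value. The `Batch` type and the `inject_constant_column` and `render` helpers are hypothetical and much simpler than the real test, which goes through ParquetExec and Arrow record batches.

```rust
/// Hypothetical batch for this sketch: named string columns, in order.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<String>)>,
}

impl Batch {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }
}

/// Appends a column that does not exist in the file, filled with `value`.
fn inject_constant_column(batch: &Batch, name: &str, value: &str) -> Batch {
    let mut out = batch.clone();
    out.columns
        .push((name.to_string(), vec![value.to_string(); batch.num_rows()]));
    out
}

/// Renders the batch as an ASCII table (widths are byte lengths, so ASCII only).
fn render(batch: &Batch) -> Vec<String> {
    fn row(cells: &[&str], widths: &[usize]) -> String {
        let padded: Vec<String> = cells
            .iter()
            .zip(widths)
            .map(|(cell, width)| format!(" {:<width$} ", cell, width = *width))
            .collect();
        format!("|{}|", padded.join("|"))
    }

    // Each column is as wide as its longest value or its header.
    let widths: Vec<usize> = batch
        .columns
        .iter()
        .map(|(name, values)| {
            values
                .iter()
                .map(|v| v.len())
                .chain(std::iter::once(name.len()))
                .max()
                .unwrap_or(0)
        })
        .collect();
    let border = format!(
        "+{}+",
        widths
            .iter()
            .map(|w| "-".repeat(w + 2))
            .collect::<Vec<_>>()
            .join("+")
    );

    let header: Vec<&str> = batch.columns.iter().map(|(n, _)| n.as_str()).collect();
    let mut lines = vec![border.clone(), row(&header, &widths), border.clone()];
    for i in 0..batch.num_rows() {
        let cells: Vec<&str> = batch.columns.iter().map(|(_, v)| v[i].as_str()).collect();
        lines.push(row(&cells, &widths));
    }
    lines.push(border);
    lines
}
```

Feeding a one-row batch with only `id = 1` through `inject_constant_column(&batch, "extra_column", "foo")` and then `render` yields the same header and first data row as the expected output quoted above.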

alamb · 2024-05-15T19:53:40Z

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

@@ -93,6 +95,8 @@ pub struct ParquetExec {
    cache: PlanProperties,
    /// Options for reading Parquet files
    table_parquet_options: TableParquetOptions,
+    /// Optional user defined schema adapter
+    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,


I think it is about time (not this PR) to create a ParquetExecBuilder so that new optional arguments can be added without having to change all callsites.
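A builder along those lines might look something like the sketch below. No ParquetExecBuilder exists in this PR, so every name here (`ParquetScan`, `ParquetScanBuilder`, the `with_*` methods, and the default batch size of 8192) is hypothetical and only illustrates why optional arguments are easier to add this way.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Stand-in for the real trait; methods are omitted in this sketch.
trait SchemaAdapterFactory: Debug + Send + Sync {}

#[derive(Debug)]
struct CustomFactory;
impl SchemaAdapterFactory for CustomFactory {}

/// Hypothetical, heavily simplified scan operator.
#[derive(Debug)]
struct ParquetScan {
    files: Vec<String>,
    batch_size: usize,
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

/// Hypothetical builder: required arguments go to `new`, optional ones
/// get a default and a `with_*` setter.
#[derive(Debug)]
struct ParquetScanBuilder {
    files: Vec<String>,
    batch_size: usize,
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl ParquetScanBuilder {
    fn new(files: Vec<String>) -> Self {
        Self {
            files,
            batch_size: 8192, // assumed default, for illustration only
            schema_adapter_factory: None,
        }
    }

    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }

    fn with_schema_adapter_factory(mut self, factory: Arc<dyn SchemaAdapterFactory>) -> Self {
        self.schema_adapter_factory = Some(factory);
        self
    }

    fn build(self) -> ParquetScan {
        ParquetScan {
            files: self.files,
            batch_size: self.batch_size,
            schema_adapter_factory: self.schema_adapter_factory,
        }
    }
}
```

With this shape, a call site such as `ParquetScanBuilder::new(files).build()` keeps compiling when another optional field is added later; only callers that want the new option add a `with_*` call.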

comphead · 2024-05-17T15:45:16Z

Are we waiting for feedback from @HawaiianSpork, or are we good to merge the PR?

alamb · 2024-05-17T16:29:51Z

Are we waiting for feedback from @HawaiianSpork, or are we good to merge the PR?

I was waiting to see if they wanted to make any of the suggestions. If not, let's merge it in

comphead · 2024-05-17T16:33:00Z

@HawaiianSpork please address the suggestions. You don't need to change the code, just respond to them.

HawaiianSpork · 2024-05-17T20:07:25Z

Sorry to keep you hanging, and thank you for the quick review. I agree with the suggestions but haven't circled back around to completing them. If you want to wait, I will get to the suggestions early next week. Otherwise, I am also ok with this being merged, and I can open a follow on PR to address the suggestions.

* refactor: Move SchemaAdapter from parquet module to data source

  This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in #10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.

* fix comments surrounding tests to be accurate.

* feat: Expose Parquet Schema Adapter

github-actions bot added the core Core DataFusion crate label May 15, 2024

HawaiianSpork added 2 commits May 14, 2024 22:57

Fix building with no-default features

e0b4293

clippy

ae0a6bc

alexwilcoxson-rel mentioned this pull request May 15, 2024

feat: Expose Parquet Schema Adapter relativityone/datafusion#2

Merged

alamb mentioned this pull request May 15, 2024

DataFusion weekly project plan (Andrew Lamb) - May 13, 2024 #10482

Closed

8 tasks

alamb approved these changes May 15, 2024

comphead merged commit 4e55768 into apache:main May 17, 2024
23 checks passed

HawaiianSpork deleted the expose-schema-adapter branch May 17, 2024 20:27

HawaiianSpork mentioned this pull request May 26, 2024

refactor: Move SchemaAdapter from parquet module to data source #10680

Merged

HawaiianSpork mentioned this pull request Jun 24, 2024

Allow providing Arrow schema when scanning Parquet files #5950

Open

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024

feat: Expose Parquet Schema Adapter (apache#10515)

76230d1

* feat: Expose Parquet Schema Adapter

nrc mentioned this pull request Aug 15, 2024

Allow suppling a table schema to ParquetExec #12010

Open
