
[SPARK-43039][SQL] Support custom fields in the file source _metadata column. #40677

Conversation

ryan-johnson-databricks
Contributor

What changes were proposed in this pull request?

Allow FileFormat instances to define the schema of the _metadata column they expose.
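For illustration, a minimal sketch of the kind of override this enables, assuming the metadataSchemaFields hook and the FileSourceConstantMetadataStructField helper discussed later in this PR (the base class choice, field name, and import paths are assumptions, not part of this change):

```scala
import org.apache.spark.sql.catalyst.expressions.FileSourceConstantMetadataStructField
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.types.{StringType, StructField}

// Hypothetical custom file format that adds its own constant field to the
// file source _metadata column. Only metadataSchemaFields and the
// FileSourceConstantMetadataStructField helper come from this PR; the class
// name and "ingest_batch_id" field are made up for illustration.
class MyFileFormat extends ParquetFileFormat {
  override def metadataSchemaFields: Seq[StructField] =
    super.metadataSchemaFields :+
      FileSourceConstantMetadataStructField("ingest_batch_id", StringType, nullable = true)
}
```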

Why are the changes needed?

Today, the schema of the file source _metadata column depends on the file format (e.g., the Parquet file format supports _metadata.row_index), but this is hard-wired into the FileFormat itself. Not only is this an ugly design, it also prevents custom file formats from adding their own fields to the _metadata column.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

@@ -264,9 +261,11 @@ object FileFormat {
      fileSize: Long,
      fileBlockStart: Long,
      fileBlockLength: Long,
-     fileModificationTime: Long): InternalRow = {
+     fileModificationTime: Long,
+     otherConstantMetadataColumnValues: Map[String, Any]): InternalRow = {
Contributor

How is otherConstantMetadataColumnValues generated? FileFormat doesn't have an API for it.

Contributor Author

See the unit tests -- FileIndex.listFiles is responsible for providing it as part of the PartitionDirectory it creates for each file.
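Roughly, that wiring looks like the sketch below (hypothetical: the delegating index, metadata key, and the copy(...) call are assumptions layered on the FileStatusWithMetadata and PartitionDirectory shapes in this diff):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.{FileIndex, FileStatusWithMetadata, PartitionDirectory}
import org.apache.spark.sql.types.StructType

// Hypothetical FileIndex that wraps another index and attaches a constant
// per-file metadata entry, which the file source can then surface as a
// custom _metadata field. Names and the copy(...) call are assumptions.
class MetadataAttachingFileIndex(delegate: FileIndex) extends FileIndex {
  override def listFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
    delegate.listFiles(partitionFilters, dataFilters).map { dir =>
      dir.copy(files = dir.files.map { f =>
        FileStatusWithMetadata(f.fileStatus, Map("ingest_batch_id" -> "batch-0042"))
      })
    }
  }

  // The remaining members simply forward to the wrapped index.
  override def rootPaths: Seq[Path] = delegate.rootPaths
  override def inputFiles: Array[String] = delegate.inputFiles
  override def refresh(): Unit = delegate.refresh()
  override def sizeInBytes: Long = delegate.sizeInBytes
  override def partitionSchema: StructType = delegate.partitionSchema
}
```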

@HyukjinKwon HyukjinKwon changed the title [SPARK-43039] Support custom fields in the file source _metadata column. [SPARK-43039][SQL] Support custom fields in the file source _metadata column. Apr 10, 2023
case _: LongType | _: IntegerType | _: ShortType | _: ByteType => true
case _: DoubleType | _: FloatType => true
case _: StringType => true
case _: TimestampType => true // really just Long
Member

Should we also add DateType (int), DayTimeIntervalType (long) and YearMonthIntervalType (int)?

Contributor Author

Good catch. Fixed by using Literal, PhysicalType, and ColumnVectorUtils.populate, which also has the nice side effect of simplifying the code.

*/
def createFileMetadataCol(): AttributeReference = {
// Strip out the fields' metadata to avoid exposing it to the user. [[FileSourceStrategy]]
// avoids confusion by mapping back to [[metadataSchemaFields]].
Member

Nit, but in these regular comments we could just use backticks. [[...]] is Scaladoc syntax, not regular comment syntax.

Contributor Author

I personally find the brackets more readable (and my editor likes them better than backticks as well).
Is there a rule against using them in normal comments?

// Other metadata columns use the file-provided value (if any). As a courtesy, convert any
// normal strings to the required [[UTF8String]].
//
// TODO(frj): Do we need to potentially support value-producing functions?
Member

Could we file a JIRA and fix it like TODO(SPARK-XXXXX) instead?

Contributor Author

Removing for now, because the TODO doesn't add any meaningful information.
If/when the need arises, it will be obvious enough that this code needs to change.

* custom file-constant metadata columns, but in general tasks and readers can use the per-file
* metadata however they see fit.
*/
case class FileStatusWithMetadata(fileStatus: FileStatus, metadata: Map[String, Any] = Map.empty) {
Contributor

Let's think more about the API design. I think it's too fragile to use Any in the API without a well-defined rule for which values are actually allowed.

I'd suggest using Map[String, Literal]. Then we can remove def isSupportedType as all types can be supported.

Contributor Author

See my TODO above... we may need to consider supporting value-producing functions, to allow full pruning in cases where the value is somehow expensive to compute. Requiring Literal would block that (and AFAIK only Any could capture both Literal and () => Literal).

The FILE_PATH case that calls Path.toString, and the call sites of PartitionedFile, are a small example of the possibility that got me thinking -- what if, instead of passing length, path, etc. as arguments, we just passed the actual file status and used the extractors on it? Probably doesn't make sense to actually do that for the hard-wired cases, though.

Contributor Author

I do like the idea of supporting Literal as one of the supported cases -- it simplifies the type checking a bit, in that the "supported" primitive types are merely those for which the implementation will automatically create the Literal wrapper as a courtesy (similar to string vs. UTF8String).

Contributor Author

Update: I remember now another reason why I had added isSupportedDataType -- ConstantColumnVector (needed by FileScanRDD...createMetadataColumnVector below) supports a limited subset of types, and relies on type-specific getters and setters. Even if I wrote the (complex recursive) code to handle structs, maps, and arrays... we still wouldn't have complete coverage for all types.

Do we know for certain that ConstantColumnVector supports all types that can ever be encountered during vectorized execution? If not, we must keep the isSupportedDataType method I introduced, regardless of whether we choose to add support for metadata fields with complex types in this PR.

Contributor Author

Update: ConstantColumnVector looks like an incompletely implemented API... it "supports" array/map/struct on the surface (e.g. ConstantColumnVectorSuite has superficial tests for it), but e.g. ColumnVectorUtils.populate doesn't actually handle them and ColumnVectorUtilsSuite.scala has negative tests to verify that they cannot be used in practice.

As far as I can tell, the class really only supports data types that can be used as partition columns.

Contributor Author

Updated the doc comment here to explain that file-source metadata fields are only one possible use of the extra file metadata (which is conceptually at a deeper layer than catalyst and Literal).

Also updated the isSupportedType doc comment to explain why not all types are supported.

Relevant implementation details:

  1. It would take a lot of work to support all data types, regardless of whether we use Literal vs. Any.
  2. Either way, we end up wrapping the provided value in a call to Literal(_), because doing so simplifies null handling by making null-because-missing equivalent to null-because-null. At that point, we get wrapping of primitive values "for free" if we happen to pass Any instead.
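A simplified illustration of that wrapping (not the actual PR code; the helper below is hypothetical):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper showing the idea: a missing value and an explicit null
// both collapse to a null Literal, Literals pass through unchanged, and plain
// Scala/Java values are promoted to Literals as a courtesy.
def toConstantLiteral(value: Option[Any], dataType: DataType): Literal = value match {
  case None | Some(null)  => Literal(null, dataType)                        // missing == null
  case Some(lit: Literal) => lit                                            // already wrapped
  case Some(s: String)    => Literal(UTF8String.fromString(s), StringType)  // courtesy conversion
  case Some(other)        => Literal(other)                                 // supported primitives
}
```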

case PhysicalNullType => true
case PhysicalBooleanType => true
case PhysicalByteType | PhysicalShortType | PhysicalIntegerType | PhysicalLongType => true
case PhysicalFloatType | PhysicalDoubleType => true
Contributor

nit: case _: PhysicalPrimitiveType => true

Contributor Author

Hmm... it's currently true that ColumnVectorUtils.populate supports all physical primitive types, but somebody neglected to make the latter a sealed trait. I'll fix that and simplify the match here.

* NOTE: It is not possible to change the semantics of the base metadata fields by overriding this
* method. Technically, a file format could choose to suppress them, but that is not recommended.
*/
def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
Contributor

I'm wondering if we should have 2 APIs:

def constantMetadataColumns: Seq[StructField]
def generatedMetadataColumns: Seq[StructField]

Then Spark can add the metadata fields itself, which means less work for the implementations.

Contributor

Thinking about it more, how can a file source define custom constant metadata columns? The file listing logic is shared for all file sources and I can't think of a way to customize it for certain file sources.

Contributor Author

It needs a custom FileIndex to go with the FileFormat (see the unit test for an example).

Contributor

Got it. How about my first comment? Or do we expect the implementations to properly separate constant and generated metadata columns by using those util objects?

Contributor Author

Could you elaborate on what we gain by splitting out the two lists? For generated columns, in particular, we must use the helper object because the user should specify the physical column name to use.

Contributor

I'm trying to make it easier for third-party file sources to implement the new functions. The fewer internal details we expose through the API, the more API stability we have.

Contributor

but we don't have a choice here. The implementation needs to specify the physical column name, and we must expose these details.

Contributor Author

At least the surface is pretty minimal (nobody needs to know the specific metadata tags that get used): Instead of saying

StructField(name, dataType, nullable)

they pick one of:

FileSourceConstantMetadataStructField(name, dataType, nullable)
FileSourceGeneratedMetadataStructField(name, internalName, dataType, nullable)
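For example, a hypothetical metadataSchemaFields override using both helpers (the field names and internal column name are invented; only the two signatures above come from the PR):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField}

// Hypothetical override: "ingest_batch_id" is a constant value supplied per
// file by the FileIndex; "record_offset" is generated by the reader and
// surfaced under an internal column name chosen purely for illustration.
override def metadataSchemaFields: Seq[StructField] =
  super.metadataSchemaFields ++ Seq(
    FileSourceConstantMetadataStructField("ingest_batch_id", StringType, nullable = true),
    FileSourceGeneratedMetadataStructField(
      "record_offset", "_tmp_record_offset", LongType, nullable = false))
```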

@cloud-fan
Contributor

thanks, merging to master!

tdas pushed a commit to delta-io/delta that referenced this pull request Apr 23, 2024
…sting DV Information (#2888)


#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Previously, we relied on an [expensive broadcast of DV files](#1542) to pass the DV files to the associated Parquet files. With support for [adding custom metadata to files](apache/spark#40677), introduced in Spark 3.5, we can now pass the DV through the custom metadata field, which is expected to improve the performance of DV reads in Delta.
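Purely as an illustration of that idea (not the actual Delta code; the key and value are made up), the per-file DV information rides along with the file listing instead of a broadcast:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.execution.datasources.FileStatusWithMetadata

// Hypothetical sketch: the Delta file index attaches the DV descriptor for a
// Parquet file as custom file metadata rather than broadcasting it.
val parquetFileStatus = new FileStatus(
  1024L, false, 1, 128L * 1024 * 1024, 0L,
  new Path("s3://bucket/table/part-00000.parquet"))
val fileWithDv = FileStatusWithMetadata(
  parquetFileStatus,
  Map("__delta_dv_descriptor" -> "<serialized DV descriptor>"))
```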
## How was this patch tested?

Adjusted the existing UTs that cover our changes.
## Does this PR introduce _any_ user-facing changes?
No.