[SPARK-43039][SQL] Support custom fields in the file source _metadata column. #40677
Nit, but in these regular comments we could just use backticks. `[[...]]` is the syntax for Scaladoc (not for regular comments).
I personally find the brackets more readable (and my editor likes them better than backticks as well).
Is there a rule against using them in normal comments?
I'm wondering if we should have 2 APIs:
Then Spark can add the metadata fields, which means less work for the implementations.
Thinking about it more, how can a file source define custom constant metadata columns? The file listing logic is shared for all file sources and I can't think of a way to customize it for certain file sources.
It needs a custom `FileIndex` to go with the `FileFormat` (see the unit test for an example).
got it. How about my first comment? Or do we expect the implementations to properly separate constant and generated metadata columns by using those util objects?
Could you elaborate on what we gain by splitting out two lists? For generated columns, in particular, we must use the helper object because the user should specify the physical column name to use.
I'm trying to make it easier for third-party file sources to implement the new functions. The fewer internal details we expose through API, the more API stability we have.
but we don't have a choice here. The implementation needs to specify the physical column name, and we must expose these details.
At least the surface is pretty minimal (nobody needs to know the specific metadata tags that get used): instead of saying `StructField(name, dataType, nullable)`, they pick one of the helper-object factories (a sketch follows):
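As a rough illustration only -- assuming the helper objects are along the lines of `FileSourceConstantMetadataStructField` and `FileSourceGeneratedMetadataStructField`, with signatures guessed from this discussion rather than taken from the PR:

```scala
import org.apache.spark.sql.types._
// Package location of the helper objects is an assumption for this sketch.
import org.apache.spark.sql.execution.datasources.{
  FileSourceConstantMetadataStructField, FileSourceGeneratedMetadataStructField}

// Constant field: one value per file, supplied by the FileIndex at listing time.
val owner: StructField =
  FileSourceConstantMetadataStructField("file_owner", StringType, nullable = true)

// Generated field: populated by the reader and backed by a physical column;
// the internal name "_tmp_row_index" is a made-up example.
val rowIndex: StructField =
  FileSourceGeneratedMetadataStructField("row_index", "_tmp_row_index", LongType, nullable = false)
```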
How is `otherConstantMetadataColumnValues` generated? `FileFormat` doesn't have an API for it.
See the unit tests -- `FileIndex.listFiles` is responsible for providing it as part of the `PartitionDirectory` it creates for each file.
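A minimal sketch of that wiring, assuming a delegating `FileIndex`; exactly how the extra values get attached to each `PartitionDirectory` is left as a comment, since that constructor is what this PR adds (see its unit tests):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory}
import org.apache.spark.sql.types.StructType

// Hypothetical FileIndex that attaches constant metadata values to every file it lists.
class MetadataAwareFileIndex(delegate: FileIndex) extends FileIndex {
  // Made-up constant value to expose through the _metadata column.
  private val extraValues: Map[String, Any] = Map("file_owner" -> "etl-job")

  override def listFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
    // Each PartitionDirectory returned here would be rebuilt so that every file
    // carries `extraValues`; the concrete constructor for that is part of this PR.
    delegate.listFiles(partitionFilters, dataFilters)
  }

  override def rootPaths: Seq[Path] = delegate.rootPaths
  override def inputFiles: Array[String] = delegate.inputFiles
  override def refresh(): Unit = delegate.refresh()
  override def sizeInBytes: Long = delegate.sizeInBytes
  override def partitionSchema: StructType = delegate.partitionSchema
}
```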
Let's think more about the API design. I think it's too fragile to use `Any` in the API without a well-defined rule for what the actually allowed values are. I'd suggest using `Map[String, Literal]`. Then we can remove `def isSupportedType`, as all types can be supported.
See my TODO above... we may need to consider supporting value-producing functions, to allow full pruning in cases where the value is somehow expensive to compute. Requiring `Literal` would block that (and AFAIK only `Any` could capture both `Literal` and `() => Literal`).

The `FILE_PATH` case that calls `Path.toString`, and the call sites of `PartitionedFile`, are a small example of that possibility that got me thinking -- what if, instead of passing length, path, etc. as arguments, we just passed the actual file status and used the extractors on it? Probably doesn't make sense to actually do that for the hard-wired cases, tho.
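To make the distinction concrete, a small hypothetical sketch of what `Any` can carry that `Literal` alone cannot (made-up map contents, not the PR's code):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// Stand-in for a value that is expensive to compute and should be skipped
// entirely if the metadata field gets pruned away.
def expensiveTag(): String = "computed-on-demand"

// With Map[String, Any], a value can be either an eager Literal or a thunk.
val fileMetadata: Map[String, Any] = Map(
  "file_owner"    -> Literal("alice"),                // eager constant
  "expensive_tag" -> (() => Literal(expensiveTag()))  // lazy, evaluated only if read
)

// A consumer would normalize both shapes (plus raw primitives) to a Literal.
def toLiteral(value: Any): Literal = value match {
  case lit: Literal        => lit
  case thunk: Function0[_] => thunk().asInstanceOf[Literal]
  case other               => Literal(other) // courtesy wrapping of supported primitives
}
```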
I do like the idea of supporting `Literal` as one of the supported cases -- it simplifies the type checking a bit, in that the "supported" primitive types are merely those for which the implementation will automatically create the `Literal` wrapper as a courtesy (similar to string vs. `UTF8String`).
Update: I remember now another reason why I had added `isSupportedDataType` -- `ConstantColumnVector` (needed by `FileScanRDD`'s `createMetadataColumnVector` below) supports a limited subset of types and relies on type-specific getters and setters. Even if I wrote the (complex recursive) code to handle structs, maps, and arrays... we still wouldn't have complete coverage for all types.

Do we know for certain that `ConstantColumnVector` supports all types that can ever be encountered during vectorized execution? If not, we must keep the `isSupportedDataType` method I introduced, regardless of whether we choose to add support for metadata fields with complex types in this PR.
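For what it's worth, a sketch of the kind of guard being described (hypothetical, not the PR's actual implementation):

```scala
import org.apache.spark.sql.types._

// Hypothetical version of isSupportedDataType: reject nested types, since
// ConstantColumnVector relies on type-specific getters/setters that (per the
// discussion below) are not fully implemented for arrays, maps, and structs.
def isSupportedDataType(dt: DataType): Boolean = dt match {
  case _: StructType | _: ArrayType | _: MapType => false
  case _                                         => true
}
```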
Update: `ConstantColumnVector` looks like an incompletely implemented API... it "supports" array/map/struct on the surface (e.g. `ConstantColumnVectorSuite` has superficial tests for it), but e.g. `ColumnVectorUtils.populate` doesn't actually handle them, and `ColumnVectorUtilsSuite.scala` has negative tests to verify that they cannot be used in practice.

As far as I can tell, the class really only supports data types that can be used as partition columns.
Updated the doc comment here to explain that file-source metadata fields are only one possible usage for the extra file metadata (which is conceptually at a deeper layer than catalyst and `Literal`). Also updated the `isSupportedType` doc comment to explain why not all types are supported.

Relevant implementation details (`Literal` vs. `Any`): values end up wrapped in `Literal(_)`, because doing so simplifies null handling by making null-because-missing equivalent to null-because-null. At that point, we get wrapping of primitive values "for free" if we happen to pass `Any` instead.
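A hedged sketch of that null-handling point (hypothetical helper; the names here are illustrative, not from the PR):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataType

// Hypothetical lookup: a missing key and an explicit null both normalize to the
// same null literal, and raw primitive values get wrapped as a courtesy.
def constantMetadataLiteral(values: Map[String, Any], name: String, dt: DataType): Literal =
  values.get(name) match {
    case None | Some(null)  => Literal.create(null, dt)  // null-because-missing == null-because-null
    case Some(lit: Literal) => lit
    case Some(other)        => Literal.create(other, dt) // courtesy wrapping
  }
```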