-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add internal Avro reader #11108
Conversation
a0091f9
to
25e6ce9
Compare
81eacc4
to
6b6cd1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to run our benchmarks like ManifestReadBenchmark
to ensure there is no regression.
* | ||
* @param <T> Java type returned by the reader | ||
*/ | ||
public class InternalReader<T> implements DatumReader<T>, SupportsRowPosition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need anything in the name to indicate it is specific to Avro? Will the name clash once we add a similar path for Parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could name it InternalAvroReader
but in most places we just let the names conflict because we're only working with Parquet or Avro, not both. Is this something you think we should do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about a potential reader/writer factory that would create an appropriate instance based on the file type.
|
||
private final Types.StructType expectedType; | ||
private final Map<Integer, Class<? extends StructLike>> typeMap = Maps.newHashMap(); | ||
private final Map<Integer, Object> idToConstant = ImmutableMap.of(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need this field if it is immutable? Will it change once we add default values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is going to change when we have default values and will be populated in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan another PR to add V3 default-values support (just curious) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbonofre, sorry I missed this earlier.
I was wrong in my last comment. Default values come directly from the expected schema and support for them was added when I rebased this on top of @wmoustafa's recent PR (#9502). This constant map is here as a placeholder for when we add the ability to pass in constants that override fields, which is how we implement partition values. Those must be stored in data files, but are not read.
LogicalType logicalType = primitive.getLogicalType(); | ||
if (logicalType != null) { | ||
switch (logicalType.getName()) { | ||
case "date": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These string constants seem fragile but I don't have a better suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner this way. No need to introduce constants that are not widely used and would need to be public for any reuse.
@@ -44,7 +44,8 @@ public class TestManifestReader extends TestBase { | |||
"fileOrdinal", | |||
"fileSequenceNumber", | |||
"fromProjectionPos", | |||
"manifestLocation") | |||
"manifestLocation", | |||
"partitionData.partitionType.fieldsById") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the tests were moved to JUnit 5, the previous assertions were updated to use reflection and a recursive equality check. This lazy field is now present in the value that gets returned, although it doesn't really affect anything. This change gets the tests to work and I don't want to fix the real issue -- that the switch to JUnit5 wasn't done right and checks more than the original tests did rather than porting the original assertions.
Needless to say, I'm not a fan of how that conversion went, but it isn't something we should fix here.
3f1bd3b
to
474d444
Compare
474d444
to
c3bcee3
Compare
LGTM. |
Here are the benchmark results:
The difference is within the error range, so I don't think there's a significant penalty (there should hopefully be a slight improvement, but it's hard to tell). |
Thanks for the review, @aokolnychyi! |
This adds an Avro reader that produces Iceberg's internal object model and uses
StructLike
rather than Avro'sIndexedRecord
.