Avro metrics support: create MetricsAwareDatumWriter and some refactors for Avro #1946
Conversation
Thanks @yyanyy, I'll take a look at this one soon!
-  private final ByteBuffer lowerBound;
-  private final ByteBuffer upperBound;
+  private final Object lowerBound;
+  private final Object upperBound;
Why change this to Object rather than ByteBuffer? Conversion to ByteBuffer seems cleaner if it is done in each writer, because the writer already has its type: it is going to call the right method on the encoder anyway.
I think if we convert this to ByteBuffer now, we may still need to check the type when doing truncation (based on the metrics mode), and strings with multi-byte characters will not produce the same result if truncated by BinaryUtil.truncateBinary. So we would either convert the byte buffer back to a char sequence and use UnicodeUtil.truncateString, or create a new BinaryUtil.truncateString. Whereas if we do the conversion later, when evaluating metrics, the code needed for the conversion itself isn't that bad since we know the type of the field; that's the reason for this change.
One thing worth noting about the current approach: in order to ensure that Conversions.toByteBuffer works, for certain writers I have to make sure the min/max returned from the value writers is a type that Conversions.toByteBuffer knows how to translate, when the data type in write is not of that type (that is, this usage of the method). If we return ByteBuffer for bounds in field metrics, we would still need a similar translation function in each value writer, but it would translate the input data type directly to a byte buffer instead of doing two hops, and that might be easier to understand.
I guess I was thinking that truncation would happen when FieldMetrics is constructed, in the leaf writers. If that's not the case, then I think it makes sense to do the conversion later.
If the conversion happens later, then I think this class should be parameterized. I never like to have classes that track just Object. We should at least guarantee that both lower and upper bounds are the same type, for example.
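As a rough illustration of the parameterization idea (not the actual Iceberg class; the field set shown is an assumption and other fields are omitted):

```java
// Sketch of a parameterized FieldMetrics that guarantees both bounds share the
// same type T; value/null counts and other fields are omitted for brevity.
public class FieldMetrics<T> {
  private final int id;
  private final T lowerBound;
  private final T upperBound;

  public FieldMetrics(int id, T lowerBound, T upperBound) {
    this.id = id;
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
  }

  public int id() {
    return id;
  }

  public T lowerBound() {
    return lowerBound;
  }

  public T upperBound() {
    return upperBound;
  }
}
```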
I was mostly following the pattern of ORC and Parquet, which evaluate the metrics mode when collecting metrics (which has to be the case there, since those file formats collect stats themselves). But I think there's nothing preventing us from taking the metrics mode into account during value writer creation; it will just make the visitor pattern a little more complicated. I'll give it a try, and thanks for bringing up this idea!
I guess for now I'll revert the change to FieldMetrics in this PR and include it in the next one that updates the value writers, if we need to change it. Hopefully that doesn't add too much to the next PR!
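A very rough sketch of what resolving the metrics mode at writer-construction time could look like; MetricsConfig.columnMode and MetricsModes.Truncate are existing Iceberg APIs, but the surrounding class is purely hypothetical:

```java
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.MetricsModes;
import org.apache.iceberg.MetricsModes.MetricsMode;

// Hypothetical: resolve the truncation length for a column up front so a leaf
// writer could truncate bounds as it collects them, rather than at
// metrics-evaluation time.
class TruncationLengths {
  static int truncateLengthFor(MetricsConfig config, String columnName) {
    MetricsMode mode = config.columnMode(columnName);
    if (mode instanceof MetricsModes.Truncate) {
      return ((MetricsModes.Truncate) mode).length();
    }
    return Integer.MAX_VALUE; // full (or no) bounds; no truncation applied here
  }
}
```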
@@ -157,6 +157,10 @@ public int compare(List<T> o1, List<T> o2) {
     return UnsignedByteBufComparator.INSTANCE;
   }

+  public static Comparator<byte[]> unsignedByteArray() {
Other method names are plural. Could we use unsignedByteArrays()?
@@ -123,7 +127,7 @@ public WriteBuilder named(String newName) {
     return this;
   }

-  public WriteBuilder createWriterFunc(Function<Schema, DatumWriter<?>> writerFunction) {
+  public WriteBuilder createWriterFunc(Function<Schema, MetricsAwareDatumWriter<?>> writerFunction) {
This is going to break existing uses of createWriterFunc in projects that build on Iceberg. I think this should keep the old parameter and just check whether the implementation is MetricsAwareDatumWriter in the appender to return metrics.
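For example (a sketch only, with the appender wiring and package locations assumed): the builder could keep accepting Function<Schema, DatumWriter<?>> and the appender could fall back to empty metrics for plain writers.

```java
import java.util.stream.Stream;
import org.apache.avro.io.DatumWriter;
import org.apache.iceberg.FieldMetrics;

class WriterMetrics {
  // If the configured writer happens to be metrics-aware, expose its metrics;
  // otherwise report nothing, so existing DatumWriter implementations keep working.
  static Stream<FieldMetrics> metricsOf(DatumWriter<?> writer) {
    if (writer instanceof MetricsAwareDatumWriter) {
      return ((MetricsAwareDatumWriter<?>) writer).metrics();
    }
    return Stream.empty();
  }
}
```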
Sounds good, I didn't think of the case where people have their own implementations of these interfaces, so I totally missed this. Will update and keep it in mind!
@yyanyy, can you rebase this to fix the conflict?
Thanks for rebasing, it looks good to me
    }
  }

  // if there are no differences, then the shorter seq is first
nit: "the shorter seq is first" is a bit confusing to me, maybe "is smaller" is a better word.
  @Override
  public Stream<FieldMetrics> metrics() {
    return Stream.concat(PATH_WRITER.metrics(), POS_WRITER.metrics());
This should also include metrics from the rowWriter, right?
This can be fixed in a follow-up.
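A possible shape for that follow-up, concatenating all three streams (the rowWriter field name is taken from the comment above; the rest is assumed):

```java
@Override
public Stream<FieldMetrics> metrics() {
  return Stream.concat(
      Stream.concat(PATH_WRITER.metrics(), POS_WRITER.metrics()),
      rowWriter.metrics());
}
```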
      CodecFactory codec, Map<String, String> metadata) throws IOException {
    DataFileWriter<D> writer = new DataFileWriter<>(
-        (DatumWriter<D>) createWriterFunc.apply(schema));
+        (DatumWriter<D>) metricsAwareDatumWriter);
Minor: I don't think this needs to be a MetricsAwareDatumWriter, right? It isn't in the type signature, so we should name it just datumWriter.
Thanks, @yyanyy! This looks good now so I merged it. That should unblock the next steps.
This change is a smaller PR broken down from #1935. There is no change in behavior from this PR. It covers the following:
- Add a metrics() method to ValueWriter for Avro, currently defaulting to an empty stream
- Create a MetricsAwareDatumWriter that exposes writer metrics, and replace DatumWriter with it in various classes
- Create an AvroMetrics class that resembles the current behavior for producing metrics for the Avro writer
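As a rough sketch of the second item, the new writer interface might look something like this (the actual interface in the PR may differ in details; the package for FieldMetrics is assumed):

```java
import java.util.stream.Stream;
import org.apache.avro.io.DatumWriter;
import org.apache.iceberg.FieldMetrics;

// A DatumWriter that can additionally report per-field metrics collected
// while writing.
public interface MetricsAwareDatumWriter<D> extends DatumWriter<D> {
  // stream of metrics for the fields this writer has written
  Stream<FieldMetrics> metrics();
}
```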