-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Metrics mode when creating/writing Iceberg tables #9938
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: q-li.
|
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
2 similar comments
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
if (metricsMode != MetricsModes.Counts.get()) { | ||
toIcebergMinMax(orcColumnStats, icebergField.type(), metricsMode).ifPresent(minMax -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not handle metrics mode truncate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdsr yes we do, we handle truncate in the IcebergMinMax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we added support for this below
@@ -431,7 +432,8 @@ public ConnectorOutputTableHandle beginCreateTable(ConnectorSession session, Con | |||
getColumns(transaction.table().schema(), typeManager), | |||
transaction.table().location(), | |||
getFileFormat(transaction.table()), | |||
transaction.table().properties()); | |||
transaction.table().properties(), | |||
Collections.emptyMap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whole table property map is already passed in as a part of the writable table handle, why do we need to construct the metrics config as a separated field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995 yes you are right. There is no need to construct the metrics config since the table properties have already been passed to the writable table handle. I checked the code and we already have functions in place to retrieve metrics related properties so we can make this even simpler. Will add a fix to this, thanks!
@@ -128,6 +131,7 @@ public IcebergPageSink( | |||
this.jsonCodec = requireNonNull(jsonCodec, "jsonCodec is null"); | |||
this.session = requireNonNull(session, "session is null"); | |||
this.fileFormat = requireNonNull(fileFormat, "fileFormat is null"); | |||
this.metricsConfig = MetricsConfig.fromProperties(requireNonNull(storageProperties, "metricsConfig is null")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error message does not match.
Btw, I think it makes more sense to pass in the MetricsConfig
to the constructor instead of storageProperties
, because file format is also derived from storageProperties
and we are passing it in separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995 Sorry you meant passing in the MetricsConfig
to the pagesink
or IcebergWritableTableHandle
constructor?
The file format is derived from the property map and then passed to the writable table handle as a separate component, which is how we passed inMetricsConfig
initially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to the page sink. But I don't have a strong opinion on this, there are many duplicated information passed across the codebase as of today. We can just fix the error message first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liqinrae can you please check the Ci build results? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash your commits.
When doing so, try not to change the base commit:
git rebase -i $(git merge-base head master)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSinkProvider.java
Show resolved
Hide resolved
upperBoundsBuilder.put(icebergId, minMax.getMax()); | ||
}); | ||
|
||
if (metricsMode != MetricsModes.Counts.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetricsMode
is an interface, open for extensions. The fact that something isn't Counts doesn't mean it's Full.
Also, there is no javadoc indicating .get()
returns same instance every time, so using .equals()
would be more appropriate here.
this should be handled in a defensive manner:
if (metricsMode.equals(MetricsModes.None.get())) {
// nothing to do
}
else if (metricsMode.equals(MetricsModes.Counts.get()) {
.. collect counts
}
else if (metricsMode.equals( full )) {
...
}
else {
throw new UnsupportedOperationException("Unsupported metrics mode: " + metricsMode);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi The conditional statement here handles the case where the metricsMode is truncate(length) or full. Both of these two cases need to store min/max except that truncate need to apply truncation if needed.
All of the metrics modes other than None need to collect counts. There would be duplication of code if we are to handle each mode separately. Would adding a verify
function here more feasible if we want to handle it more defensively?
this.max = Conversions.toByteBuffer(type, max); | ||
// Out of the two types which could be truncated, string or binary, ORC only supports string bounds. | ||
// Therefore, truncation will be applied if needed only on string type. | ||
if (metricsMode instanceof MetricsModes.Truncate && type.typeId() == org.apache.iceberg.types.Type.TypeID.STRING) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binary (varbinary) and fixed too
(we don't support creating tables with fixed
, but i think we do support writes to such tables)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi I was referring to the OrcMetrics
implementation here: https://github.com/apache/iceberg/blob/ddc5aff9167c7e367e2b4313b91e55451f09ef16/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L281
Seems like it only added truncation for string bound in the orc format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because they don't produce any stats for binary for ORC, see
trino/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergOrcConnectorTest.java
Line 31 in 88ef2ca
return !(typeName.equalsIgnoreCase("varbinary")); |
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still, not having an if
for varbinary doesn't seem future proof, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi yes I agree. I'll add varbinary here.
this.min = Conversions.toByteBuffer(type, min); | ||
this.max = Conversions.toByteBuffer(type, max); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if MetricsMode is != Full (eg Counts, None), no min/max should be stored.
again, this should be handled in defensive manner, instead of if-ing on a Full class
@@ -129,7 +249,7 @@ public void testBasic() | |||
assertQuery("SELECT min(orderpriority) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(6) + "'"); | |||
assertQuery("SELECT min(clerk) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(7) + "'"); | |||
assertQuery("SELECT min(shippriority) FROM tpch.tiny.orders", "VALUES " + lowerBounds.get(8)); | |||
assertQuery("SELECT min(comment) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'"); | |||
assertQuery("SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lowerBounds.get(9)
is the actual, while SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders
is the expected. The expected/actual are reversed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi I tried reversing the order for all the assertQuery
in the test function but encountered failures when using SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders
as the expected sql statement. Checked other test files and looks like they placed all select
statement as the expected and result as the actual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't use assertQuery
at all, when we want to assert a value that we have at hand.
this is preexisting, so ok to followup
031c2fd
to
000ad30
Compare
Hi @findepi @jackye1995 I addressed the comments and left a few replies on this PR. Please let me know your thoughts. Thanks! |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergOrcFileWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergOrcFileWriter.java
Outdated
Show resolved
Hide resolved
@@ -129,7 +249,7 @@ public void testBasic() | |||
assertQuery("SELECT min(orderpriority) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(6) + "'"); | |||
assertQuery("SELECT min(clerk) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(7) + "'"); | |||
assertQuery("SELECT min(shippriority) FROM tpch.tiny.orders", "VALUES " + lowerBounds.get(8)); | |||
assertQuery("SELECT min(comment) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'"); | |||
assertQuery("SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders", "VALUES '" + lowerBounds.get(9) + "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't use assertQuery
at all, when we want to assert a value that we have at hand.
this is preexisting, so ok to followup
@@ -141,7 +261,8 @@ public void testBasic() | |||
assertQuery("SELECT max(orderpriority) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(6) + "'"); | |||
assertQuery("SELECT max(clerk) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(7) + "'"); | |||
assertQuery("SELECT max(shippriority) FROM tpch.tiny.orders", "VALUES " + upperBounds.get(8)); | |||
assertQuery("SELECT max(comment) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(9) + "'"); | |||
// the default truncate(16) mode would round up the last character of the maximum bound while keeping the first 15 characters identical | |||
assertQuery("SELECT substring(max(comment), 1, 15) FROM tpch.tiny.orders", "VALUES '" + upperBounds.get(9).substring(0, 15) + "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert on upperBounds.get(9)
without taking substring
on the left side, use substring(max(comment), 1, 15) || '...'
to construct the expected value
@liqinrae have you had time to update this for my comments? |
000ad30
to
3c81be5
Compare
7f52b71
to
f319189
Compare
3040381
to
ffb9007
Compare
Added support for all metrics modes in orc format as mentioned in #9791