Support data_size when analyzing in Delta Lake #12814
f6a66fa to 64ff015
CI hit #12818
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java
.map(columnMetadata -> new ColumnStatisticMetadata(columnMetadata.getName(), NUMBER_OF_DISTINCT_VALUES_SUMMARY))
.forEach(columnStatistics::add);
.forEach(columnMetadata -> {
columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), TOTAL_SIZE_IN_BYTES));
It doesn't make sense to calculate TOTAL_SIZE_IN_BYTES for, e.g., numbers, since the engine doesn't use this value for fixed-width data types.
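A minimal sketch of the suggestion above, not the PR's actual code: request TOTAL_SIZE_IN_BYTES only for columns whose type is not fixed-width. The `Column` class, the `fixedWidth` flag, and the string-encoded statistic requests are hypothetical stand-ins for the connector's column metadata and `ColumnStatisticMetadata`; in the real connector the check would presumably inspect the column's type instead.

```java
import java.util.ArrayList;
import java.util.List;

final class StatisticRequestSketch
{
    // Hypothetical stand-in for column metadata: a name plus a fixed-width flag.
    static final class Column
    {
        final String name;
        final boolean fixedWidth;

        Column(String name, boolean fixedWidth)
        {
            this.name = name;
            this.fixedWidth = fixedWidth;
        }
    }

    // Returns "column:STATISTIC" strings standing in for ColumnStatisticMetadata entries.
    static List<String> requestedStatistics(List<Column> columns)
    {
        List<String> requested = new ArrayList<>();
        for (Column column : columns) {
            requested.add(column.name + ":NUMBER_OF_DISTINCT_VALUES_SUMMARY");
            // Data size is only useful to the engine for variable-width types.
            if (!column.fixedWidth) {
                requested.add(column.name + ":TOTAL_SIZE_IN_BYTES");
            }
        }
        return requested;
    }
}
```

With a fixed-width `id` column and a variable-width `name` column, only `name` gets a data size request.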
Also, if we decide we cannot merge existing stats with no data size (see other comment), we should ask the engine to collect them.
if we decide we cannot merge existing stats with no data size (see other comment), we should ask the engine to collect them
I don't understand how to achieve this. Could you share the details?
In io.trino.plugin.deltalake.DeltaLakeMetadata#getStatisticsCollectionMetadata we already read the ExtendedStatistics. We can check whether we have data size for selected columns. If we don't, we just don't create ColumnStatisticMetadata asking to collect TOTAL_SIZE_IN_BYTES.
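The check described above could look roughly like the following sketch. Everything here is hypothetical: the map from column name to an optional total size stands in for whatever ExtendedStatistics exposes, and treating a column with no previous statistics as safe to collect is an assumption, not something stated in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

final class DataSizeRequestSketch
{
    // existingTotalSizes: hypothetical view of previously stored statistics,
    // keyed by column name; an empty OptionalLong means "stats exist, but no data size".
    static List<String> columnsToCollectDataSizeFor(
            List<String> analyzedColumns,
            Map<String, OptionalLong> existingTotalSizes)
    {
        List<String> result = new ArrayList<>();
        for (String column : analyzedColumns) {
            OptionalLong existing = existingTotalSizes.get(column);
            // Assumption: no previous stats for the column means there is nothing to merge with.
            boolean nothingToMergeWith = existing == null;
            boolean mergeable = existing != null && existing.isPresent();
            if (nothingToMergeWith || mergeable) {
                result.add(column);
            }
        }
        return result;
    }
}
```

Columns whose stored statistics lack a data size are skipped, so no TOTAL_SIZE_IN_BYTES request would be created for them.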
totalSize = getLongValue(computedStatistics.get(TOTAL_SIZE_IN_BYTES));
}

HyperLogLog ndvSummary = HyperLogLog.newInstance(APPROX_SET_NUMBER_OF_BUCKETS);
Don't invoke this when computedStatistics.containsKey(NUMBER_OF_DISTINCT_VALUES_SUMMARY), since this may be a somewhat expensive allocation.
BTW, what is the case when !computedStatistics.containsKey(NUMBER_OF_DISTINCT_VALUES_SUMMARY)?
We ask the engine to calculate the HLL, so we may expect it to be present, right?
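One way to avoid the eager allocation is to create the empty fallback only when the computed value is missing. The sketch below is generic and hypothetical (`computedOrEmpty` is not a method in the PR); in the connector the supplier would presumably wrap the HyperLogLog.newInstance(APPROX_SET_NUMBER_OF_BUCKETS) call shown in the diff.

```java
import java.util.Map;
import java.util.function.Supplier;

final class LazyFallbackSketch
{
    // Returns the computed value when present; otherwise builds the fallback on demand.
    static <K, V> V computedOrEmpty(Map<K, V> computedStatistics, K key, Supplier<V> emptyFactory)
    {
        V computed = computedStatistics.get(key);
        if (computed != null) {
            return computed;
        }
        // The factory runs only on this path, so nothing is allocated when the key is present.
        return emptyFactory.get();
    }
}
```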
...delta-lake/src/main/java/io/trino/plugin/deltalake/statistics/DeltaLakeColumnStatistics.java
...lake/src/test/java/io/trino/plugin/deltalake/metastore/TestDeltaLakeMetastoreStatistics.java
@losipiuk PTAL
There's a version number in
We could, but we would need to update the logic here: trino/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java, line 1980 in fa83b65
(Since the change is quite backwards compatible, I thought we were going to leave the current number as is, but maybe I under-appreciate some consequences of doing so.)
I didn't increase the number since I thought this was backward compatible, as @findepi already said. Let me know if we should increment the number.
a24426a to 4cd8683
4cd8683 to df757b8
CI hit #12858
Description
Add support for data_size when analyzing in Delta Lake
Documentation
(x) Sufficient documentation is included in this PR.
Release notes
(x) Release notes entries required with the following suggested text: