Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support data_size when analyzing in Delta Lake #12814

Merged
merged 1 commit into from
Jun 15, 2022

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jun 13, 2022

Description

Add support for data_size when analyzing in Delta Lake

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# Delta Lake
* Add support for `data_size` when analyzing a table. ({issue}`12814`)

@cla-bot cla-bot bot added the cla-signed label Jun 13, 2022
@github-actions github-actions bot added the docs label Jun 13, 2022
@ebyhr ebyhr force-pushed the ebi/delta-varchar-statistics branch from f6a66fa to 64ff015 Compare June 13, 2022 07:15
@ebyhr
Copy link
Member Author

ebyhr commented Jun 13, 2022

CI hit #12818

@ebyhr ebyhr requested review from raunaqmorarka and findepi June 13, 2022 11:22
@findepi findepi requested a review from alexjo2144 June 13, 2022 13:32
docs/src/main/sphinx/connector/delta-lake.rst Show resolved Hide resolved
.map(columnMetadata -> new ColumnStatisticMetadata(columnMetadata.getName(), NUMBER_OF_DISTINCT_VALUES_SUMMARY))
.forEach(columnStatistics::add);
.forEach(columnMetadata -> {
columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), TOTAL_SIZE_IN_BYTES));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make sense to calculate TOTAL_SIZE_IN_BYTES for eg numbers, since engine doesn't use this value for fixed-width data types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, if we decide we cannot merge existing stats with no data size (see other commnent), we should ask engine to collect them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we decide we cannot merge existing stats with no data size (see other commnent), we should ask engine to collect them

I don't understand how to achieve this. Could you share the details?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In io.trino.plugin.deltalake.DeltaLakeMetadata#getStatisticsCollectionMetadata we already read the ExtendedStatistics. We can check whether we have data size for selected columns.
If we don't, we just don't create ColumnStatisticMetadata asking to collect TOTAL_SIZE_IN_BYTES

totalSize = getLongValue(computedStatistics.get(TOTAL_SIZE_IN_BYTES));
}

HyperLogLog ndvSummary = HyperLogLog.newInstance(APPROX_SET_NUMBER_OF_BUCKETS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't invoke this when computedStatistics.containsKey(NUMBER_OF_DISTINCT_VALUES_SUMMARY), since this may be somewhat expensive allocation.

BTW what is the case when ! computedStatistics.containsKey(NUMBER_OF_DISTINCT_VALUES_SUMMARY)?
We ask engine to calculate the HLL, so we may expect it to be present, right?

@findepi
Copy link
Member

findepi commented Jun 13, 2022

@losipiuk PTAL

@alexjo2144
Copy link
Member

There's a version number in ExtendedStats we should bump with this change

@findepi
Copy link
Member

findepi commented Jun 13, 2022

There's a version number in ExtendedStats we should bump with this change

We could, but we would need to update the logic here

statistics.getModelVersion() == ExtendedStatistics.CURRENT_MODEL_VERSION,

(since the change is quite backwards compatible, i though we're going to leave the current number as is, but maybe i under-appreciate some consequences of doing so)

@ebyhr
Copy link
Member Author

ebyhr commented Jun 14, 2022

I didn't increase the number since I thought this is backward compatible as @findepi already said. Let me know if we should increment the number.

@ebyhr ebyhr force-pushed the ebi/delta-varchar-statistics branch 2 times, most recently from a24426a to 4cd8683 Compare June 15, 2022 06:15
@ebyhr ebyhr requested a review from findepi June 15, 2022 10:37
@ebyhr ebyhr force-pushed the ebi/delta-varchar-statistics branch from 4cd8683 to df757b8 Compare June 15, 2022 13:29
@ebyhr
Copy link
Member Author

ebyhr commented Jun 15, 2022

CI hit #12858

@ebyhr ebyhr merged commit c440595 into trinodb:master Jun 15, 2022
@ebyhr ebyhr deleted the ebi/delta-varchar-statistics branch June 15, 2022 23:26
@ebyhr ebyhr mentioned this pull request Jun 15, 2022
@github-actions github-actions bot added this to the 387 milestone Jun 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants