
Support Metrics mode when creating/writing Iceberg tables #9791

Closed
Tracked by #1324
liqinrae opened this issue Oct 27, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@liqinrae
Contributor

By default, all metrics (null value counts, NaN value counts, upper/lower bounds, and more) are persisted in manifest files, leading to manifest bloat; they can occupy around 80% or more of the manifest contents. We should allow customizing which metrics are kept, retaining the ones customers usually query by.

We should support:

  1. Change the default metrics mode to truncate(16).
  2. Allow configuring the default mode to none, so that all metrics are skipped by default except the ones we want.
  3. Allow per-column metrics mode configuration, to support different modes for different columns.
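For context on what truncate(16) does: Iceberg keeps only a prefix of string min/max bounds. A lower bound can be cut to a prefix directly, but an upper bound must have its last kept character incremented so the truncated value is still a valid upper bound. A rough Python sketch of that behavior (illustrative only, not the actual Iceberg implementation):

```python
def truncate_lower_bound(value: str, length: int) -> str:
    # A prefix is always <= the original value, so a lower bound
    # can simply be cut to the requested length.
    return value[:length]

def truncate_upper_bound(value: str, length: int):
    # An upper bound must stay >= the original value, so after cutting
    # we increment the last character that can be incremented.
    if len(value) <= length:
        return value
    chars = list(value[:length])
    for i in range(length - 1, -1, -1):
        if ord(chars[i]) < 0x10FFFF:
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars[: i + 1])
    return None  # no valid truncated upper bound exists

print(truncate_lower_bound("s3://bucket/data/file-0001.parquet", 16))
print(truncate_upper_bound("s3://bucket/data/file-0001.parquet", 16))
```

Note that the truncated upper bound is no longer an exact value from the data; it is only guaranteed to be >= every value in the file, which is all that min/max pruning needs.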
@homar
Member

homar commented Jun 3, 2022

This is done for ORC; for Parquet it is not done yet.

@findepi
Member

findepi commented Dec 13, 2022

Default metrics mode is changed to truncate(16)

This sounds generally correct, but some data will suffer from such a change.
For example, for Iceberg deletion files we had to force min/max bounds not to be truncated at all, because file paths share a long common prefix. Users' data may exhibit similar patterns.

I wonder whether it would be possible to come up with an automated decision-making process for this.
It seems the truncation length cannot be decided solely from the data within a single file (the deletion-file case), but perhaps it can when the file-level NDV for a column is > 1 (or > 10). In such a case we know something about what the values look like, and we may be able to determine the common prefix length.

Knowing the data across files would be even better. That could be statistical information we gather on a table from time to time.
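One way such an automated decision could look, as a sketch: sample distinct values, measure their common prefix, and pick a truncation length slightly beyond it so truncated bounds still discriminate between files. The function name, default, and margin here are all hypothetical:

```python
import os

def suggest_truncate_length(values, default=16, margin=4):
    # With fewer than 2 distinct values we learn nothing about shared
    # prefixes, so fall back to the default truncation length.
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return default
    # os.path.commonprefix compares character-wise, so it works on any strings.
    prefix_len = len(os.path.commonprefix(distinct))
    # Keep enough characters to get past the shared prefix, plus a margin.
    return max(default, prefix_len + margin)

paths = [
    "s3://bucket/warehouse/db/tbl/data/part-00000.parquet",
    "s3://bucket/warehouse/db/tbl/data/part-00417.parquet",
]
print(suggest_truncate_length(paths))
```

For path-like data such as deletion files, this yields a length well past the default 16, matching the observation above that a fixed truncate(16) would collapse all bounds to the same prefix.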

@jinyangli34
Contributor

@liqinrae @findepi is there any issue or concern with supporting metrics mode for Parquet,
or is it just a simple TODO to update this line?

@raunaqmorarka
Member

We don't want the presence of column-level metrics to be user configurable.
We want all metrics to be collected by default, because Trino has been adding uses of column statistics in new optimizer rules besides join reordering, and it is not easy for users to discover which column metrics would be useful.
We already truncate large string statistics automatically in the ORC and Parquet writers in Trino, and the Iceberg table metadata is just derived from these file-format statistics. See io.trino.parquet.writer.PrimitiveColumnWriter#MAX_STATISTICS_LENGTH_IN_BYTES for example.
We are now able to cache manifest file contents in memory (#22739) or on the local filesystem (#20803) to reduce latency.
We also request only the projected columns' statistics from Iceberg metadata after #22584 and #22555.
For the above reasons, I don't see a reason to pursue this change.
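To illustrate the writer-side safeguard mentioned above: statistics can be capped at a byte budget at write time, independently of any table-level metrics mode. A minimal sketch of one such policy (the constant name mirrors the Trino writer field referenced above, but its value and the drop-rather-than-truncate policy here are assumptions for illustration):

```python
MAX_STATISTICS_LENGTH_IN_BYTES = 64  # illustrative cap, not the real constant's value

def binary_statistic_or_none(value: bytes):
    # One safe policy: drop an over-long min/max statistic entirely rather
    # than truncating it, since a stored bound must remain exactly valid.
    return value if len(value) <= MAX_STATISTICS_LENGTH_IN_BYTES else None

print(binary_statistic_or_none(b"short-bound"))
print(binary_statistic_or_none(b"x" * 200))
```

The point of the sketch is that bloat from pathological values is already bounded at the file-format layer, which is part of the argument against adding a user-facing metrics mode knob.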
