
Support Metrics mode when creating/writing Iceberg tables #9791

Closed
Tracked by #1324
liqinrae opened this issue Oct 27, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@liqinrae
Contributor

By default, all metrics (null value counts, NaN value counts, upper/lower bounds, and more) are persisted in manifest files, leading to manifest bloat; they can occupy around 80% or more of the manifest contents. We should allow customizing which metrics are kept, retaining the ones customers usually query by.

We should support:

  1. Change the default metrics mode to truncate(16).
  2. Allow configuring the default mode to none, so that all metrics are skipped by default except the ones we want.
  3. Allow per-column metrics mode configuration, to support different modes for different columns.
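For context on what truncate(16) does: Iceberg keeps only a prefix of string min/max bounds. A lower bound can be cut to a prefix directly, but an upper bound must have its last kept character incremented so the truncated value is still a valid upper bound. A rough Python sketch of that behavior (illustrative only, not the actual Iceberg implementation):

```python
def truncate_lower_bound(value: str, length: int) -> str:
    # A prefix is always <= the original value, so a lower bound
    # can simply be cut to the requested length.
    return value[:length]

def truncate_upper_bound(value: str, length: int):
    # An upper bound must stay >= the original value, so after cutting
    # we increment the last character that can be incremented.
    if len(value) <= length:
        return value
    chars = list(value[:length])
    for i in range(length - 1, -1, -1):
        if ord(chars[i]) < 0x10FFFF:
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars[: i + 1])
    return None  # no valid truncated upper bound exists

print(truncate_lower_bound("s3://bucket/data/file-0001.parquet", 16))
print(truncate_upper_bound("s3://bucket/data/file-0001.parquet", 16))
```

Note that the truncated upper bound is no longer an exact value from the data; it is only guaranteed to be >= every value in the file, which is all that min/max pruning needs.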
@homar
Member

homar commented Jun 3, 2022

This is done for ORC; for Parquet it is not done yet.

@findepi
Member

findepi commented Dec 13, 2022

Default metrics mode is changed to truncate(16)

This sounds generally correct, but some data will suffer from such a change.
For example, for Iceberg deletion files we had to force min/max bounds not to be truncated at all, because file paths share a long common prefix. Users' data may exhibit similar patterns.

I wonder whether it would be possible to come up with an automated decision-making process for this.
It seems the truncation length cannot be decided solely from the data within a single file (the deletion-file case), but perhaps it can when the file-level NDV for a column is > 1 (or > 10). In such a case we know something about what the values look like, and we may be able to determine the common prefix length.

Knowing the data across files would be even better. That could be statistical information we gather on a table from time to time.
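One way such an automated decision could look, as a sketch: sample distinct values, measure their common prefix, and pick a truncation length slightly beyond it so truncated bounds still discriminate between files. The function name, default, and margin here are all hypothetical:

```python
import os

def suggest_truncate_length(values, default=16, margin=4):
    # With fewer than 2 distinct values we learn nothing about shared
    # prefixes, so fall back to the default truncation length.
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return default
    # os.path.commonprefix compares character-wise, so it works on any strings.
    prefix_len = len(os.path.commonprefix(distinct))
    # Keep enough characters to get past the shared prefix, plus a margin.
    return max(default, prefix_len + margin)

paths = [
    "s3://bucket/warehouse/db/tbl/data/part-00000.parquet",
    "s3://bucket/warehouse/db/tbl/data/part-00417.parquet",
]
print(suggest_truncate_length(paths))
```

For path-like data such as deletion files, this yields a length well past the default 16, matching the observation above that a fixed truncate(16) would collapse all bounds to the same prefix.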

@jinyangli34
Contributor

@liqinrae @findepi is there any issue or concern with supporting metrics mode for Parquet,
or is it just a simple TODO to update this line?

@raunaqmorarka
Member

We don't want the presence of column-level metrics to be user configurable.
We want all metrics to be collected by default, because Trino has been adding uses of column statistics in new optimizer rules besides join reordering, and it is not easy for users to discover which column metrics would be useful.
We already truncate large string statistics automatically in the ORC and Parquet writers in Trino, and the Iceberg table metadata is just derived from these file-format statistics. See io.trino.parquet.writer.PrimitiveColumnWriter#MAX_STATISTICS_LENGTH_IN_BYTES for example.
We are now able to cache manifest file contents in memory (#22739) or on the local filesystem (#20803) to reduce latency.
We also request only the projected columns' statistics from Iceberg metadata after #22584 and #22555.
For the above reasons, I don't see a reason to pursue this change.
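To illustrate the writer-side safeguard mentioned above: statistics can be capped at a byte budget at write time, independently of any table-level metrics mode. A minimal sketch of one such policy (the constant name mirrors the Trino writer field referenced above, but its value and the drop-rather-than-truncate policy here are assumptions for illustration):

```python
MAX_STATISTICS_LENGTH_IN_BYTES = 64  # illustrative cap, not the real constant's value

def binary_statistic_or_none(value: bytes):
    # One safe policy: drop an over-long min/max statistic entirely rather
    # than truncating it, since a stored bound must remain exactly valid.
    return value if len(value) <= MAX_STATISTICS_LENGTH_IN_BYTES else None

print(binary_statistic_or_none(b"short-bound"))
print(binary_statistic_or_none(b"x" * 200))
```

The point of the sketch is that bloat from pathological values is already bounded at the file-format layer, which is part of the argument against adding a user-facing metrics mode knob.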
