Multiply on disk column sizes of iceberg data files by 4 for column stats #15186
Conversation
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsMaker.java
Please rename "OCB stats" to "column stats".
@@ -124,6 +126,11 @@ private TableStatistics makeTableStatistics(IcebergTableHandle tableHandle)
        if (summary.getColumnSizes() != null) {
            Long columnSize = summary.getColumnSizes().get(fieldId);
            if (columnSize != null) {
                if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
nit: you'd typically static import these constants and compare them using ==
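For illustration, the comparison the nit is pointing at would look roughly like this (a sketch; the static imports refer to the standard Trino type singletons, so identity comparison is safe):

import static io.trino.spi.type.VarbinaryType.VARBINARY;
import static io.trino.spi.type.VarcharType.VARCHAR;

// identity comparison against the singleton type constants instead of equals()
if (columnHandle.getBaseType() == VARCHAR || columnHandle.getBaseType() == VARBINARY) {
    columnSize = columnSize * 10;
}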
if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
    // columnSize value is in fact size of column stored on disk which is after compression and is much smaller than
    // the column size in memory. Multiplying by 10 seems like a good heuristic to compensate for that
    columnSize = columnSize * 10;
- The statistical compression factor for binary is likely different than for text. I don't know whether we have any "data" to back up the * 10 number.
- When the Iceberg type is fixed(L), the Trino type is VARBINARY, but Iceberg actually knows that each cell is L bytes. Can we tell the engine about that?
maybe something like this
if (columnHandle.getBaseType() instanceof FixedWidthType) {
// columnSize is the size on disk and Trino column stats is size in memory.
// The relation between the two is type and data dependent.
// However, Trino currently does not use data size statistics for fixed-width types
// (it's not needed for them), so do not report it at all, to avoid reporting some bogus value.
}
else
Or maybe report columnSize only when the column type is varchar/varbinary.
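A minimal sketch of that alternative, keeping the names from the PR (whether the surrounding code skips the statistic when columnSize ends up null is an assumption):

if (columnHandle.getBaseType() == VARCHAR || columnHandle.getBaseType() == VARBINARY) {
    // scale the on-disk (compressed) size up to approximate the in-memory size
    columnSize = columnSize * 10;
}
else {
    // for other types the on-disk-to-in-memory relation is unclear, so report no data size at all
    columnSize = null;
}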
The title is wrong. The code change applies to Parquet and ORC files (and Avro) equally.
Can you also add a test that exercises Trino SHOW STATS on a table that was created and written by Spark, for each file format? TestIcebergSparkCompatibility currently runs SHOW STATS only for tables with int columns.
Is the effect of the changed estimates applicable to …
@raunaqmorarka it's on the read path, so no, we don't need to update the test data's metadata stored there.
I see, I'm a bit surprised then that no TPC plan changed as a result of this.
@raunaqmorarka can it be that both sides are equally "inflated" and so ideal plans don't change?
There could be an impact on the choice of broadcast join due to the max_broadcast_size threshold being breached. When all the terms are estimated, this would turn a broadcast join into a repartitioned join without impacting the ordering. When there are unestimated terms, we fall back to estimates of table sizes for the build vs. probe choice as well.
Changing the …
Indeed, with Iceberg Parquet plan tests (#15255), this change would be reflected in changed plans.
else {
    if (idToType.get(columnHandleTuple.getKey()).typeId() == Type.TypeID.FIXED) {
        Types.FixedType fixedType = (Types.FixedType) idToType.get(columnHandleTuple.getKey());
        columnSize = (long) fixedType.length();
That should also be filled in when summary.getColumnSizes().get(fieldId) == null.
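In other words, something along these lines (a rough sketch reusing names from the surrounding method; the exact control flow is an assumption):

Type icebergType = idToType.get(columnHandleTuple.getKey());
if (icebergType.typeId() == Type.TypeID.FIXED) {
    // every value of a fixed(L) column is exactly L bytes, so this estimate
    // does not depend on summary.getColumnSizes() containing an entry for the field
    columnSize = (long) ((Types.FixedType) icebergType).length();
}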
}
else if (columnHandle.getBaseType() == VARBINARY) {
    // Tests showed that for VARCHAR multiplying by 1.4 seems like a good heuristic
    columnSize = (long) (columnSize * 1.4);
Any comment on how 1.4 was chosen? (Same for 2.7 above.) Can you add a comment? Initially we thought 10x would be OK.
Are you measuring on some "real data", or using the "random hex" test case we run internally?
Also, note that the size on disk can be very small when data is dictionary encoded. Maybe it's safer to overshoot rather than undershoot? cc @raunaqmorarka @sopel39 for over- vs under-estimation
Comment added.
// Tested using item table from TPCDS benchmark
// compared column size of item_desc column stored inside files
// with length of values in that column reported by trino
columnSize = (long) (columnSize * 2.7);
}
else if (columnHandle.getBaseType() == VARBINARY) {
// Tested using VARBINARY columns with random, both in length and content, data
// compared column size stored inside parquet files with length of bytes written to it
columnSize = (long) (columnSize * 1.4);
Does it make sense?
// Tested using item table from TPCDS benchmark
// compared column size of item_desc column stored inside files
// with length of values in that column reported by trino
columnSize = (long) (columnSize * 2.7);
this is awesome!
else if (columnHandle.getBaseType() == VARBINARY) {
// Tested using VARBINARY columns with random, both in length and content, data
// compared column size stored inside parquet files with length of bytes written to it
columnSize = (long) (columnSize * 1.4);
How was the data randomized?
Truly random data should not compress at all, so you should have gotten a ~1.0 factor.
I took random strings, changed them to bytes and saved that data, then I compared the size on disk with the lengths of the byte arrays in memory. If you think we should have 1.0 here then I can do it.
Random ASCII or random full Unicode?

> If you think we should have 1.0 here then I can do it.

I did not say that. I am asking about methodology.
I used org.apache.commons.lang.RandomStringUtils. From its javadoc: "Characters will be chosen from the set of alpha-numeric characters as indicated by the arguments."

> I did not say that. I am asking about methodology.

Sure, I know, but when I thought more about it I tend to agree that this should be close to 1.0.
It should be close to 1 for random bytes.
Since you were generating alpha-numeric characters, the byte repertoire was limited, thus the data was compressible, and hence a > 1.0 factor.
I am fine keeping the 1.4 value, but we need to code-comment that the methodology was not very real-life-like and that some better heuristic value could be proposed in the future.
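A quick standalone sketch (plain JDK, not part of the PR) that illustrates the compressibility point: Deflate barely shrinks truly random bytes, while random alphanumeric bytes compress noticeably because the byte repertoire is limited. Parquet's actual codecs and encodings differ, so the numbers are only indicative.

import java.util.Random;
import java.util.zip.Deflater;

public class CompressionRatioCheck
{
    public static void main(String[] args)
    {
        Random random = new Random(42);

        // full byte repertoire: essentially incompressible, ratio close to 1.0
        byte[] randomBytes = new byte[1_000_000];
        random.nextBytes(randomBytes);

        // alpha-numeric only: limited repertoire, compresses well, ratio clearly above 1.0
        byte[] alphanumericBytes = new byte[1_000_000];
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        for (int i = 0; i < alphanumericBytes.length; i++) {
            alphanumericBytes[i] = (byte) alphabet.charAt(random.nextInt(alphabet.length()));
        }

        System.out.printf("random bytes ratio:       %.2f%n", inMemoryToCompressedRatio(randomBytes));
        System.out.printf("alphanumeric bytes ratio: %.2f%n", inMemoryToCompressedRatio(alphanumericBytes));
    }

    // in-memory size divided by Deflate-compressed size
    private static double inMemoryToCompressedRatio(byte[] input)
    {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length + 1024];
        int compressedSize = deflater.deflate(buffer);
        deflater.end();
        return (double) input.length / compressedSize;
    }
}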
But is it better to keep it at 1.4 with a code comment, or to change it to 1? What do you think is better?
Not a big difference; let's go with the bigger number.
The only important part is the code comment -- don't pretend this is some "very smart value".
> Also, note that the size on disk can be very small when data is dictionary encoded. Maybe it's safer to overshoot rather than undershoot? cc @raunaqmorarka @sopel39 for over- vs under-estimation

Overshooting is safer.
Please rebase on #15255 (or wait for it to be merged). cc @nineinchnick @przemekak: how can we run benchmarks most easily for this change?
@findepi I think at the moment the easiest way is to run this workflow: https://github.com/starburstdata/benchmarks-gha/actions/workflows/standard-benchmarks.yaml
@homar thank you for re-running the CI.
CI hit #15313
I don't see any new comment under that issue.
I only meant to mention that issue because the same test failed here. I probably should have pasted logs from my failure there, but I didn't and now I can't access them. Sorry.
No problem. BTW, all past runs are accessible here: https://github.com/trinodb/trino/actions/workflows/ci.yml?query=branch%3Ahomar%2Fcollect_memory_column_sizes_for_iceberg_with_parquet
Attached: homar test column sizes.pdf
It looks like there is an improvement for TPC-DS.
There are some ups and downs in the results, but overall they look OK to me. (I defer to @sopel39 and @raunaqmorarka on final judgement.)
@sopel39 no (see https://github.com/trinodb/trino/pull/15186/files)
@sopel39 @raunaqmorarka any decision?
@homar I say go ahead
Description
These stats are used by the CBO, which expects sizes of data as it resides in memory.
The idea is to use the column sizes from Iceberg metadata, stored there by the Parquet writer. From some tests I ran, multiplying by 10 looks like a good heuristic.
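For reference, the heuristic the change converged on during review looks roughly like this (a condensed sketch of the snippets discussed above, not the exact final diff):

if (icebergType.typeId() == Type.TypeID.FIXED) {
    // fixed(L): every value occupies exactly L bytes in memory
    columnSize = (long) ((Types.FixedType) icebergType).length();
}
else if (columnHandle.getBaseType() == VARCHAR) {
    // on-disk size is post-compression; 2.7 was measured against the item_desc column of the TPC-DS item table
    columnSize = (long) (columnSize * 2.7);
}
else if (columnHandle.getBaseType() == VARBINARY) {
    // 1.4 was measured on random alphanumeric strings converted to bytes (see the review discussion above)
    columnSize = (long) (columnSize * 1.4);
}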
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: