
Multiply on disk column sizes of iceberg data files by 4 for column stats #15186

Conversation


@homar homar commented Nov 24, 2022

Description

These stats are used by the CBO, which expects the size of the data as it resides in memory.
The idea is to use the column sizes from the Iceberg metadata, which are stored there by the Parquet writer. From the tests I ran, multiplying by 10 looks like a good heuristic.
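Roughly, the idea looks like the sketch below (illustrative only; summary and fieldId follow the diff discussed later, columnBuilder is an assumed name for the Trino ColumnStatistics.Builder, and the exact wiring in the actual change differs):

// On-disk, per-column size recorded by the writer in the Iceberg file metadata
Long columnSize = summary.getColumnSizes().get(fieldId);
if (columnSize != null) {
    // The value is the compressed/encoded size on disk, while the CBO wants the
    // in-memory data size, so scale it up by a heuristic factor (10 in the first version).
    columnBuilder.setDataSize(Estimate.of(columnSize * 10));
}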

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 24, 2022
@homar homar requested a review from findepi November 24, 2022 17:17
@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch from 96e9019 to f6d02d0 on November 24, 2022 21:48
@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch from f6d02d0 to e3eeea3 on November 25, 2022 09:01

findepi commented Nov 25, 2022

Please rename "OCB stats" to "column stats".

@@ -124,6 +126,11 @@ private TableStatistics makeTableStatistics(IcebergTableHandle tableHandle)
if (summary.getColumnSizes() != null) {
Long columnSize = summary.getColumnSizes().get(fieldId);
if (columnSize != null) {
if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
Member

nit: you'd typically static import these constants and compare them using ==

if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
// columnSize value is in fact size of column stored on disk which is after compression and is much smaller than
// the column size in memory. Multiplying by 10 seems like a good heuristic to compensate for that
columnSize = columnSize * 10;
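For illustration, the statically imported form the nit refers to would look roughly like this (a sketch, not the PR's final code):

import static io.trino.spi.type.VarbinaryType.VARBINARY;
import static io.trino.spi.type.VarcharType.VARCHAR;

// VARCHAR and VARBINARY are singleton Type instances, so identity comparison works here
if (columnHandle.getBaseType() == VARCHAR || columnHandle.getBaseType() == VARBINARY) {
    columnSize = columnSize * 10;
}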
Member

  • The statistical compression factor for binary is likely different from that for text.
    I don't know whether we have any "data" to back up the * 10 number.
  • When the Iceberg type is fixed(L), the Trino type is VARBINARY, but Iceberg actually knows that each cell is L bytes. Can we tell the engine about that?

@@ -124,6 +126,11 @@ private TableStatistics makeTableStatistics(IcebergTableHandle tableHandle)
if (summary.getColumnSizes() != null) {
Long columnSize = summary.getColumnSizes().get(fieldId);
if (columnSize != null) {
if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
Member

maybe something like this

if (columnHandle.getBaseType() instanceof FixedWidthType) {
    // columnSize is the size on disk and Trino column stats is size in memory.
    // The relation between the two is type and data dependent.
    // However, Trino currently does not use data size statistics for fixed-width types
    // (it's not needed for them), so do not report it at all, to avoid reporting some bogus value.
}
else

Member

or maybe report columnSize only when column type is varchar/varbinary
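That alternative might look roughly like the following (a sketch; the builder variable and the factor constant are assumptions, not code from the PR):

io.trino.spi.type.Type baseType = columnHandle.getBaseType();
if (baseType == VARCHAR || baseType == VARBINARY) {
    // Only variable-width string/binary columns get a data-size estimate,
    // scaling the on-disk size by the compression heuristic discussed above.
    columnBuilder.setDataSize(Estimate.of(columnSize * ON_DISK_TO_MEMORY_FACTOR));
}
// For other types, leave the data size unknown rather than reporting a bogus value.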


findepi commented Nov 25, 2022

Multiply on disk column sizes of parquet files by 10 for OCB stats

The title is wrong. The code change applies to Parquet and ORC files (and Avro) equally.
-- Trino doesn't write columnSize for ORC files, but some other application may.


findepi commented Nov 25, 2022

Can you also add some test that would exercise Trino SHOW STATS on a table that was created and written by Spark, for each file format?

TestIcebergSparkCompatibility currently runs SHOW STATS only for tables with int columns.
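Such a test could be sketched roughly as below, in the TestIcebergSparkCompatibility style (the onSpark()/onTrino() helpers come from the existing product-test utilities; the table name, column, and the placeholder assertion are made up, and a real test would verify the reported data_size values and repeat per file format):

@Test
public void testTrinoShowStatsOnSparkWrittenVarcharColumn()
{
    String tableName = "default.test_spark_varchar_stats";
    onSpark().executeQuery("CREATE TABLE " + tableName + " (a_string STRING) USING ICEBERG");
    onSpark().executeQuery("INSERT INTO " + tableName + " VALUES ('first row'), ('second row')");
    // data_size for the varchar column should now come from the Iceberg column-size metadata
    QueryResult stats = onTrino().executeQuery("SHOW STATS FOR iceberg." + tableName);
    // ... assert on the data_size column of stats here ...
    onSpark().executeQuery("DROP TABLE " + tableName);
}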

@raunaqmorarka

Is the effect of the changed estimates applicable to BaseIcebergCostBasedPlanTest, or do we need to manually update something there?


findepi commented Nov 25, 2022

@raunaqmorarka it's on the read path, so no, we don't need to update the test data's metadata stored there.

@raunaqmorarka

@raunaqmorarka it's on the read path, so no, we don't need to update the test data's metadata stored there.

I see, I'm a bit surprised then that no TPC plan changed as a result of this.


findepi commented Nov 25, 2022

@raunaqmorarka can it be that both sides are equally "inflated" and so ideal plans don't change?

@raunaqmorarka

@raunaqmorarka can it be that both sides are equally "inflated" and so ideal plans don't change?

There could be an impact on the choice of broadcast join due to the max_broadcast_size threshold being breached. When all the terms are estimated, this would turn a broadcast join into a repartitioned join without impacting the ordering. When there are unestimated terms, we fall back to estimates of table sizes for the build vs. probe choice as well.
Maybe there is genuinely no impact on the TPC plans, but it would be good to manually confirm that the BaseIcebergCostBasedPlanTest tests are indeed experiencing this change, and also run our internal TPC benchmarks to confirm no regressions.


findepi commented Nov 28, 2022

Changing the columnSize = columnSize * 10; to multiply by 100 still doesn't cause any changes for TestIcebergTpcdsCostBasedPlan. My guess is these Iceberg ORC datasets were written with Trino and don't have column sizes in the metadata files.


findepi commented Nov 30, 2022

Indeed, with Iceberg Parquet plan tests (#15255), this change would be reflected in changed plans.

@homar homar changed the title Multiply on disk column sizes of parquet files by 10 for OCB stats Multiply on disk column sizes of parquet files by 10 for column stats Dec 1, 2022
@homar homar changed the title Multiply on disk column sizes of parquet files by 10 for column stats Multiply on disk column sizes of iceberg data files by 10 for column stats Dec 1, 2022
@homar homar changed the title Multiply on disk column sizes of iceberg data files by 10 for column stats Multiply on disk column sizes of iceberg data files by 4 for column stats Dec 1, 2022
@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch from e3eeea3 to 36e86f4 on December 1, 2022 14:03
else {
    if (idToType.get(columnHandleTuple.getKey()).typeId() == Type.TypeID.FIXED) {
        Types.FixedType fixedType = (Types.FixedType) idToType.get(columnHandleTuple.getKey());
        columnSize = (long) fixedType.length();
Member

That should also be filled in when summary.getColumnSizes().get(fieldId) == null.
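One way to do that, sketched against the names already used in the diff (idToType, columnHandleTuple) and therefore only approximate:

org.apache.iceberg.types.Type icebergType = idToType.get(columnHandleTuple.getKey());
if (icebergType != null && icebergType.typeId() == Type.TypeID.FIXED) {
    // For fixed(L) the length is known from the type alone, so it can be reported
    // even when summary.getColumnSizes() has no entry for this field.
    columnSize = (long) ((Types.FixedType) icebergType).length();
}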

}
else if (columnHandle.getBaseType() == VARBINARY) {
    // Tests showed that for VARCHAR multiplying by 1.4 seems like a good heuristic
    columnSize = (long) (columnSize * 1.4);
Member

Any comment on how 1.4 was chosen? (same for 2.7 above)
Can you add a comment?

Initially we thought 10x would be OK.
Are you measuring on some "real data", or using the "random hex" test case we run internally?

Also, note that size on disk can be very small when data is dictionary encoded. Maybe it's safer to overshoot rather than undershoot? cc @raunaqmorarka @sopel39 for over- vs under-estimation

Member Author

Comment added.

// Tested using item table from TPCDS benchmark
// compared column size of item_desc column stored inside files
// with length of values in that column reported by trino
columnSize = (long) (columnSize * 2.7);
}
else if (columnHandle.getBaseType() == VARBINARY) {
// Tested using VARBINARY columns with random, both in length and content, data
// compared column size stored inside parquet files with length of bytes written to it
columnSize = (long) (columnSize * 1.4);

Does it make sense?

Member

// Tested using item table from TPCDS benchmark
// compared column size of item_desc column stored inside files
// with length of values in that column reported by trino
columnSize = (long) (columnSize * 2.7);

this is awesome!

else if (columnHandle.getBaseType() == VARBINARY) {
// Tested using VARBINARY columns with random, both in length and content, data
// compared column size stored inside parquet files with length of bytes written to it
columnSize = (long) (columnSize * 1.4);

How was the data randomized?

Truly random data should not compress at all, so you should have gotten a factor of ~1.0.

Member Author

I took random strings, changed them to bytes and saved that data, then I compared the size on disk with the lengths of the byte arrays in memory. If you think that we should have 1.0 here then I can do it.

Member

Random ASCII or random full Unicode?

If you think that we should have 1.0 here then I can do it.

I did not say that. I am asking about methodology.

@homar homar Dec 5, 2022

I used org.apache.commons.lang.RandomStringUtils. From its javadoc: "Characters will be chosen from the set of alpha-numeric characters as indicated by the arguments."

I did not say that. I am asking about methodology.

Sure, I know, but when I thought more about it I tended to agree that this should be close to 1.0.

Member

It should be close to 1 for random bytes.
Since you were generating alpha-numeric characters, the byte repertoire was limited, so the data was compressible, hence the > 1.0 factor.

I am fine with keeping the 1.4 value, but we need a code comment saying that the methodology was not very real-life-like and that a better heuristic value could be proposed in the future.
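For instance, the heuristics could end up as documented constants along these lines (names and placement are illustrative, not the merged code):

// VARCHAR: measured on the TPC-DS item table, comparing the on-disk size of the item_desc
// column with the value lengths reported by Trino.
private static final double VARCHAR_ON_DISK_TO_MEMORY_FACTOR = 2.7;
// VARBINARY: measured on random alpha-numeric strings converted to bytes; the limited byte
// repertoire made the data compressible, so this is a rough value that intentionally overshoots.
private static final double VARBINARY_ON_DISK_TO_MEMORY_FACTOR = 1.4;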

Member Author

But is it better to keep it at 1.4 with a code comment, or to change it to 1? What do you think is better?

Member

Not a big difference; let's go with the bigger number.

The only important part is the code comment -- don't pretend this is some "very smart value".

Member

Also, note that size on disk can be very small when data is dictionary encoded. Maybe it's safer to overshoot rather than undershoot? cc @raunaqmorarka @sopel39 for over- vs under-estimation

Overshooting is safer.


findepi commented Dec 1, 2022

Please rebase on #15255 (or wait for it to be merged).
This will help us realize the benchmarks are due.

cc @nineinchnick @przemekak: how can we run benchmarks most easily for this change?

@przemekak

@findepi I think atm the easiest way is to run this workflow: https://github.com/starburstdata/benchmarks-gha/actions/workflows/standard-benchmarks.yaml

@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch 2 times, most recently from ddfd5ec to 0d08200 on December 2, 2022 10:10

findepi commented Dec 5, 2022

@homar thank you for re-running the CI.
The TestIcebergParquetTpchCostBasedPlan failed as expected. You can update the tests' expected plans with TestIcebergParquetTpchCostBasedPlan.main.


homar commented Dec 5, 2022

CI hit #15313

@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch from 0d08200 to 249e0de on December 5, 2022 14:34

findepi commented Dec 5, 2022

CI hit #8662

I don't see any new comment under that issue.
Did you mean to paste a link to the build run and a relevant excerpt from the logs?


homar commented Dec 5, 2022

I don't see any new comment under that issue.
Did you mean to paste a link to the build run and a relevant excerpt from the logs?

I only meant to mention that issue because the same test failed here. I probably should have pasted the logs from my failure there, but I didn't, and now I can't access them. Sorry.


findepi commented Dec 5, 2022

I probably should have pasted the logs from my failure there, but I didn't, and now I can't access them. Sorry.

no problem

btw all past runs are accessible here: https://github.com/trinodb/trino/actions/workflows/ci.yml?query=branch%3Ahomar%2Fcollect_memory_column_sizes_for_iceberg_with_parquet
(https://github.com/trinodb/trino/actions/workflows/ci.yml > select "ci" action on the sidebar > filter by your branch name)

@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch from 249e0de to 37e7224 on December 6, 2022 11:51
@homar homar force-pushed the homar/collect_memory_column_sizes_for_iceberg_with_parquet branch from 37e7224 to bc1c0a4 on December 7, 2022 13:41

homar commented Dec 8, 2022

homar test column sizes.pdf
This is the result of benchmarking this change. As the results are a little mixed, I am not able to draw any clear conclusion. @findepi @sopel39 @raunaqmorarka could you help here?


sopel39 commented Dec 8, 2022

homar test column sizes.pdf
This is the result of benchmarking this change. As the results are a little mixed, I am not able to draw any clear conclusion. @findepi @sopel39 @raunaqmorarka could you help here?

It looks like there is an improvement for TPC-DS.
Some queries in TPC-H look a bit flaky. Did the plan for tpch/q04 change?


findepi commented Dec 8, 2022

There are some ups and downs in the results, but overall they look OK to me.

(I defer to @sopel39 and @raunaqmorarka on final judgement)


findepi commented Dec 8, 2022

Did the plan for tpch/q04 change?

@sopel39 no (see https://github.com/trinodb/trino/pull/15186/files)


homar commented Dec 9, 2022

@sopel39 @raunaqmorarka any decision?


sopel39 commented Dec 9, 2022

@homar I say go ahead

@findepi findepi merged commit bdaf17c into trinodb:master Dec 9, 2022
@findepi findepi added the no-release-notes label Dec 9, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 9, 2022