Multiply on disk column sizes of iceberg data files by 4 for column stats #15186
Conversation
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsMaker.java
Please rename "OCB stats" to "column stats".
@@ -124,6 +126,11 @@ private TableStatistics makeTableStatistics(IcebergTableHandle tableHandle)
        if (summary.getColumnSizes() != null) {
            Long columnSize = summary.getColumnSizes().get(fieldId);
            if (columnSize != null) {
                if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
nit: you'd typically static import these constants and compare them using ==
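For illustration, the comparison the nit is pointing at would look roughly like this (a sketch; the static imports refer to the standard Trino type singletons, so identity comparison is safe):

import static io.trino.spi.type.VarbinaryType.VARBINARY;
import static io.trino.spi.type.VarcharType.VARCHAR;

// identity comparison against the singleton type constants instead of equals()
if (columnHandle.getBaseType() == VARCHAR || columnHandle.getBaseType() == VARBINARY) {
    columnSize = columnSize * 10;
}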
if (columnHandle.getBaseType().equals(VarcharType.VARCHAR) || columnHandle.getBaseType().equals(VarbinaryType.VARBINARY)) {
    // columnSize value is in fact size of column stored on disk which is after compression and is much smaller than
    // the column size in memory. Multiplying by 10 seems like a good heuristic to compensate for that
    columnSize = columnSize * 10;
- The statistical compression factor for binary is likely different than for text. I don't know whether we have any "data" to back up the * 10 number.
- When the Iceberg type is fixed(L), the Trino type is VARBINARY, but Iceberg actually knows that each cell is L bytes. Can we tell the engine about that?
maybe something like this
if (columnHandle.getBaseType() instanceof FixedWidthType) {
// columnSize is the size on disk and Trino column stats is size in memory.
// The relation between the two is type and data dependent.
// However, Trino currently does not use data size statistics for fixed-width types
// (it's not needed for them), so do not report it at all, to avoid reporting some bogus value.
}
else
Or maybe report columnSize only when the column type is varchar/varbinary.
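A minimal sketch of that alternative, keeping the names from the PR (whether the surrounding code skips the statistic when columnSize ends up null is an assumption):

if (columnHandle.getBaseType() == VARCHAR || columnHandle.getBaseType() == VARBINARY) {
    // scale the on-disk (compressed) size up to approximate the in-memory size
    columnSize = columnSize * 10;
}
else {
    // for other types the on-disk-to-in-memory relation is unclear, so report no data size at all
    columnSize = null;
}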
The title is wrong. The code change applies to Parquet and ORC files (and Avro) equally.
Can you also add a test that exercises Trino SHOW STATS on a table that was created and written by Spark, for each file format? TestIcebergSparkCompatibility currently runs SHOW STATS only for tables with int columns.
Is the effect of the changed estimates applicable to …
@raunaqmorarka it's on the read path, so no, we don't need to update the test data's metadata stored there.
I see, I'm a bit surprised then that no TPC plan changed as a result of this.
@raunaqmorarka can it be that both sides are equally "inflated" and so ideal plans don't change?
There could be an impact on the choice of broadcast join due to the max_broadcast_size threshold being breached. When all the terms are estimated, this would turn a broadcast join into a repartitioned join without impacting the ordering. When there are unestimated terms, we fall back to estimates of table sizes for the build vs. probe choice as well.
Changing the …
Indeed, with Iceberg Parquet plan tests (#15255), this change would be reflected in changed plans.
else {
    if (idToType.get(columnHandleTuple.getKey()).typeId() == Type.TypeID.FIXED) {
        Types.FixedType fixedType = (Types.FixedType) idToType.get(columnHandleTuple.getKey());
        columnSize = (long) fixedType.length();
That should also be filled in when summary.getColumnSizes().get(fieldId) == null.
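In other words, something along these lines (a rough sketch reusing names from the surrounding method; the exact control flow is an assumption):

Type icebergType = idToType.get(columnHandleTuple.getKey());
if (icebergType.typeId() == Type.TypeID.FIXED) {
    // every value of a fixed(L) column is exactly L bytes, so this estimate
    // does not depend on summary.getColumnSizes() containing an entry for the field
    columnSize = (long) ((Types.FixedType) icebergType).length();
}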
}
else if (columnHandle.getBaseType() == VARBINARY) {
    // Tests showed that for VARCHAR multiplying by 1.4 seems like a good heuristic
    columnSize = (long) (columnSize * 1.4);
Any comment on how 1.4 was chosen? (Same for 2.7 above.) Can you add a comment? Initially we thought 10x would be OK.
Are you measuring on some "real data", or using the "random hex" test case we run internally?
Also, note that the size on disk can be very small when data is dictionary encoded. Maybe it's safer to overshoot rather than undershoot? cc @raunaqmorarka @sopel39 for over- vs under-estimation
Comment added.
// Tested using item table from TPCDS benchmark
// compared column size of item_desc column stored inside files
// with length of values in that column reported by trino
columnSize = (long) (columnSize * 2.7);
}
else if (columnHandle.getBaseType() == VARBINARY) {
// Tested using VARBINARY columns with random, both in length and content, data
// compared column size stored inside parquet files with length of bytes written to it
columnSize = (long) (columnSize * 1.4);
Does it make sense?
// Tested using item table from TPCDS benchmark
// compared column size of item_desc column stored inside files
// with length of values in that column reported by trino
columnSize = (long) (columnSize * 2.7);
this is awesome!
else if (columnHandle.getBaseType() == VARBINARY) {
// Tested using VARBINARY columns with random, both in length and content, data
// compared column size stored inside parquet files with length of bytes written to it
columnSize = (long) (columnSize * 1.4);
How was the data randomized?
Truly random data should not compress at all, so you should have gotten a ~1.0 factor.
I took random strings, changed them to bytes and saved that data, then I compared the size on disk with the lengths of the byte arrays in memory. If you think we should have 1.0 here then I can do it.
Random ASCII or random full Unicode?

> If you think we should have 1.0 here then I can do it.

I did not say that. I am asking about methodology.
I used org.apache.commons.lang.RandomStringUtils. From its javadoc: "Characters will be chosen from the set of alpha-numeric characters as indicated by the arguments."

> I did not say that. I am asking about methodology.

Sure, I know, but when I thought more about it I tend to agree that this should be close to 1.0.
It should be close to 1 for random bytes.
Since you were generating alpha-numeric characters, the byte repertoire was limited, thus the data was compressible, and hence a > 1.0 factor.
I am fine keeping the 1.4 value, but we need to code-comment that the methodology was not very real-life-like and that some better heuristic value could be proposed in the future.
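A quick standalone sketch (plain JDK, not part of the PR) that illustrates the compressibility point: Deflate barely shrinks truly random bytes, while random alphanumeric bytes compress noticeably because the byte repertoire is limited. Parquet's actual codecs and encodings differ, so the numbers are only indicative.

import java.util.Random;
import java.util.zip.Deflater;

public class CompressionRatioCheck
{
    public static void main(String[] args)
    {
        Random random = new Random(42);

        // full byte repertoire: essentially incompressible, ratio close to 1.0
        byte[] randomBytes = new byte[1_000_000];
        random.nextBytes(randomBytes);

        // alpha-numeric only: limited repertoire, compresses well, ratio clearly above 1.0
        byte[] alphanumericBytes = new byte[1_000_000];
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        for (int i = 0; i < alphanumericBytes.length; i++) {
            alphanumericBytes[i] = (byte) alphabet.charAt(random.nextInt(alphabet.length()));
        }

        System.out.printf("random bytes ratio:       %.2f%n", inMemoryToCompressedRatio(randomBytes));
        System.out.printf("alphanumeric bytes ratio: %.2f%n", inMemoryToCompressedRatio(alphanumericBytes));
    }

    // in-memory size divided by Deflate-compressed size
    private static double inMemoryToCompressedRatio(byte[] input)
    {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length + 1024];
        int compressedSize = deflater.deflate(buffer);
        deflater.end();
        return (double) input.length / compressedSize;
    }
}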
But is it better to keep it at 1.4 with a code comment, or to change it to 1? What do you think is better?
Not a big difference; let's go with the bigger number.
The only important part is the code comment -- don't pretend this is some "very smart value".
> Also, note that the size on disk can be very small when data is dictionary encoded. Maybe it's safer to overshoot rather than undershoot? cc @raunaqmorarka @sopel39 for over- vs under-estimation

Overshooting is safer.
Please rebase on #15255 (or wait for it to be merged). cc @nineinchnick @przemekak: how can we run benchmarks most easily for this change?
@findepi I think at the moment the easiest way is to run this workflow: https://github.com/starburstdata/benchmarks-gha/actions/workflows/standard-benchmarks.yaml
@homar thank you for re-running the CI.
CI hit #15313
I don't see any new comment under that issue.
I only meant to mention that issue because the same test failed here. I probably should have pasted logs from my failure there, but I didn't and now I can't access them. Sorry.
No problem. BTW, all past runs are accessible here: https://github.com/trinodb/trino/actions/workflows/ci.yml?query=branch%3Ahomar%2Fcollect_memory_column_sizes_for_iceberg_with_parquet
Attached: homar test column sizes.pdf
It looks like there is an improvement for TPC-DS.
There are some ups and downs in the results, but overall they look OK to me. (I defer to @sopel39 and @raunaqmorarka on final judgement.)
@sopel39 no (see https://github.com/trinodb/trino/pull/15186/files)
@sopel39 @raunaqmorarka any decision?
@homar I say go ahead
Description
These stats are used by the CBO, which expects sizes of data as it resides in memory.
The idea is to use the column sizes from Iceberg metadata, stored there by the Parquet writer. From some tests I ran, multiplying by 10 looks like a good heuristic.
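For reference, the heuristic the change converged on during review looks roughly like this (a condensed sketch of the snippets discussed above, not the exact final diff):

if (icebergType.typeId() == Type.TypeID.FIXED) {
    // fixed(L): every value occupies exactly L bytes in memory
    columnSize = (long) ((Types.FixedType) icebergType).length();
}
else if (columnHandle.getBaseType() == VARCHAR) {
    // on-disk size is post-compression; 2.7 was measured against the item_desc column of the TPC-DS item table
    columnSize = (long) (columnSize * 2.7);
}
else if (columnHandle.getBaseType() == VARBINARY) {
    // 1.4 was measured on random alphanumeric strings converted to bytes (see the review discussion above)
    columnSize = (long) (columnSize * 1.4);
}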
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: