We recently switched from PrestoDB to Trino and observed that the serialized `HiveSplit` is significantly larger, roughly 5x, resulting in a large amount of outgoing traffic from the coordinator to the workers, to the point of being capped by our available outgoing bandwidth. This can starve the workers in a large cluster.
Upon further investigation, we found that the `schema` field in `HiveSplit` (trino/plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplit.java, line 78 at 607decc) accounts for over 95% of the serialized size.
It appears that this field contains all of the table properties stored in the Hive metastore for the table/partition referenced by the split. In our case the table is generated by a separate Spark pipeline, and Spark writes a large amount of extra metadata to the table/partition properties; the amount can also vary with the file format used.
We have mitigated the problem by filtering out most of the table properties on our side, but I think there should be a filter in place for which table properties are included in the split.
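Roughly what our mitigation looks like, as a minimal sketch (the class name and allowlist are illustrative, not Trino's API; the exact keys needed depend on the file format and SerDe in use):

```java
import java.util.Properties;
import java.util.Set;

/**
 * Sketch: copy only the table properties actually needed to read the files
 * into the Properties object that ends up in the split, dropping
 * Spark-generated metadata (e.g. keys like "spark.sql.sources.schema.part.N").
 */
public final class SplitSchemaFilter
{
    // Hypothetical allowlist; adjust per file format / SerDe.
    private static final Set<String> REQUIRED_KEYS = Set.of(
            "serialization.lib",
            "file.inputformat",
            "file.outputformat",
            "columns",
            "columns.types");

    private SplitSchemaFilter() {}

    public static Properties filterSchema(Properties schema)
    {
        Properties filtered = new Properties();
        for (String key : schema.stringPropertyNames()) {
            if (REQUIRED_KEYS.contains(key)) {
                filtered.setProperty(key, schema.getProperty(key));
            }
        }
        return filtered;
    }
}
```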