We recently switched from PrestoDB to Trino and observed that the serialized `HiveSplit` is significantly larger, roughly 5x, resulting in a large amount of outgoing traffic from the coordinator to the workers, to the point of being capped by our available outgoing bandwidth. This can starve the workers in a large cluster.
Upon further investigation, we found that the `schema` field in `HiveSplit` (trino/plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplit.java, line 78 at 607decc) accounts for over 95% of the serialized size.
It appears that this field contains all of the table properties stored in the Hive metastore for the table/partition referenced by the split. In our case the table is generated by a separate Spark pipeline, and Spark writes a large amount of extra metadata to the table/partition properties; the amount can also vary with the file format used.
We have mitigated the problem by filtering out most of the table properties on our side, but I think there should be a filter in place for which table properties are included in the split.
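Roughly what our mitigation looks like, as a minimal sketch (the class name and allowlist are illustrative, not Trino's API; the exact keys needed depend on the file format and SerDe in use):

```java
import java.util.Properties;
import java.util.Set;

/**
 * Sketch: copy only the table properties actually needed to read the files
 * into the Properties object that ends up in the split, dropping
 * Spark-generated metadata (e.g. keys like "spark.sql.sources.schema.part.N").
 */
public final class SplitSchemaFilter
{
    // Hypothetical allowlist; adjust per file format / SerDe.
    private static final Set<String> REQUIRED_KEYS = Set.of(
            "serialization.lib",
            "file.inputformat",
            "file.outputformat",
            "columns",
            "columns.types");

    private SplitSchemaFilter() {}

    public static Properties filterSchema(Properties schema)
    {
        Properties filtered = new Properties();
        for (String key : schema.stringPropertyNames()) {
            if (REQUIRED_KEYS.contains(key)) {
                filtered.setProperty(key, schema.getProperty(key));
            }
        }
        return filtered;
    }
}
```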