
Table properties in HiveSplit exhaust outgoing bandwidth of coordinator #16081

Closed
wsmithril opened this issue Feb 13, 2023 · 2 comments

@wsmithril
We recently switched from PrestoDB to Trino and observed that serialized HiveSplits are significantly larger, roughly 5x, resulting in a significant amount of outgoing traffic from the coordinator to the workers, to the point of being capped by our available outgoing bandwidth. This starves the workers in a large cluster.

Upon further investigation, we found that the schema field of HiveSplit accounts for over 95% of the serialized size.

@JsonProperty("schema") Properties schema,

It seems this field contains all the table properties stored in the Hive metastore for the table/partition referenced by the split. In our case, the table is generated by a separate Spark pipeline, and Spark writes a large amount of extra metadata into the table/partition properties; how much may vary with the file format used.

We have mitigated this problem by filtering out most of the table properties. But I think there should be a filter in place for which table properties are included in the split.
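For illustration, here is a minimal sketch of the kind of allow-list filtering we applied. The class and method names, and the set of kept keys, are hypothetical, not Trino's actual code; the point is simply to copy only the properties the readers need and drop the bulky Spark-generated metadata before the split is serialized.

```java
import java.util.Properties;
import java.util.Set;

public class SplitSchemaFilter {
    // Hypothetical allow-list: keys assumed to be needed by readers.
    // Real Hive/Trino property names may differ.
    private static final Set<String> ALLOWED_KEYS = Set.of(
            "columns",
            "columns.types",
            "serialization.lib",
            "file.inputformat",
            "file.outputformat");

    // Return a copy of the schema containing only allow-listed keys,
    // dropping everything else (e.g. Spark-written table metadata).
    public static Properties filterSchema(Properties schema) {
        Properties filtered = new Properties();
        for (String key : schema.stringPropertyNames()) {
            if (ALLOWED_KEYS.contains(key)) {
                filtered.setProperty(key, schema.getProperty(key));
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        Properties schema = new Properties();
        schema.setProperty("columns", "id,name");
        // Stand-in for the large Spark-generated metadata entries.
        schema.setProperty("spark.sql.sources.schema.part.0", "{...}");
        Properties filtered = filterSchema(schema);
        // Only the allow-listed key survives.
        System.out.println(filtered.stringPropertyNames());
    }
}
```

An allow-list is safer here than a deny-list: unknown keys added by future writers are dropped by default instead of silently inflating every split.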

@raunaqmorarka
Member

Which release are you using?
This was addressed by #15601 in the 406 release.

@wsmithril
Author

It's an older release, 377, and that commit fixes this issue. I'll close it. Thanks.
