Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce HiveSplit size for ORC, Parquet and RCFile tables #15601

Merged
merged 1 commit into from
Jan 5, 2023

Conversation

arhimondr
Copy link
Contributor

@arhimondr arhimondr commented Jan 4, 2023

Description

The schema field can grow quite large as it includes the full set of table and serde properties as well as columns, types and many other things. Storing, serializing and transferring it can be quite costly. Luckily it is possible to avoid it for the natively supported formats.

Additional context and related issues

#15511

Release notes

(X) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@arhimondr
Copy link
Contributor Author

It shows 4x reduction in serialized HiveSplit size for TPC-H / TPC-DS schemas stored in ORC

@findepi
Copy link
Member

findepi commented Jan 4, 2023

Luckily it is possible to avoid it for the naively supported formats.

"natively"

@@ -165,6 +166,16 @@ public OrcPageSourceFactory(
this.fileSystemFactory = requireNonNull(fileSystemFactory, "fileSystemFactory is null");
}

public static Properties stripUnnecessaryProperties(Properties schema)
{
if (isDeserializerClass(schema, OrcSerde.class) && !isFullAcidTable(Maps.fromProperties(schema))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we strip things other than [transactional_properties, transactional] for acid tables as well - Or we're prefer to future-proof for unintended deviations in the hive-apache code? (Although since this involves metadata state stored in the metastore, any changes in hive code would need to be backwards compatible anyway ? )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to cover trivial cases for now and somehow I thought that ACID tables are not that common. But yeah, you are right, it looks like technically we only need the transactional_properties and transactional properties. I would prefer to open a follow up PR though to do not delay the current one from landing.

@assaf2 assaf2 self-requested a review January 5, 2023 10:14
{
if (isDeserializerClass(schema, OrcSerde.class) && !isFullAcidTable(Maps.fromProperties(schema))) {
Properties stripped = new Properties();
stripped.put(SERIALIZATION_LIB, schema.getProperty(SERIALIZATION_LIB));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if schema.getProperty(SERIALIZATION_LIB) exists before putting? (in all 3 methods)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implied by the isDeserializerClass(schema, OrcSerde.class). Do you think it is worth adding an explicit check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. No, just asking

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants