Reduce HiveSplit size for ORC, Parquet and RCFile tables #15601
It shows a 4x reduction in serialized HiveSplit size for TPC-H / TPC-DS schemas stored in ORC
"natively" |
@@ -165,6 +166,16 @@ public OrcPageSourceFactory(
        this.fileSystemFactory = requireNonNull(fileSystemFactory, "fileSystemFactory is null");
    }

    public static Properties stripUnnecessaryProperties(Properties schema)
    {
        if (isDeserializerClass(schema, OrcSerde.class) && !isFullAcidTable(Maps.fromProperties(schema))) {
Can we strip things other than [transactional_properties, transactional] for ACID tables as well? Or would we prefer to future-proof against unintended deviations in the hive-apache code? (Although, since this involves metadata state stored in the metastore, any changes in the Hive code would need to be backwards compatible anyway.)
I wanted to cover the trivial cases for now, and somehow I thought that ACID tables are not that common. But yeah, you are right, it looks like technically we only need the transactional_properties and transactional properties. I would prefer to open a follow-up PR though, so as not to delay the current one from landing.
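A minimal sketch of what the follow-up discussed above might look like, using only the JDK. The helper name stripForAcidTable is hypothetical and not part of this PR, and the key names are written as string literals on the assumption that they match the Hive constants used in the diff.

```java
import java.util.List;
import java.util.Properties;

public class AcidSchemaStripping
{
    // Assumed to match the SERIALIZATION_LIB constant referenced in the diff
    static final String SERIALIZATION_LIB = "serialization.lib";
    // The two ACID-related table properties named in the review thread
    static final List<String> ACID_KEYS = List.of("transactional", "transactional_properties");

    // Hypothetical helper: keeps the serde class plus the two ACID properties, drops the rest
    public static Properties stripForAcidTable(Properties schema)
    {
        Properties stripped = new Properties();
        copyIfPresent(schema, stripped, SERIALIZATION_LIB);
        for (String key : ACID_KEYS) {
            copyIfPresent(schema, stripped, key);
        }
        return stripped;
    }

    // Copies a single property only when it is set, since Properties rejects null values
    private static void copyIfPresent(Properties from, Properties to, String key)
    {
        String value = from.getProperty(key);
        if (value != null) {
            to.setProperty(key, value);
        }
    }

    public static void main(String[] args)
    {
        // Made-up schema for illustration only
        Properties schema = new Properties();
        schema.setProperty(SERIALIZATION_LIB, "org.apache.hadoop.hive.ql.io.orc.OrcSerde");
        schema.setProperty("transactional", "true");
        schema.setProperty("columns", "orderkey,custkey,totalprice");
        schema.setProperty("columns.types", "bigint:bigint:double");

        Properties stripped = stripForAcidTable(schema);
        System.out.println(stripped.stringPropertyNames());
    }
}
```

Whether these two properties are really all that the ACID read path needs is the open question from the thread; the sketch only shows the shape of the filtering.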
    {
        if (isDeserializerClass(schema, OrcSerde.class) && !isFullAcidTable(Maps.fromProperties(schema))) {
            Properties stripped = new Properties();
            stripped.put(SERIALIZATION_LIB, schema.getProperty(SERIALIZATION_LIB));
Should we check if schema.getProperty(SERIALIZATION_LIB) exists before putting? (in all 3 methods)
This is implied by the isDeserializerClass(schema, OrcSerde.class) check. Do you think it is worth adding an explicit check?
Ok. No, just asking
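For context on the exchange above: java.util.Properties rejects null values, so an unguarded put would throw a NullPointerException if the property were missing. The sketch below shows why the serde check makes an extra guard redundant. The isDeserializerClass here is a simplified stand-in that only compares class names (the real helper is not shown in this thread), and returning the original schema when the check fails is an assumption.

```java
import java.util.Properties;

public class SerdeGuard
{
    // Assumed to match the SERIALIZATION_LIB constant referenced in the diff
    static final String SERIALIZATION_LIB = "serialization.lib";
    static final String ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde";

    // Simplified stand-in for the check in the diff: a plain class-name comparison
    static boolean isDeserializerClass(Properties schema, String className)
    {
        return className.equals(schema.getProperty(SERIALIZATION_LIB));
    }

    public static Properties stripUnnecessaryProperties(Properties schema)
    {
        if (isDeserializerClass(schema, ORC_SERDE)) {
            // The check passes only when the property is present, so the value is never null here
            Properties stripped = new Properties();
            stripped.put(SERIALIZATION_LIB, schema.getProperty(SERIALIZATION_LIB));
            return stripped;
        }
        // Assumed fallback: leave the schema untouched for other formats
        return schema;
    }

    public static void main(String[] args)
    {
        Properties missing = new Properties();
        // No serde property set: the guard fails and put is never reached
        System.out.println(stripUnnecessaryProperties(missing) == missing);

        try {
            // Without the guard, a missing property would fail like this
            new Properties().put(SERIALIZATION_LIB, missing.getProperty(SERIALIZATION_LIB));
        }
        catch (NullPointerException e) {
            System.out.println("put rejected a null value");
        }
    }
}
```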
Description
The schema field can grow quite large, as it includes the full set of table and serde properties as well as columns, types and many other things. Storing, serializing and transferring it can be quite costly. Luckily, it is possible to avoid this for the natively supported formats.
Additional context and related issues
#15511
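To illustrate the point in the description about schema size, here is a rough sketch that compares a full schema with a stripped one. The property values are made up, and the byte count (total UTF-8 length of keys and values) is only a proxy; it is not how HiveSplit is actually serialized, so the numbers say nothing about the 4x figure above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class SchemaSizeEstimate
{
    // Rough proxy for serialized size: total UTF-8 bytes of all keys and values
    static int approximateSize(Properties properties)
    {
        int bytes = 0;
        for (String key : properties.stringPropertyNames()) {
            bytes += key.getBytes(StandardCharsets.UTF_8).length;
            bytes += properties.getProperty(key).getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes;
    }

    // Keeps a single property and drops everything else
    static Properties keepOnly(Properties schema, String key)
    {
        Properties stripped = new Properties();
        String value = schema.getProperty(key);
        if (value != null) {
            stripped.setProperty(key, value);
        }
        return stripped;
    }

    // Made-up table schema for illustration only
    static Properties sampleSchema()
    {
        Properties schema = new Properties();
        schema.setProperty("serialization.lib", "org.apache.hadoop.hive.ql.io.orc.OrcSerde");
        schema.setProperty("columns", "orderkey,custkey,orderstatus,totalprice,orderdate");
        schema.setProperty("columns.types", "bigint:bigint:string:double:date");
        schema.setProperty("name", "tpch.orders");
        schema.setProperty("comment", "illustrative table comment");
        return schema;
    }

    public static void main(String[] args)
    {
        Properties full = sampleSchema();
        Properties stripped = keepOnly(full, "serialization.lib");
        System.out.println("full schema bytes:     " + approximateSize(full));
        System.out.println("stripped schema bytes: " + approximateSize(stripped));
    }
}
```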
Release notes
(X) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: