Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add varchar to boolean coercion for hive tables #20741

Merged

Conversation

Praveen2112
Copy link
Member

@Praveen2112 Praveen2112 commented Feb 19, 2024

Description

Add varchar to boolean coercion for hive tables.

Enable the Hive users to retrieve the information from their tables after performing queries which change varchar/ string columns to boolean.

Hive sample DDL query:

ALTER TABLE mytable CHANGE COLUMN mycolumn mycolumn boolean;

The coercion logic for non ORC files is available here - https://github.com/apache/hive/blob/4df4d75bf1e16fe0af75aad0b4179c34c07fc975/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L559

The coercion logic for ORC files is available here - https://github.com/apache/orc/blob/fb1c4cb9461d207db652fc253396e57640ed805b/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L567

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
# Hive
* Add varchar to boolean coercion for hive tables. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Feb 19, 2024
@Praveen2112 Praveen2112 force-pushed the praveen/varchar_to_boolean_coercer branch 3 times, most recently from 701b4ba to 049241a Compare February 19, 2024 15:15
@Praveen2112 Praveen2112 requested a review from kokosing February 20, 2024 07:58
@Praveen2112 Praveen2112 force-pushed the praveen/varchar_to_boolean_coercer branch from 049241a to c1f1ae8 Compare February 20, 2024 10:01
assertVarcharToBooleanCoercion("OFF", false);
assertVarcharToBooleanCoercion("NO", false);
assertVarcharToBooleanCoercion("0", false);
assertVarcharToBooleanCoercion("", false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behaviour for space " " in Hive? Can we add test for this?

Copy link
Member Author

@Praveen2112 Praveen2112 Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replaced 'ON` with ' ' so as to verify the behavior

Unless the test requires additional modification by default we can
set HiveTimestampPrecision to DEFAULT and not treating NaN as null.
Rename `treatNaNAsNull` to `isOrcFile` as the changes in coercion are
seen only in case of ORC files
@Praveen2112 Praveen2112 force-pushed the praveen/varchar_to_boolean_coercer branch from c1f1ae8 to e1a6872 Compare February 20, 2024 11:46
@@ -73,6 +75,9 @@ private OrcTypeTranslator() {}
return Optional.of(new DateToVarcharCoercer(varcharType));
}
if (isVarcharType(fromOrcType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scaffolding with isOrcFile is done for ORC only.
OrcPageSourceFactory is using though OrcTypeTranslator (for both partitioned and unpartitioned tables) where we know that we're dealing with ORC.
The CoercionUtils class (which makes use of CoercionContext) has actually nothing to do with ORC changes at the moment.

Instead of adding special logic for ORC in the XToYCoercer classes, I'd be in favor of having separate coercer classes:

  • XToYCoercer
  • XToYOrcCoercer

This change would keep the XToYCoercer classes easier to follow.
I don't feel strongly about this chnage because the coercer classes are simple at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scaffolding with isOrcFile is done for ORC only.

True

OrcPageSourceFactory is using though OrcTypeTranslator (for both partitioned and unpartitioned tables) where we know that we're dealing with ORC.

This one is partially true - Though OrcTypeTranslator is well equipped with handling the coercions for partitioned table as well - We still rely on CoercionPolicy or CoercionUtil to control the coercion for partitioned layer - Today the coercion are applied at a layer above the file format specfic PageSourceProvider and we don't want to split it for each file format. So still for ORC files in case of partitioned tables we rely on CoercionUtils

I'd be in favor of having separate coercer classes:

We can handle them but for boolean, double or float the changes are fairly simple so we have constrained them within a class but for numeric coercions #20766 we have implemented something like a seperate coercer for ORC

assertVarcharToBooleanCoercion("TR", true);
assertVarcharToBooleanCoercion("T", true);
assertVarcharToBooleanCoercion("Y", true);
assertVarcharToBooleanCoercion("1", true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test for 0 in ORC as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have added it in testVarcharToBooleanForOrc

@Praveen2112 Praveen2112 force-pushed the praveen/varchar_to_boolean_coercer branch from e1a6872 to 03262c4 Compare February 20, 2024 12:14
@Praveen2112
Copy link
Member Author

@findinpath Based on this comment - I have experimented to extract a dedicated coercer for ORC

@findinpath findinpath requested a review from krvikash February 20, 2024 13:16
@Praveen2112 Praveen2112 force-pushed the praveen/varchar_to_boolean_coercer branch from 03262c4 to e5e30f9 Compare February 20, 2024 13:26
}
}

public static class OrcVarcharToBooleanCoercer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is is sad that ORC is special :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True

@Praveen2112 Praveen2112 force-pushed the praveen/varchar_to_boolean_coercer branch from e5e30f9 to a4dfcf6 Compare February 21, 2024 06:23
@Praveen2112
Copy link
Member Author

@kokosing AC

@Praveen2112 Praveen2112 merged commit 501a9eb into trinodb:master Feb 21, 2024
61 checks passed
@github-actions github-actions bot added this to the 440 milestone Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants