Fix dereference operations for union type in Hive Connector #15278

Merged: phd3 merged 1 commit into trinodb:master from the unionDereference branch on Dec 12, 2022

Contributor

@leetcode-1533 leetcode-1533 commented Dec 2, 2022

Description

Support dereferencing using field names such as "field0", "field1", "tag" for Hive's Union Type.
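
As a rough illustration of what dereferencing by these names means, the sketch below models a union value as a row with a `tag` field plus one `fieldN` slot per member type, which is what the field names above suggest. `UnionRowSketch`, `toRow`, and `dereference` are hypothetical names for this example only; this is not the connector's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: a Hive uniontype value viewed as a row with a "tag"
// field plus one "fieldN" slot per member type. Not the connector's code.
final class UnionRowSketch
{
    private UnionRowSketch() {}

    // Builds the row view; only the slot selected by the tag is non-null.
    static Map<String, Object> toRow(int tag, Object value, int memberCount)
    {
        if (tag < 0 || tag >= memberCount) {
            throw new IllegalArgumentException("tag out of range: " + tag);
        }
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("tag", tag);
        for (int i = 0; i < memberCount; i++) {
            row.put("field" + i, i == tag ? value : null);
        }
        return row;
    }

    // Looks a field up by name, similar in spirit to `SELECT col.field1 FROM t`.
    static Object dereference(Map<String, Object> row, String fieldName)
    {
        if (!row.containsKey(fieldName)) {
            throw new IllegalArgumentException("no such field: " + fieldName);
        }
        return row.get(fieldName);
    }

    public static void main(String[] args)
    {
        // A uniontype<int,string> value holding the string member (tag 1).
        Map<String, Object> row = toRow(1, "first", 2);
        System.out.println(dereference(row, "tag"));
        System.out.println(dereference(row, "field0"));
        System.out.println(dereference(row, "field1"));
    }
}
```

In this model, the slots not selected by the tag are null, so dereferencing `field0` on a value tagged 1 yields null rather than an error.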

Additional context and related issues

#15017,
#3483,
e071da4

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive Connector
*  Fix queries referencing nested fields in union typed columns. ({issue}`15278`)

@cla-bot cla-bot bot added the cla-signed label Dec 2, 2022
@leetcode-1533 leetcode-1533 marked this pull request as draft December 2, 2022 08:56
@leetcode-1533 leetcode-1533 marked this pull request as ready for review December 2, 2022 22:51
@phd3 phd3 self-requested a review December 2, 2022 23:02
@phd3 phd3 changed the title from "Support dereference for union type" to "Fix dereference operations for union type in Hive Connector" Dec 4, 2022
private void testAvroUnionTypeDereference(String tableName)
{
/*
* On hive 1.2, when create AVRO table with nested format, such as: uniontype<struct<unionlevel1:uniontype<string,int>>>
Member

I'm not sure I understand this part ... Can you please elaborate what's the issue here?

Contributor Author

The doubly nested column I created, union<struct>>, gets interpreted as union<> for the Avro type.

Contributor Author

This issue is unrelated to the fix itself, but because of it I can't test the doubly nested case with Avro.

Member

I see. Since union types typically have multiple fields, I'm curious: if we have something like union<int, struct<int, int>>, what is it read as?

@cla-bot

cla-bot bot commented Dec 7, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email email@example.com`
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

Member

@phd3 phd3 left a comment

Some comments, but overall this looks good.
Please look at the build failure and also the git author details; the CLA currently shows as not signed.

@leetcode-1533 leetcode-1533 force-pushed the unionDereference branch 2 times, most recently from 9eb679c to 2f5d20a December 8, 2022 00:26
@phd3
Member

phd3 commented Dec 8, 2022

I tried this out for Avro. It looks like when there is just one type within the uniontype, Hive tries to be smart about the table schema by removing the union, but then the insertion fails.

e.g. this fails

use default;
drop table complexavrobase;
create table complexavrobase (c0 uniontype<struct<a:int,b:string,c:uniontype<string,int>>>, c1 int) stored as AVRO;
insert into complexavrobase SELECT create_union(0, named_struct('a', 1, 'b', 'structval', 'c', create_union(1, 'd', 5))), 8 FROM (select 'ignore') ignore;

But if the union actually has more than one type, it succeeds.

use default;
drop table complexavro;
create table complexavro (c0 uniontype<int,string,struct<a:int,b:string,c:uniontype<string,int>>>, c1 int) stored as AVRO;
insert into complexavro SELECT create_union(0, 100,  'first', named_struct('a', 1, 'b', 'structval', 'c', create_union(0, 'd', 5))), 8 from (select 'ignore') ignore;
insert into complexavro SELECT create_union(1, 100,  'first', named_struct('a', 1, 'b', 'structval', 'c', create_union(0, 'd', 5))), 8 from (select 'ignore') ignore;
SELECT * FROM complexavro;
insert into complexavro SELECT create_union(2, 100,  'first', named_struct('a', 1, 'b', 'structval', 'c', create_union(0, 'd', 5))), 8 from (select 'ignore') ignore;

However, Hive fails to read back the results after this. IMO Trino should still be able to read them back. Could you please modify the test so that we have coverage for both?

@leetcode-1533
Contributor Author

> e.g. this fails

I would keep the original test as well, since nothing requires a union type to have more than one member type.

@leetcode-1533 leetcode-1533 force-pushed the unionDereference branch 2 times, most recently from f89579c to e144e74 December 8, 2022 22:13
@leetcode-1533
Contributor Author

Pushed, please take another look!

@leetcode-1533
Contributor Author

Summary of the bugs we found when using Hive to insert union columns into AVRO format tables:

  1. When inserting values into a nested Union<Struct<>> where the Union has only one field, a struct, the Union level is ignored. That is, "show create table" shows the correct schema, but on insert Hive complains about a schema mismatch.
  2. When inserting values into a nested Union<Struct<Union<int, string>>>, we have to insert using create_union(tagId, string, int) instead of create_union(tagId, int, string).
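
To make the argument-order point in item 2 concrete, here is a minimal sketch of how `create_union(tag, v0, v1, ...)` selects a value, assuming it keeps the argument the tag points at. `CreateUnionSketch`, `Tagged`, and `matchesSchema` are hypothetical names for this example; it models the ordering concern only and does not reproduce Hive's Avro behavior.

```java
import java.util.List;

// Hypothetical model of create_union(tag, v0, v1, ...): the result keeps the
// tag and the argument at that position. This is not Hive's implementation.
final class CreateUnionSketch
{
    private CreateUnionSketch() {}

    // Simple holder for a tagged union value.
    static final class Tagged
    {
        final int tag;
        final Object value;

        Tagged(int tag, Object value)
        {
            this.tag = tag;
            this.value = value;
        }
    }

    static Tagged createUnion(int tag, Object... values)
    {
        if (tag < 0 || tag >= values.length) {
            throw new IllegalArgumentException("tag out of range: " + tag);
        }
        return new Tagged(tag, values[tag]);
    }

    // True when the selected value is an instance of the member type that the
    // schema lists at the tag's position.
    static boolean matchesSchema(Tagged union, List<Class<?>> memberTypes)
    {
        return union.tag < memberTypes.size()
                && memberTypes.get(union.tag).isInstance(union.value);
    }

    public static void main(String[] args)
    {
        // Member order the writer expects, assumed here to be (string, int).
        List<Class<?>> expected = List.of(String.class, Integer.class);
        System.out.println(matchesSchema(createUnion(1, "d", 5), expected));
        System.out.println(matchesSchema(createUnion(1, 5, "d"), expected));
    }
}
```

Under this model, passing the arguments in an order that differs from the member order the writer expects selects a value of the wrong type, which is one plausible way a schema mismatch could surface.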

@phd3 phd3 merged commit a46a510 into trinodb:master Dec 12, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 12, 2022