Add hudi connector #17149

Merged
merged 1 commit into from
Jun 13, 2022

@7c00 7c00 commented Dec 31, 2021

Add hudi connector

Ref: #17006

Test plan - Unit tests.

== RELEASE NOTES ==

Connectors
* Introduce Hudi connector

@rongrong rongrong marked this pull request as ready for review January 13, 2022 03:23
@7c00 7c00 force-pushed the presto-hudi branch 3 times, most recently from dc1acb2 to 75b1762 Compare January 17, 2022 03:12
@codope
Contributor

codope commented Jan 20, 2022

@rohanpednekar
Contributor

CC @pratyakshsharma @imjalpreet

@7c00 7c00 force-pushed the presto-hudi branch 2 times, most recently from fbb24fd to df82b6d Compare January 27, 2022 02:47
@7c00
Member Author

7c00 commented Jan 27, 2022

Thanks @codope @rohanpednekar!
Hello @pratyakshsharma @imjalpreet @yihua @bhasudha @garyli1019 @agrawaldevesh, could you please help review this PR? Thanks in advance.

@7c00 7c00 force-pushed the presto-hudi branch 2 times, most recently from d3c130c to 06ab601 Compare January 29, 2022 07:55
Contributor

@codope codope left a comment

@7c00 Thanks for the contribution. This would be pretty useful for the Hudi community.

I took one pass and left some comments. I have a high-level question: can you please add to the description which Hudi table types and query types are currently supported?

private static InputFormat<?, ?> createInputFormat(Configuration conf, String inputFormat)
{
    try {
        Class<?> clazz = conf.getClassByName(inputFormat);

Contributor

Instead of a reflection-based check, we can use the annotations provided by Hudi. Check these two annotations in Hudi:

UseRecordReaderFromInputFormat
UseFileSplitsFromInputFormat

Member Author

The annotations UseRecordReaderFromInputFormat and UseFileSplitsFromInputFormat indicate that we could delegate the splitting to InputFormat. As we discussed in the design proposal, there are some performance issues in the splitting implementation from InputFormat. So we will ignore these annotations and provide a new splitting implementation optimized for Presto.

        format("Invalid partition name %s for partition columns %s", partitionName, partitionColumns));
Partition partition = metastore.getPartition(context, databaseName, tableName, partitionValues)
        .orElseThrow(() -> new PrestoException(HUDI_INVALID_METADATA, format("Partition %s expected but not found", partitionName)));
Map<String, String> keyValues = zipPartitionKeyValues(partitionColumns, partitionValues);

Contributor

Not sure if you're considering the different partition extractors in hudi-hive sync. Depending on the partition extractor, partition values can change.

Member Author

We now retrieve partition metadata through HiveMetaStore, where the stored partition values have already been extracted from the partition locations by a partition extractor, so we don't need to take the partition extractors into consideration.
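
The pairing step in the snippet above can be sketched as follows. This is a minimal illustration only: `PartitionKeyValues.zip` is a hypothetical stand-in for the PR's `zipPartitionKeyValues`, whose real signature and column type are not shown in this thread, and plain strings are assumed for both names and values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionKeyValues
{
    // Hypothetical helper: pairs each partition column name with the value the
    // metastore already stores for it, preserving column order.
    public static Map<String, String> zip(List<String> columnNames, List<String> values)
    {
        if (columnNames.size() != values.size()) {
            throw new IllegalArgumentException(String.format(
                    "Expected %d partition values but got %d", columnNames.size(), values.size()));
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            result.put(columnNames.get(i), values.get(i));
        }
        return result;
    }
}
```

Because the values come from the metastore rather than from parsing the partition path, nothing in this sketch depends on which partition extractor produced them.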

@7c00 7c00 force-pushed the presto-hudi branch 2 times, most recently from 83edce7 to 23c5fd9 Compare February 14, 2022 12:22
@7c00
Member Author

7c00 commented Feb 14, 2022

@pratyakshsharma @codope Thanks for your comments. I have updated the code according to the comments. Could you please take a second look? Thanks in advance.

@7c00 7c00 requested a review from pratyakshsharma February 14, 2022 13:00
@pratyakshsharma
Contributor

@codope Can you please take another pass on this?

@codope
Contributor

codope commented Feb 21, 2022

@pratyakshsharma @codope Thanks for your comments. I have updated the code according to the comments. Could you please take a second look? Thanks in advance.

Will review by Wednesday.

@7c00
Member Author

7c00 commented Feb 25, 2022

ping @codope

@codope
Contributor

codope commented Feb 25, 2022

@7c00 I have taken one more pass. The code looks fine from the Hudi perspective.
@arunthirupathi Could you or someone in the community help take this forward?
cc @agrawaldevesh

@arunthirupathi

This contains two sets of changes: changes to hive/metastore and the Hudi connector. Can you please open one PR for the hive/metastore changes alone and another for the Hudi connector?

I will review the hive/metastore changes closely.
Hudi folks, if you are happy with it, I will glance through it and help you merge.

@7c00
Member Author

7c00 commented Feb 28, 2022

@arunthirupathi I have extracted the metastore-related changes into a new PR at #17368. Could you review it? Thank you in advance ❤️.

@7c00 7c00 force-pushed the presto-hudi branch 2 times, most recently from aa78a34 to db07aa6 Compare March 10, 2022 13:58
@7c00
Member Author

7c00 commented Mar 10, 2022

I have refactored the tests. Instead of dynamically generating tables during testing, the new tests use the pre-generated data from the Hudi Docker demo. The reasons are:

  • it's hard to make PrestoExtendedFileSystemCache happy with HoodieWrapperFileSystem; the issue was worked around in a very tricky way;
  • HoodieJavaClient cannot create a MOR table with log files, which makes the tests incomplete.

@codope @pratyakshsharma Could you take a second look? Nearly all the changes come from the test part. After that, I'd like to invite the Presto folks to review the PR.

@pratyakshsharma
Contributor

I am actually travelling this weekend. Will take a look early next week.

@arunthirupathi

What is the size of the jar that gets built? A few tens of MB is fine, but if it is really large, please look into reducing the size.

When it is ready for review, please ping me again.

For specific connectors, we require it to be thoroughly reviewed by someone else familiar with the connector. Once that review is done, I will make a pass to ensure it meets the coding guidelines.

@7c00
Member Author

7c00 commented Apr 11, 2022

@arunthirupathi Sorry for the late reply.

What is the size of the jar that gets built? A few tens of MB is fine, but if it is really large, please look into reducing the size.

We are reusing dependencies from the presto-hive module and have added no new dependencies, so the total size of the Hudi connector will be less than that of the Hive connector.

When it is ready for review, please ping me again.

For specific connectors, we require it to be thoroughly reviewed by someone else familiar with the connector. Once that review is done, I will make a pass to ensure it meets the coding guidelines.

This PR is currently blocked by the Avro dependency issue (found in #17463), which is to be fixed in the next Hudi release, 0.11. When Hudi 0.11 is released, we will continue working on this PR and invite you to review. Thank you in advance.

cc @codope

@codope
Contributor

codope commented May 2, 2022

This PR is currently blocked by the Avro dependency issue (found in #17463), which is to be fixed in the next Hudi release, 0.11. When Hudi 0.11 is released, we will continue working on this PR and invite you to review. Thank you in advance.

@7c00 Hudi v0.11 is released https://mvnrepository.com/artifact/org.apache.hudi/hudi-presto-bundle/0.11.0
Can you please upgrade and rebase this PR?

@7c00
Member Author

7c00 commented May 25, 2022

The code has been rebased onto the master branch after #17463. It's ready for review now.

@codope @pratyakshsharma @arunthirupathi @kewang1024 @highker Could you take a look at this PR? Thanks in advance! ❤️

@pratyakshsharma
Contributor

@pettyjamesm Please take a look when you get a chance.

@arunthirupathi arunthirupathi left a comment

Mostly looks good, left some nits.

Sorry, my notifications in Presto are messed up. I will fix that so that I have a quicker turnaround time next time.

@Override
public String toString()
{
    return id + ":" + name + ":" + hiveType + ":" + columnType;

Can you please use StringHelper to convert to String?

Member Author

The toString method here follows the convention of HiveColumnHandle#toString. This gives us more compact text (referring to a Hudi column) in log output.

@Override
public String toString()
{
    return path + ":" + start + "+" + length;

Member Author

The same reason as HudiColumnHandle#toString

Comment on lines +129 to +147
int batchSize = dataPage.getPositionCount();
Block[] blocks = new Block[prefilledBlocks.length];
for (int i = 0; i < prefilledBlocks.length; i++) {
    if (prefilledBlocks[i] != null) {
        blocks[i] = new RunLengthEncodedBlock(prefilledBlocks[i], batchSize);
    }
    else {
        blocks[i] = dataPage.getBlock(delegateIndexes[i]);
    }
}
return new Page(batchSize, blocks);

This code, though correct, is inefficient. This is one of the hot paths. If there are no prefilled blocks, short-circuit and return the delegate page.

Member Author

Good catch, thank you. Added a short circuit for when there is no actual prefilled block.
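
The short circuit discussed here might look roughly like the sketch below. It is not the code merged in the PR: `SimplePage` and plain `List` columns are simplified stand-ins for Presto's `Page` and `Block`, `Collections.nCopies` stands in for `RunLengthEncodedBlock`, and a `null` entry in `prefilledValues` is assumed to mean "read this column from the delegate page". The sketch also assumes the delegate page may only be returned as-is when its columns are already in output order.

```java
import java.util.Collections;
import java.util.List;

public class PrefilledPageSketch
{
    // Simplified stand-in for a page: a row count plus one column per channel.
    public static final class SimplePage
    {
        public final int positionCount;
        public final List<?>[] blocks;

        public SimplePage(int positionCount, List<?>[] blocks)
        {
            this.positionCount = positionCount;
            this.blocks = blocks;
        }
    }

    private final Object[] prefilledValues;
    private final int[] delegateIndexes;
    private final boolean passThrough;

    public PrefilledPageSketch(Object[] prefilledValues, int[] delegateIndexes)
    {
        this.prefilledValues = prefilledValues;
        this.delegateIndexes = delegateIndexes;
        // Decide once, outside the hot path, whether any column needs rewriting.
        boolean identity = true;
        for (int i = 0; i < prefilledValues.length; i++) {
            if (prefilledValues[i] != null || delegateIndexes[i] != i) {
                identity = false;
                break;
            }
        }
        this.passThrough = identity;
    }

    public SimplePage assemble(SimplePage dataPage)
    {
        // Short circuit: nothing is prefilled and the column order already matches.
        if (passThrough && dataPage.blocks.length == prefilledValues.length) {
            return dataPage;
        }
        int batchSize = dataPage.positionCount;
        List<?>[] blocks = new List<?>[prefilledValues.length];
        for (int i = 0; i < prefilledValues.length; i++) {
            if (prefilledValues[i] != null) {
                // One constant value repeated batchSize times.
                blocks[i] = Collections.nCopies(batchSize, prefilledValues[i]);
            }
            else {
                blocks[i] = dataPage.blocks[delegateIndexes[i]];
            }
        }
        return new SimplePage(batchSize, blocks);
    }
}
```

Computing the pass-through flag in the constructor keeps the per-page cost of the check to a single boolean test.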

if (valueString.equalsIgnoreCase("false")) {
    return false;
}
throw new IllegalArgumentException();

Add the value here on what was illegal, so that it makes debugging easier.

Member Author

The IllegalArgumentException is caught later in the current method, and a PrestoException with more context information is then thrown.

BigDecimal decimal = new BigDecimal(valueString);
decimal = decimal.setScale(decimalType.getScale(), BigDecimal.ROUND_UNNECESSARY);
if (decimal.precision() > decimalType.getPrecision()) {
    throw new IllegalArgumentException();

Include the decimalPrecision and decimalType precision in the error message, along with the valueString.

Member Author

The same reason as above.
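
The pattern described in these two replies, where the parsers throw a bare exception and a single outer catch adds the context, can be sketched as below. All names are hypothetical: `PartitionValueParser` and `parseWithContext` are not from the PR, and `IllegalStateException` stands in for the `PrestoException` the PR throws. The sketch also catches `ArithmeticException`, since `BigDecimal.setScale` with `RoundingMode.UNNECESSARY` throws that type when rounding would be required.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.function.Supplier;

public class PartitionValueParser
{
    // Throws a bare IllegalArgumentException; context is added by the caller.
    public static boolean parseBoolean(String valueString)
    {
        if (valueString.equalsIgnoreCase("true")) {
            return true;
        }
        if (valueString.equalsIgnoreCase("false")) {
            return false;
        }
        throw new IllegalArgumentException();
    }

    public static BigDecimal parseDecimal(String valueString, int precision, int scale)
    {
        // new BigDecimal(...) throws NumberFormatException, a subclass of IllegalArgumentException.
        BigDecimal decimal = new BigDecimal(valueString).setScale(scale, RoundingMode.UNNECESSARY);
        if (decimal.precision() > precision) {
            throw new IllegalArgumentException();
        }
        return decimal;
    }

    // Single place where the offending value, type, and column are reported.
    public static <T> T parseWithContext(String columnName, String typeName, String valueString, Supplier<T> parser)
    {
        try {
            return parser.get();
        }
        catch (IllegalArgumentException | ArithmeticException e) {
            throw new IllegalStateException(String.format(
                    "Cannot parse partition value '%s' as %s for column '%s'",
                    valueString, typeName, columnName), e);
        }
    }
}
```

The trade-off is that each parser stays small, while the message built in the outer catch still carries the value string the reviewer asked for.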

@codope
Contributor

codope commented Jun 13, 2022

Thanks for reviewing @arunthirupathi
We'll address the comments shortly.

@arunthirupathi

Looks like there were some changes and this was broken; can you please fix that as well?

@7c00
Member Author

7c00 commented Jun 13, 2022

@arunthirupathi Thanks for reviewing. The comments have been addressed. I will ping you again when all CI checks have passed.

@7c00
Member Author

7c00 commented Jun 13, 2022

Hey @arunthirupathi All CI passed. Let's merge this PR?

@arunthirupathi arunthirupathi merged commit 6ecfd81 into prestodb:master Jun 13, 2022
@7c00
Member Author

7c00 commented Jun 14, 2022

Thanks @arunthirupathi @codope @pratyakshsharma @rohanpednekar for all your time and help! ❤️
