
Support DataSkipping for hudi connector #18606

Open
xiarixiaoyao wants to merge 6 commits into master from sp

Conversation

@xiarixiaoyao commented Nov 2, 2022:

What's the change?

  1. Support data skipping for the Hudi connector (a sketch of the idea follows this list).
  2. Support partition pruning via the Hudi metadata table (MDT) to reduce RPC calls to Hive.
  3. Support filter pushdown for Hudi CoW tables.
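
For context: data skipping boils down to intersecting the query predicate with the per-file [min, max] column ranges that Hudi's metadata table stores in its col_stats partition, and scanning a file only when the intersection is non-empty. A minimal illustrative sketch in Java (not the PR's actual code; the class and method names are hypothetical, and a BIGINT column is assumed):

    import com.facebook.presto.common.predicate.Domain;
    import com.facebook.presto.common.predicate.Range;
    import com.facebook.presto.common.predicate.TupleDomain;
    import com.facebook.presto.common.predicate.ValueSet;

    import static com.facebook.presto.common.type.BigintType.BIGINT;

    public final class DataSkippingSketch
    {
        private DataSkippingSketch() {}

        // Returns true when the file may contain matching rows (must be scanned);
        // false means the file is safely skippable.
        public static boolean mayContainMatches(TupleDomain<String> predicate, String column, long min, long max)
        {
            if (predicate.isNone()) {
                return false;
            }
            Domain columnPredicate = predicate.getDomains().get().get(column);
            if (columnPredicate == null) {
                return true; // no predicate on this column; cannot skip
            }
            // Domain covering the [min, max] range recorded for this file in col_stats
            Domain fileDomain = Domain.create(ValueSet.ofRanges(Range.range(BIGINT, min, true, max, true)), false);
            return !columnPredicate.intersect(fileDomain).isNone();
        }
    }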

Design: [design diagram image]

Test result: SSB benchmark; data size: 1.5 TB, 12 billion rows.
Environment: 1 CN + 3 WN; 170 GB container, 136 GB JVM heap, 95 GB max query memory, 40 vcores.
[benchmark result image]

Test plan: unit tests

== NO RELEASE NOTE ==
General Changes
* Support data skipping for the Hudi connector.
* Support partition pruning via the Hudi MDT to reduce RPC calls to Hive.
* Support filter pushdown for Hudi CoW tables.

@linux-foundation-easycla bot commented Nov 2, 2022:

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: xiarixiaoyao (d0e6d469b47f3de0bc464fc2cc6228fb6a6ab7fd)

@pratyakshsharma (Contributor):

Will have a look at it over the weekend.

@pratyakshsharma (Contributor) left a comment:

Thank you for raising this detailed PR. My other PR, based on RFC-58, was trying to do data skipping using the metadata table in a more generic way: rather than introducing the changes in specific query engines like Presto and Trino, the idea was to introduce the changes in Hudi itself and simply call them from Presto/Trino.
Anyway, I can make changes accordingly later. I am still going through the changes and have left a few comments for changes/clarification. Please have a look.

{
SchemaEvolutionContext schemaEvolutionContext = hudiSplit.getSchemaEvolutionContext();
InternalSchema internalSchema = SerDeHelper.fromJson(schemaEvolutionContext.getRequiredSchema()).orElse(InternalSchema.getEmptyInternalSchema());
if (internalSchema.isEmptySchema() || !hudiSplit.getBaseFile().isPresent()) {
Contributor:

Just trying to understand why we cannot do schema evolution if only log files are present?

Author:

The MoR table is already handled in the Hudi kernel; see apache/hudi#6989.

Contributor:

Thank you for pointing me to this PR; I will take a look and ask questions, if any.


public static SchemaEvolutionContext createSchemaEvolutionContext(HoodieTableMetaClient metaClient)
{
// no need to do schema evolution for mor table, since hudi kernel will do it.
Contributor:

Can you please explain why there is no need to do schema evolution for MoR tables? Are we taking care of schema evolution with the getSplits call for MoR in the Hudi kernel?

Author:

The MoR table is already handled in the Hudi kernel; see apache/hudi#6989.


public class SchemaEvolutionContext
{
private final String requiredSchema;
Contributor:

Can we add a one-line comment for these two variables here?

Contributor:

Also, if I understand correctly, this variable corresponds to the latest internal schema; perhaps we can update the variable name too?

Author:

Yes, thank you for the good suggestion.

TableSchemaResolver schemaUtil = new TableSchemaResolver(metaClient);
String internalSchema = schemaUtil.getTableInternalSchemaFromCommitMetadata().map(SerDeHelper::toJson).orElse("");
HoodieTimeline hoodieTimeline = metaClient.getCommitsAndCompactionTimeline().filterCompletedInstants();
String validCommits = hoodieTimeline.getInstants().map(HoodieInstant::getFileName).collect(Collectors.joining(","));
Contributor:

It would be better to rename the variable to validCommitFiles, and to update this variable in the SchemaEvolutionContext class as well.

Author:

> I guess it will be good to have some test cases covering the scenarios of different types of schema evolutions. That should clear most of my doubts as well.

Agreed. As apache/hudi#6989 has been merged into the Hudi kernel, we should add test cases covering schema evolution.

Map<String, Map<String, String>> partitionMap = HudiPartitionManager
.getPartitions(partitionColumns.stream().map(f -> f.getName()).collect(Collectors.toList()), partitions);
if (partitions.size() == 1 && partitions.get(0).isEmpty()) {
// non-non-partitioned
Contributor:

nit: non-partitioned

}).map(entry -> entry.getKey()).collect(Collectors.toList());
}

private boolean evaluatePartitionPredice(TupleDomain<String> partitionPredicate, List<HudiColumnHandle> partitionColumnHandles, String partitionPathValue, String partitionName)
Contributor:

nit: rename to evaluatePartitionPredicate?

Author:

> My other PR based on RFC-58 was trying to do data skipping using the metadata table in a more generic way [...]

Yes. Once RFC-58 is completed, we will only need to convert the Presto filter into a Hudi filter and then call the interface directly, just like Iceberg. RFC-64 is also abstracting these interfaces, but that may take a long time.

Once RFC-58/RFC-64 are completed, we can remove this logic directly.

Contributor:

Exactly, I am aligned on this.

@pratyakshsharma (Contributor) left a comment:

I think it will be good to have some test cases covering the different types of schema evolution. That should clear most of my doubts as well.

String commitTime = FSUtils.getCommitTime(baseFilePath.getName());
InternalSchema fileSchema = InternalSchemaCache.getInternalSchemaByVersionId(Long.parseUnsignedLong(commitTime), tablePath, hadoopConf, schemaEvolutionContext.getValidCommits());
log.debug(String.format(Locale.ENGLISH, "File schema from hudi base file is %s", fileSchema));
//
Contributor:

nit: you probably wanted to add some comment here?

Author:

Sorry, I forgot to add the comment.

return Pair.of(oldColumnHandle, ImmutableMap.of());
}
// prune internalSchema: columns prune
InternalSchema prunedSchema = InternalSchemaUtils.pruneInternalSchema(internalSchema, oldColumnHandle.stream().map(HudiColumnHandle::getName).collect(Collectors.toList()));
Contributor:

Just thinking out loud; please correct me if I am wrong. The oldColumnHandle list comes from the metastore and holds the actual columns present there. There can be a case where the latest commit resulted in a column deletion and the user did not run hiveSync, so the latest schema was not synced to HMS. Now if you call pruneInternalSchema, prunedSchema can end up with fewer columns than oldColumnHandle.

Author:

Good question.
At present, Hudi cannot guarantee that the metadata in Hive is consistent with the metadata of the current table; users need to ensure that themselves. This is a big problem.

In this case, it would be better to throw an exception directly and tell the user that the metadata of the Hive table is inconsistent with the schema of the Hudi table.

WDYT?

Contributor:

Yeah, this seems to be a good approach; let us do this. I would also like to hear @codope's thoughts on this.

Contributor:

We should throw an error, as the metastore is behind the Hudi table and needs to be synced again.
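
A minimal sketch of that agreed-upon check, as a review suggestion (the error code and message are illustrative; prunedSchema and oldColumnHandle are as in the snippet above):

    // Fail fast when pruning dropped columns: HMS is behind the Hudi table.
    if (prunedSchema.columns().size() != oldColumnHandle.size()) {
        throw new PrestoException(HIVE_INVALID_METADATA,
                "Hive metastore schema is inconsistent with the Hudi table schema; please re-run hive sync");
    }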

ImmutableList.Builder<HudiColumnHandle> builder = ImmutableList.builder();
for (int i = 0; i < oldColumnHandle.size(); i++) {
HiveType hiveType = constructPrestoTypeFromType(mergedSchema.columns().get(i).type());
HudiColumnHandle hudiColumnHandle = oldColumnHandle.get(i);
Contributor:

Please refer to the comment on line 87 above. mergedSchema now has the same number of columns as prunedSchema, and this can be smaller than oldColumnHandle's size, which can break the above logic. WDYT? @xiarixiaoyao

return Pair.of(builder.build(), collectTypeChangedCols(prunedSchema, mergedSchema));
}

private static Map<String, HiveType> collectTypeChangedCols(InternalSchema schema, InternalSchema oldSchema)
Contributor:

nit: Maybe change the name of oldSchema to querySchema?

Author:

Agreed.

if (!columnCoercions.isEmpty() &&
columnCoercions.containsKey(column.getName()) &&
!column.getHiveType().equals(columnCoercions.get(column.getName()))) {
coercersBuilder.add(Optional.of(HiveCoercer.createCoercer(typeManager, columnCoercions.get(column.getName()), column.getHiveType())));
Contributor:

columnCoercions has the new HiveType of the columns after schema evolution. Should we reverse the last two arguments in this call? The method signature is: static HiveCoercer createCoercer(TypeManager typeManager, HiveType fromHiveType, HiveType toHiveType). That is why I think column.getHiveType() should perhaps be the second argument here. Please correct me if I am wrong.

Contributor:

Good catch! What @pratyakshsharma is suggesting seems right.

Author (@xiarixiaoyao, Nov 16, 2022):

The signature is: static HiveCoercer createCoercer(TypeManager typeManager, HiveType fromHiveType, HiveType toHiveType).

column.getHiveType() is the column type from the Hive metastore; in theory, it is the latest schema. columnCoercions.get(column.getName()) returns the old Hive type (not the new type) from before the DDL, so the argument order here is correct.
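
Restating the direction against the signature (the comments summarize the explanation above):

    // static HiveCoercer createCoercer(TypeManager typeManager, HiveType fromHiveType, HiveType toHiveType)
    // fromHiveType <- columnCoercions.get(column.getName()) : the old type written in the file, before the DDL
    // toHiveType   <- column.getHiveType()                  : the latest type from the Hive metastore
    HiveCoercer.createCoercer(typeManager, columnCoercions.get(column.getName()), column.getHiveType());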

Contributor:

I see. Thanks for clarifying that.

@xiarixiaoyao (Author):

@pratyakshsharma Thank you for the valuable comments; I will address all of them.

@codope (Contributor) left a comment:

Super useful feature for the connector. Thanks @xiarixiaoyao for taking it up. It would be great if you could also add a few tests.

stringProperty(
HOODIE_FILESYSTEM_VIEW_SPILLABLE_DIR,
"Path on local storage to use, when file system view is held in a spillable map.",
"/tmp/",
Contributor:

Why can't this session property be part of HudiConfig as well?

Author:

Agreed.

HoodieTimeline activeTimeline = metaClient.reloadActiveTimeline();
Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
// build system view.
fileSystemView = new HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);
Contributor:

This may not necessarily be a HoodieMetadataFileSystemView. Should we use one of the FileSystemViewManager APIs to build the view based on the metadata config?

log.warn(String.format("failed to do data skipping for table: %s, fallback to all files scan", metaClient.getBasePathV2().toString()), e);
candidateFileSlices = allInputFileSlices;
}
int candidateFileSize = candidateFileSlices.entrySet().stream().map(entry -> entry.getValue().size()).reduce(0, (n1, n2) -> n1 + n2);
Contributor:

Suggested change:
- int candidateFileSize = candidateFileSlices.entrySet().stream().map(entry -> entry.getValue().size()).reduce(0, (n1, n2) -> n1 + n2);
+ int candidateFileSize = candidateFileSlices.values().stream().map(List::size).reduce(0, Integer::sum);

int candidateFileSize = candidateFileSlices.entrySet().stream().map(entry -> entry.getValue().size()).reduce(0, (n1, n2) -> n1 + n2);
int totalFiles = allInputFileSlices.entrySet().stream().map(entry -> entry.getValue().size()).reduce(0, (n1, n2) -> n1 + n2);
double skippingPercent = totalFiles == 0 ? 0.0d : (totalFiles - candidateFileSize) / (totalFiles + 0.0d);
log.info(String.format("Total files: %s; candidate files after data skipping: %s; skipping percent %s",
Contributor:

It seems these variables candidateFileSize and totalFiles are only used for logging. We can avoid churning the maps if it isn't strictly necessary.

Author:

Agreed.
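
One way to avoid the map churn when the log line is not needed, as a sketch:

    if (log.isDebugEnabled()) {
        int candidateFileSize = candidateFileSlices.values().stream().mapToInt(List::size).sum();
        int totalFiles = allInputFileSlices.values().stream().mapToInt(List::size).sum();
        double skippingPercent = totalFiles == 0 ? 0.0d : (totalFiles - candidateFileSize) / (double) totalFiles;
        // airlift Logger formats with %s-style placeholders
        log.debug("Total files: %s; candidate files after data skipping: %s; skipping percent %s",
                totalFiles, candidateFileSize, skippingPercent);
    }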

{
}

public static SchemaEvolutionContext createSchemaEvolutionContext(HoodieTableMetaClient metaClient)
Contributor:

So this method gets called just once throughout the lifecycle of a query, right?
Maybe as a follow-up we can cache it by instant and make it visible to all queries, to reduce the I/O load.

Optional<HudiColumnHandle> columnHandleOpt = partitionColumnHandles.stream().filter(f -> f.getName().equals(partitionName)).findFirst();
if (columnHandleOpt.isPresent()) {
Domain domain = getDomain(columnHandleOpt.get(), partitionPathValue);
Domain columnPredicate = partitionPredicate.getDomains().get().get(partitionName);
Contributor:

partitionPredicate.getDomains() can be an empty optional.

Reply:

But that's what L202 to L204 handles, right?
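
For reference, the evaluation in this thread reduces to a domain intersection per partition column; roughly (a sketch of the logic, not a verbatim excerpt):

    Domain domain = getDomain(columnHandleOpt.get(), partitionPathValue); // domain of the concrete partition value
    Domain columnPredicate = partitionPredicate.getDomains().get().get(partitionName);
    if (columnPredicate == null) {
        return true; // no predicate on this partition column
    }
    return !columnPredicate.intersect(domain).isNone(); // false => partition can be pruned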

@xiarixiaoyao (Author):

@pratyakshsharma @codope Thank you very much for the review. Working on adding unit tests.

@xiarixiaoyao (Author) commented Nov 17, 2022:

@pratyakshsharma @codope I have added unit tests and addressed all comments.
Could you please review again? Thanks.

// should remove this function, once we bump hudi to 0.13.0.
// old hudi-presto-bundle has not include lz4 and caffeine jar which is used by schema evolution and data-skipping.
private void shouldRemoved()
{
Author:

This is only used to pass CI, as we directly introduced the lz4 and caffeine jars into the pom file.

@xiarixiaoyao force-pushed the sp branch 5 times, most recently from 21043e6 to a1dab79 on November 17, 2022.
@nsivabalan left a comment:

Good job on the patch; the results look amazing!


return true;
}
for (String regularColumn : regularColumns) {
Domain columnPredicate = regularColumnPredicates.getDomains().get().get(regularColumn);

Comment:

Is it not handled in L204?

}
List<String> regularColumns = regularColumnPredicates.getDomains().get().entrySet().stream().map(Map.Entry::getKey).collect(Collectors.toList());
// get filter columns
List<String> encodedTargetColumnNames = regularColumns.stream().map(col -> new ColumnIndexID(col).asBase64EncodedString()).collect(Collectors.toList());
@nsivabalan (Nov 20, 2022):

Maybe in a follow-up PR we should also wire in the pruned list of partitions here, so that we do the prefix lookup only on the pruned partitions rather than all partitions. For example, if there are 1000 partitions and 5 columns with predicates, and only 10 partitions match after pruning, the existing call will fetch 5 cols * 1000 partitions = 5000 entries from the col_stats partition in the MDT to do file skipping, whereas if we wire in the pruned list of partitions, we only need to do file skipping over 50 entries.
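
A sketch of what wiring in the pruned partitions could look like, assuming the MDT col_stats key layout is a column-ID prefix followed by a partition-ID (PartitionIndexID is assumed to exist alongside ColumnIndexID, and prunedPartitions is a hypothetical list from the partition-pruning step):

    List<String> encodedPrefixes = regularColumns.stream()
            .flatMap(col -> prunedPartitions.stream()
                    .map(partition -> new ColumnIndexID(col).asBase64EncodedString()
                            + new PartitionIndexID(partition).asBase64EncodedString()))
            .collect(Collectors.toList());
    // 5 columns x 10 pruned partitions = 50 lookup keys, instead of 5 x 1000 = 5000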

Follow-up:

I guess we missed this even for the Spark implementation in Hudi; will file a JIRA on this.


Domain columnPredicate = regularColumnPredicates.getDomains().get().get(regularColumn);
Optional<HoodieMetadataColumnStats> currentColumnStats = stats.stream().filter(s -> s.getColumnName().equals(regularColumn)).findFirst();
if (!currentColumnStats.isPresent()) {
// no stats for column

Comment:

This should not happen, right? Can we throw here?

Author:

I don't think so. The index may be expired; in that case we must return true directly instead of throwing an exception.
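
In other words, the conservative fallback looks roughly like:

    if (!currentColumnStats.isPresent()) {
        // Stats for this column may be missing or expired in the MDT index;
        // keep the file (return true) rather than throw, at the cost of a wasted scan.
        return true;
    }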

@xiarixiaoyao (Author):

@pratyakshsharma @nsivabalan @codope I have addressed all comments; could you please review again? Thanks.

presto-hudi/pom.xml (thread resolved)


HoodieParquetRealtimeInputFormat.class.getName(),
MapredParquetOutputFormat.class.getName());

// spark.sql(
Contributor:

Wondering if it's time to introduce the Java write client for testing purposes, instead of simulating commits this way; we already do it that way in Trino. I am OK with this change, and we can take it up as a follow-up. But what do you think?

Author:

Yes, I will try. Thanks.

@7c00 (Member) left a comment:

Could we introduce data skipping and schema evolution in two separate PRs?


long duration = System.currentTimeMillis() - startTime;

log.info(String.format("prepare query files for table %s, spent: %d ms", metaClient.getTableConfig().getTableName(), duration));
Member:

In Presto, we tend to use XXXStats classes to track method performance; for example, com.facebook.presto.hive.metastore.thrift.HiveMetastoreApiStats.
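
A minimal sketch of that pattern, assuming an airlift TimeStat exported via JMX (the class and method names below are hypothetical):

    import com.facebook.airlift.stats.TimeStat;
    import org.weakref.jmx.Managed;
    import org.weakref.jmx.Nested;

    import static java.util.concurrent.TimeUnit.MILLISECONDS;

    public class HudiFileListingStats
    {
        private final TimeStat prepareQueryFiles = new TimeStat(MILLISECONDS);

        @Managed
        @Nested
        public TimeStat getPrepareQueryFiles()
        {
            return prepareQueryFiles;
        }
    }

    // At the call site, instead of logging the duration:
    // try (TimeStat.BlockTimer ignored = stats.getPrepareQueryFiles().time()) { ... }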

Author:

Thank you for the suggestion.

@xiarixiaoyao (Author):

@7c00 Thank you for the review; I have addressed all comments.

@xiarixiaoyao force-pushed the sp branch 2 times, most recently from bd62943 to 16d7cfd on December 7, 2022.
@pratyakshsharma (Contributor) left a comment:

Thank you for patiently addressing all comments throughout. A few minor comments remain. Also, did you raise another PR for schema evolution? If so, please mention this PR there so the two PRs are linked.

return true;
}

if (columnPredicate.intersect(domain).isNone()) {
Contributor:

Can be simplified to: return !columnPredicate.intersect(domain).isNone();

import static org.testng.Assert.assertEquals;

/**
* Integration tests for reading Delta tables.
Contributor:

nit: This class is not intended for Delta tables.

.build();

// setup file metastore

Contributor:

nit: remove extra line

});
}

private HoodieTableQueryType getQueryType(String hudiInputFormat)
Contributor:

Can we reuse this method from the HudiSplitManager class?

}

// Create the test hudi tables for dataSkipping/partition prune in HMS
registerHudiTableInHMS(HoodieTableType.COPY_ON_WRITE, HUDI_SKIPPING_TABLE, testingDataDir, Streams.concat(HUDI_META_COLUMNS.stream(), DATA_COLUMNS.stream()).collect(Collectors.toList()));
Contributor:

I believe the tests only cover the CoW table type. Let us add the MoR table type as well?

Contributor:

@xiarixiaoyao I guess the test case for the MoR type is still not added.

@codope (Contributor) commented Dec 23, 2022:

@xiarixiaoyao @7c00 @pratyakshsharma This PR looks to be in pretty good shape and is near landing now (except for the last minor comments). It has also been well tested, both by @xiarixiaoyao and @nsivabalan on separate datasets. Would really appreciate it if we can land this soon.

@nsivabalan:

Hey folks, can we try to land this in 2022? :) It would be good to close it out before the end of the year.

@xiarixiaoyao (Author):

@pratyakshsharma Sorry for the late reply; I was busy at work last month. The PR for schema evolution will be raised tomorrow. Thanks.

@pratyakshsharma (Contributor):

@xiarixiaoyao Is it good for another pass now?

@xiarixiaoyao (Author):

@pratyakshsharma Yes, it is ready for review; thanks very much.
If any new problems come up, I will fix them as soon as possible.

@pratyakshsharma (Contributor):

@xiarixiaoyao I guess the MoR test case is still missing. Can you please confirm?

@vinothchandar (Collaborator) left a comment:

A few cursory comments. Happy to do a deeper pass once we rebase this again on top of the async splits PR.

import java.util.Map;
import java.util.Optional;

public class HudiPredicates
Collaborator:

Can we unit test these classes, and the other new ones?

boolean hudiMetadataTableEnabled = isHudiMetadataTableEnabled(session);
HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder().enable(hudiMetadataTableEnabled).build();
Configuration conf = fs.getConf();
HoodieTableMetaClient metaClient = HoodieTableMetaClient
Collaborator:

Wouldn't we otherwise create a metaClient here anyway? Could we reuse it instead of creating a new one for data skipping alone?
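
For example (a sketch; basePath is a stand-in for however the table's base path is already available at this point):

    // Build the metaClient once per table/query and hand it to the data-skipping
    // path, instead of constructing a second one there.
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
            .setConf(conf)
            .setBasePath(basePath)
            .build();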

@vinothchandar (Collaborator):

@codope One question I had: does the Hudi connector now leverage the metadata/Alluxio caching that's in the Hive connector? I had a deep dive with someone at Uber, and if that works, it could end up being a faster path (a local, in-memory cache, maintained in parallel at the workers).

@vinothchandar (Collaborator):

@xiarixiaoyao any updates on this?

@xiarixiaoyao (Author):

> @xiarixiaoyao any updates on this?

I'm glad you're following this PR; I will update it in the next few days. Thanks.

requireNonNull(partitions, "partitions is null");
requireNonNull(spillableDir, "spillableDir is null");
requireNonNull(engineContext, "engineContext is null");
this.queryType = requireNonNull(queryType, "queryType is null");
Contributor:

nit: Can we remove this variable, since it is not used anywhere?

@vinothchandar (Collaborator):

@xiarixiaoyao Ping again :)

@yihua (Contributor) commented Feb 1, 2024:

Hey @xiarixiaoyao Hope you're doing well. If you're busy, we can help rebase the PR on the latest master and drive it to completion.

@tdcmeehan self-assigned this on Feb 2, 2024.
@xiarixiaoyao (Author) commented Feb 5, 2024:

> Hey @xiarixiaoyao Hope you're doing well. If you're busy, we can help rebase the PR on the latest master and drive it to completion.

@yihua I'm sorry, I am quite busy at the moment. I'm glad you're interested in this PR, and I hope you can carry it forward. Thank you very much.

@steveburnett (Contributor):

Consider revising the release note entry in the Description to follow the release note guidelines:

== RELEASE NOTES ==
Hudi Connector Changes
* Add dataSkipping for Hudi connector.
* Add partition prune by Hudi MDT to reduce RPC calls for Hive.
* Add filter push down for Hudi COW table.
