[HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208

Merged
merged 73 commits into from
Apr 6, 2022

Conversation

alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Apr 1, 2022

What is the purpose of the pull request

Adding capability to fetch Metadata Records by key prefix, so that Data Skipping can fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
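
As a rough illustration of the idea (not the PR's actual code), a key-prefix lookup over a sorted key space can seek to the first key at or after the prefix and scan forward only while the prefix still matches. The sketch below uses a plain `TreeMap` as a stand-in for a sorted HFile; the names `PrefixLookupSketch` and `lookupByPrefix`, and the `column|file` key layout, are made up for illustration and are not Hudi's API or key encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted key/value file; not Hudi's actual reader.
final class PrefixLookupSketch {

  // Returns the values of all entries whose key starts with the given prefix.
  static List<String> lookupByPrefix(NavigableMap<String, String> sorted, String prefix) {
    List<String> result = new ArrayList<>();
    // Seek to the first key >= prefix, then scan forward while the prefix still matches.
    for (Map.Entry<String, String> e : sorted.tailMap(prefix, true).entrySet()) {
      if (!e.getKey().startsWith(prefix)) {
        break; // keys are sorted, so no later key can match
      }
      result.add(e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    NavigableMap<String, String> index = new TreeMap<>();
    index.put("colA|file1", "stats(A, file1)");
    index.put("colA|file2", "stats(A, file2)");
    index.put("colB|file1", "stats(B, file1)");
    // Only the two "colA|" entries are read; "colB|" stops the scan.
    System.out.println(lookupByPrefix(index, "colA|"));
  }
}
```

The point of the sketch is that the cost is proportional to the number of matching entries rather than to the size of the whole index.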

Brief change log

  • Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS
  • Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
  • Wiring key-prefix lookup through LogRecordScanner impls
  • Cleaning up HoodieHFileReader impl

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


default Iterator<R> getRecordIterator() throws IOException {
return getRecordIterator(getSchema());
}

default Option<R> getRecordByKey(String key, Schema readerSchema) throws IOException {
default Option<R> getRecordByKey(String key, Schema readerSchema, HFileScanner hFileScanner, Option<Schema.Field> keyFieldSchema) throws IOException {
Contributor

Had to change these APIs so that each caller uses its own HFileScanner. We have removed the class-level HFileScanner instance so that concurrent readers don't step on each other, which in turn lets us remove the synchronized block within the actual read methods.
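
A minimal sketch of the design described in this comment, under the assumption that the reader's underlying data is immutable once opened: all mutable position state lives in a per-caller cursor, so lookups need no `synchronized` block. `SortedFileReader` and `Cursor` are hypothetical names, not the HBase or Hudi classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical reader: the sorted keys are immutable and shared,
// while all mutable position state lives in a per-caller Cursor.
final class SortedFileReader {
  private final String[] sortedKeys;

  SortedFileReader(List<String> keys) {
    this.sortedKeys = keys.stream().sorted().toArray(String[]::new);
  }

  // Each caller gets its own cursor, so no synchronized block is needed.
  Cursor newCursor() {
    return new Cursor();
  }

  final class Cursor {
    private int position = 0; // mutable, but never shared across callers

    // Positions the cursor at the key and returns it if present.
    Optional<String> seekTo(String key) {
      int idx = Arrays.binarySearch(sortedKeys, key);
      if (idx < 0) {
        return Optional.empty();
      }
      position = idx;
      return Optional.of(sortedKeys[position]);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    SortedFileReader reader = new SortedFileReader(Arrays.asList("k3", "k1", "k2"));
    // Two threads look up concurrently, each through its own cursor.
    Runnable lookup = () -> System.out.println(reader.newCursor().seekTo("k2").isPresent());
    Thread t1 = new Thread(lookup);
    Thread t2 = new Thread(lookup);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
  }
}
```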

@@ -239,6 +322,43 @@ private void initIfNeeded() {
return result;
}

private List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> readFromBaseAndMergeWithLogRecordsForKeyPrefixes(HoodieFileReader baseFileReader,
Contributor

@alexeykudinkin: I tried to unify this and the other method, but there are quite a few places where I had to add if/else branching, so I have left it as is. If you don't mind, can you take a stab at unifying them? I have addressed all the other feedback we discussed.

@nsivabalan nsivabalan changed the title from "[WIP] Adding capability to fetch Metadata Records by prefix" to "[HUDI-3760] Adding capability to fetch Metadata Records by prefix" Apr 4, 2022

HoodieTableMetadata tableMetadata = metadata(client);
// prefix search for column (_hoodie_record_key)
ColumnIndexID columnIndexID = new ColumnIndexID(HoodieRecord.RECORD_KEY_METADATA_FIELD);
Contributor

Can we add a test that fetches multiple key prefixes? So far, all tests fetch just one key prefix.

Contributor Author

TestColumnStatsIndex fetches multiple

(SerializableFunction<FileSlice, Iterator<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>>>) fileSlice -> {
// we are moving the readers to executors in this code path. So, reusing readers may not make sense.
Pair<HoodieFileReader, HoodieMetadataMergedLogRecordReader> readers =
openReadersIfNeeded(partitionName, fileSlice, false);
Contributor

We probably need to fix openReadersIfNeeded for the fullScan/forceFullScan config. With more partitions, we may want to enable full scan for FILES but not for other partitions. So we can't rely on HoodieMetadataConfig alone to derive the value of forceFullScan; each caller might have to set the right value when instantiating the log record reader.

Contributor Author

Good point. Let's create a ticket to not forget to follow up on it.

Contributor Author

Chatted offline: only the "files" partition can be configured to do a full scan of the logs, while "column_stats" and "bloom_filters" will have to go through point lookups.

Alexey Kudinkin and others added 24 commits April 5, 2022 18:17
Modified `HoodieHFileDataBlock` to lookup records by key prefixes;
Tidying up
…ased on key-prefix rather than the full-key;

Added `HoodieMetadataMergedLogRecordReader::getRecordsByKeyPrefixes`
…instead of reading the whole table (if available)
@alexeykudinkin
Contributor Author

CI in #5224, which is stacked on top of this PR, is green
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=7851&view=results

CI with Col Stats enabled by default passes everything except a single validation, which will be followed up separately
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=7850&view=results

// NOTE: Reader is ONLY THREAD-SAFE for {@code Scanner} operating in Positional Read ("pread")
// mode (ie created w/ "pread = true")
private final HFile.Reader reader;
// NOTE: Scanner caches read blocks, therefore it's important to re-use scanner
Contributor

Can we also add a line calling out which flows use the cached scanner and which flows use their own scanner? If I am not wrong, getAllRecords uses the cached scanner, whereas getRecordsByKeys (point lookups) and prefix-based lookups use their own scanner.

Contributor Author

Not sure I understand your point. Do you mean who is using cachedScanner within the HFileReader itself, or who in turn uses those APIs?

this.reader = reader;
// For shared scanner, which is primarily used for point-lookups, we're caching blocks
// by default, to minimize amount of traffic to the underlying storage
this.sharedScanner = getHFileScanner(reader, true);
Contributor

Do you think we should also instantiate the scanner lazily, similar to the schema?

Contributor Author

No point -- init is very lightweight

@@ -53,38 +52,16 @@

private static final Logger LOG = LogManager.getLogger(HoodieMetadataMergedLogRecordReader.class);

// Set of all record keys that are to be read in memory
private Set<String> mergeKeyFilter;
Contributor

@prashantwason: we are removing mergeKeyFilter as we don't see any usage of it. Wanted to confirm that it's OK to remove it?

return Collections.singletonList(Pair.of(key, Option.ofNullable((HoodieRecord) records.get(key))));
}

@SuppressWarnings("unchecked")
public List<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(List<String> keyPrefixes) {
// Following operations have to be atomic, otherwise concurrent
Contributor

Can we add a checkState here that forceFullScan has to be false?
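
A hedged sketch of what such a guard could look like. The `checkState` helper below is a local stand-in for whatever precondition utility the codebase uses, and `PrefixLookupGuardSketch` and `getKeysByPrefixes` are hypothetical names. The sketch also assumes that prefix lookups perform their own targeted scan and therefore should not be combined with an eager full scan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader illustrating the suggested guard; not the actual Hudi class.
final class PrefixLookupGuardSketch {
  private final boolean forceFullScan;
  private final List<String> keys;

  PrefixLookupGuardSketch(boolean forceFullScan, List<String> keys) {
    this.forceFullScan = forceFullScan;
    this.keys = keys;
  }

  // Local stand-in for a precondition utility.
  private static void checkState(boolean condition, String message) {
    if (!condition) {
      throw new IllegalStateException(message);
    }
  }

  List<String> getKeysByPrefixes(List<String> prefixes) {
    // Assumption: prefix lookups are only valid when no eager full scan was requested.
    checkState(!forceFullScan, "Key-prefix lookup requires forceFullScan to be false");
    List<String> matches = new ArrayList<>();
    for (String key : keys) {
      for (String prefix : prefixes) {
        if (key.startsWith(prefix)) {
          matches.add(key);
          break; // avoid adding the same key twice when several prefixes match
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> keys = new ArrayList<>();
    keys.add("colA|f1");
    keys.add("colB|f1");
    List<String> prefixes = new ArrayList<>();
    prefixes.add("colA|");
    System.out.println(new PrefixLookupGuardSketch(false, keys).getKeysByPrefixes(prefixes));
  }
}
```

Failing fast here surfaces a misconfigured reader at the call site instead of returning incomplete results later.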

@@ -390,8 +390,12 @@ private HoodieMetadataColumnStats combineColumnStatsMetadata(HoodieMetadataPaylo
return combineAndGetUpdateValue(oldRecord, schema, new Properties());
}

public Option<IndexedRecord> getInsertValue() throws IOException {
return getInsertValue(null, null);
Contributor

What's the purpose of this? Payload impls are not supposed to have any additional public methods, so I'm trying to understand the use case for this.

Member

Not really needed. I think we can pass null from the caller itself and mention that generating a metadata record doesn't really depend on the schema.

// NOTE: We're allowing eager full-scan of the log-files only for "files" partition.
// Other partitions (like "column_stats", "bloom_filters") will have to be fetched
// t/h point-lookups
private boolean isFullScanAllowedForPartition(String partitionName) {
Contributor

For FILES, can we honor the scan-log-files config in HoodieMetadataConfig?

Member

It's done as you suggested, and I added a task to handle all partitions: https://issues.apache.org/jira/browse/HUDI-3809
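
One way to read the outcome of this thread, as a sketch rather than the merged code: a full scan is considered only for the "files" partition, and even then only when the config enables it. The class name, the method signature, and the boolean flag standing in for the HoodieMetadataConfig setting are all hypothetical.

```java
// Hypothetical sketch of the policy discussed in this thread; names are illustrative.
final class FullScanPolicySketch {
  static final String FILES_PARTITION = "files";

  // Full scan is only considered for "files", and only when the config enables it.
  // Other partitions (e.g. "column_stats", "bloom_filters") always use point lookups.
  static boolean isFullScanAllowed(String partitionName, boolean fullScanEnabledInConfig) {
    return FILES_PARTITION.equals(partitionName) && fullScanEnabledInConfig;
  }

  public static void main(String[] args) {
    System.out.println(isFullScanAllowed("files", true));
    System.out.println(isFullScanAllowed("files", false));
    System.out.println(isFullScanAllowed("column_stats", true));
  }
}
```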

} catch (IOException ioe) {
throw new HoodieIOException("Error merging records from metadata table for " + keyPrefixes.size() + " key : ", ioe);
} finally {
close(Pair.of(partitionName, fileSlice.getFileId()));
Contributor

I guess we are missing closing the readers here. Readers obtained within this method are not added to the hashmap maintained at class level; they are a local copy, so we have to clean them up here.

Member

that's right.. i've made the change.
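
The cleanup concern raised in this thread can be sketched as follows, assuming the reader is opened inside the method and is not tracked by any class-level cache. `TrackingReader` and `readWithLocalReader` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: readers opened locally are not tracked by any class-level cache,
// so the method that opened them is responsible for closing them.
final class LocalReaderCleanupSketch {

  // Minimal stand-in for a file reader that records whether it was closed.
  static final class TrackingReader implements AutoCloseable {
    boolean closed = false;

    List<String> read(String prefix) {
      List<String> out = new ArrayList<>();
      out.add(prefix + "-record");
      return out;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Opens a reader locally and guarantees it is closed, even if read() throws.
  static List<String> readWithLocalReader(Supplier<TrackingReader> opener, String prefix) {
    try (TrackingReader reader = opener.get()) {
      return reader.read(prefix);
    }
  }

  public static void main(String[] args) {
    System.out.println(readWithLocalReader(TrackingReader::new, "colA"));
  }
}
```

Using try-with-resources ties the reader's lifetime to the method that opened it, which matches the "local copy" ownership described in the comment.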

@hudi-bot

hudi-bot commented Apr 6, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Contributor

once CI is green, we are good to land this

@alexeykudinkin
Contributor Author

Bot reports FAILURE, but build is Green

Screen Shot 2022-04-06 at 8 41 41 AM

Screen Shot 2022-04-06 at 8 41 28 AM

@nsivabalan nsivabalan merged commit 9e87d16 into apache:master Apr 6, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022
)

- Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available. Other code paths use their own HFileScanner with positional read.

Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookup through LogRecordScanner impls
- Cleaning up HoodieHFileReader impl

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
nsivabalan added a commit that referenced this pull request Jun 7, 2022
- Keys fetched from the metadata table, especially from the base file reader, are not sorted, which may result in an NPE (key-prefix search) or unnecessary seeks to the start of the HFile (full-key lookups). Fixing that in this patch. This is not an issue with log blocks, since sorting is taken care of within HoodieHFileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208
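
To see why lookup order matters, here is a simplified sketch with a forward-only cursor over sorted file keys. It is not the HFile scanner: the real failure modes named in the commit message are an NPE or extra seeks back to the start of the file, whereas this toy model simply misses keys requested out of order. The names `SortedLookupSketch`, `forwardOnlyLookup`, and `sortedLookup` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of looking up keys with a cursor that only moves forward.
final class SortedLookupSketch {

  // The cursor never moves backwards, so keys requested out of order are missed.
  static List<String> forwardOnlyLookup(List<String> sortedFileKeys, List<String> lookupKeys) {
    List<String> found = new ArrayList<>();
    int cursor = 0;
    for (String key : lookupKeys) {
      // Advance until the current file key is >= the requested key.
      while (cursor < sortedFileKeys.size() && sortedFileKeys.get(cursor).compareTo(key) < 0) {
        cursor++;
      }
      if (cursor < sortedFileKeys.size() && sortedFileKeys.get(cursor).equals(key)) {
        found.add(key);
      }
    }
    return found;
  }

  // Sorting the lookup keys first lets a single forward pass find all of them.
  static List<String> sortedLookup(List<String> sortedFileKeys, List<String> lookupKeys) {
    List<String> sorted = new ArrayList<>(lookupKeys);
    Collections.sort(sorted);
    return forwardOnlyLookup(sortedFileKeys, sorted);
  }

  public static void main(String[] args) {
    List<String> fileKeys = new ArrayList<>();
    Collections.addAll(fileKeys, "a", "b", "c", "d");
    List<String> lookups = new ArrayList<>();
    Collections.addAll(lookups, "c", "a");
    System.out.println(forwardOnlyLookup(fileKeys, lookups)); // "a" is missed
    System.out.println(sortedLookup(fileKeys, lookups));      // both keys found
  }
}
```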
yihua pushed a commit to yihua/hudi that referenced this pull request Jun 8, 2022
…e#5773)

XuQianJin-Stars pushed a commit to XuQianJin-Stars/hudi that referenced this pull request Oct 14, 2022
…e#5773)


(cherry picked from commit f85cd9b)
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
…che#37)

* [MINOR] Update alter rename command class type for pattern matching (apache#5381)

* [HUDI-3977] Flink hudi table with date type partition path throws HoodieNotSupportedException (apache#5432)

* Claim RFC 52 for Introduce Secondary Index to Improve HUDI Query Performance (apache#5441)

* [HUDI-3945] After the async compaction operation is complete, the task should exit. (apache#5391)

Co-authored-by: y00617041 <yangxuan42@huawei.com>

* [HUDI-3815] Fix docs description of metadata.compaction.delta_commits default value error (apache#5368)

Co-authored-by: pusheng.li01 <pusheng.li01@liulishuo.com>

* [HUDI-3943] Some description fixes for 0.10.1 docs (apache#5447)

* [MINOR] support different cleaning policy for flink (apache#5459)

* [HUDI-3758] Fix duplicate fileId error in MOR table type with flink bucket hash Index  (apache#5185)

* fix duplicate fileId with bucket Index
* replace to load FileGroup from FileSystemView

* [MINOR] Fix CI by ignoring SparkContext error (apache#5468)

Sets spark.driver.allowMultipleContexts = true when constructing Spark conf in UtilHelpers

* [HUDI-3862] Fix default configurations of HoodieHBaseIndexConfig (apache#5308)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-3978] Fix use of partition path field as hive partition field in flink (apache#5434)

* Fix partition path fields as hive sync partition fields error

* [MINOR] Update DOAP for release 0.11.0 (apache#5467)

* [HUDI-3211][RFC-44] Add RFC for Hudi Connector for Presto (apache#4563)

* Add RFC doc

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* Add note regarding catalog naming

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [MINOR] Update RFC status (apache#5486)

* [HUDI-4005] Update release scripts to help validation (apache#5479)

* [HUDI-4031] Avoid clustering update handling when no pending replacecommit (apache#5487)

* [HUDI-3667] Run unit tests of hudi-integ-tests in CI (apache#5078)

* [MINOR] Optimize code logic (apache#5499)

* [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (apache#4264)

* [HUDI-4042] Support truncate-partition for Spark-3.2 (apache#5506)

* [HUDI-4017] Improve spark sql coverage in CI (apache#5512)

Add GitHub actions tasks to run spark sql UTs under spark 3.1 and 3.2.

* [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode (apache#5073)

- Added a postWriteTerminationStrategy to deltastreamer continuous mode. One can enable by setting the appropriate termination strategy using DeltastreamerConfig.postWriteTerminationStrategyClass. If not, continuous mode is expected to run forever.
- Added one concrete impl for termination strategy as NoNewDataTerminationStrategy which shuts down deltastreamer if there is no new data to consume from source for N consecutive rounds.

* [HUDI-3849] AvroDeserializer supports AVRO_REBASE_MODE_IN_READ configuration (apache#5287)

* [MINOR] Fixing class not found when using flink and enable metadata table (apache#5527)

* [MINOR] fixing flaky tests in deltastreamer tests (apache#5521)

* [HUDI-4055]refactor ratelimiter to avoid stack overflow (apache#5530)

* [MINOR] Fixing close for HoodieCatalog's test (apache#5531)

* [MINOR] Fixing close for HoodieCatalog's test

* [HUDI-4053] Flaky ITTestHoodieDataSource.testStreamWriteBatchReadOpti… (apache#5526)

* [HUDI-4053] Flaky ITTestHoodieDataSource.testStreamWriteBatchReadOptimized

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-3995] Making perf optimizations for bulk insert row writer path (apache#5462)

- Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord.
- Other minor fixes around using static values instead of looking up hashmap.

* [HUDI-4044] When reading data from flink-hudi to external storage, the … (apache#5516)


Co-authored-by: aliceyyan <aliceyyan@tencent.com>

* [HUDI-4003] Try to read all the log file to parse schema (apache#5473)

* [HUDI-4038] Avoid calling `getDataSize` after every record written (apache#5497)

- getDataSize has non-trivial overhead in the current ParquetWriter impl, requiring traversal of already composed Column Groups in memory. Instead we can sample these calls to getDataSize to amortize its cost.

Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4079] Supports showing table comment for hudi with spark3 (apache#5546)

* [HUDI-4085] Fixing flakiness with parquet empty batch tests in TestHoodieDeltaStreamer (apache#5559)

* [HUDI-3963][Claim RFC number 53] Use Lock-Free Message Queue Improving Hoodie Writing Efficiency. (apache#5562)


Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (apache#5501)

- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink (apache#5528)

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink

* [MINOR] Fix a NPE for Option (apache#5461)

* [HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compact… (apache#5545)

* [HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compaction files

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink (apache#5574)

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink

* [HUDI-4072] Fix NULL schema for empty batches in deltastreamer (apache#5543)

* [HUDI-4097] add table info to jobStatus (apache#5529)


Co-authored-by: wqwl611 <wqwl611@gmail.com>

* [HUDI-3980] Suport kerberos hbase index (apache#5464)

- Add configurations in HoodieHBaseIndexConfig.java to support kerberos hbase connection.

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4001] Filter the properties should not be used when create table for Spark SQL (apache#5495)

* fix hive sync no partition table error (apache#5585)

* [HUDI-3123] consistent hashing index: basic write path (upsert/insert) (apache#4480)

 1. basic write path(insert/upsert) implementation
 2. adapt simple bucket index

* [HUDI-4098] Metadata table heartbeat for instant has expired, last heartbeat 0 (apache#5583)

* [HUDI-4103] [HUDI-4001] Filter the properties should not be used when create table for Spark SQL

* [HUDI-3654] Preparations for hudi metastore. (apache#5572)

* [HUDI-3654] Preparations for hudi metastore.

Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>

* [HUDI-4104] DeltaWriteProfile includes the pending compaction file slice when deciding small buckets (apache#5594)

* [HUDI-4101] BucketIndexPartitioner should take partition path for better dispersion (apache#5590)

* [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand (apache#5564)

* [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand

* Set hoodie.query.as.ro.table in serde properties

* [HUDI-4110] Clean the marker files for flink compaction (apache#5604)

* [MINOR] Fixing spark long running yaml for non-partitioned (apache#5607)

* [minor] Some code refactoring for LogFileComparator and Instant instantiation (apache#5600)

* [HUDI-4109] Copy the old record directly when it is chosen for merging (apache#5603)

* Clean the marker files for flink compaction (apache#5611)

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-3942] [RFC-50] Improve Timeline Server (apache#5392)

* [HUDI-4111] Bump ANTLR runtime version in Spark 3.x (apache#5606)

* Revert "[HUDI-3870] Add timeout rollback for flink online compaction (apache#5314)" (apache#5622)

This reverts commit 6f9b02d.

* [HUDI-4116] Unify clustering/compaction related procedures' output type (apache#5620)

* Unify clustering/compaction related procedures' output type

* Address review comments

* [HUDI-4114] Remove the unnecessary fs view sync for BaseWriteClient#initTable (apache#5617)

No need to #sync actively: because the table instance is freshly instantiated,
its view manager has empty view instances, and the fs view will be synced lazily when
it is requested.

* [HUDI-4119] the first read result is incorrect when Flink upsert- Kafka connector is used in HUDi (apache#5626)

* HUDI-4119 the first read result is incorrect when Flink upsert- Kafka connector is used in HUDi

Co-authored-by: aliceyyan <aliceyyan@tencent.com>

* [HUDI-4130] Remove the upgrade/downgrade for flink #initTable (apache#5642)

* [HUDI-3985] Refactor DLASyncTool to support read hoodie table as spark datasource table (apache#5532)

* [MINOR] Minor fixes to exception log and removing unwanted metrics flush in integ test (apache#5646)

* [HUDI-4122] Fix NPE caused by adding kafka nodes (apache#5632)

* [MINOR] remove unused gson test dependency (apache#5652)

* [HUDI-3858] Shade javax.servlet for Spark bundle jar (apache#5295)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4100] CTAS failed to clean up when given an illegal MANAGED table definition (apache#5588)

* [HUDI-3890] fix rat plugin issue with sql files (apache#5644)

* [HUDI-4051] Allow nested field as primary key and preCombineField in spark sql (apache#5517)

* [HUDI-4051] Allow nested field as preCombineField in spark sql

* relax validation for primary key

* [HUDI-4129] Initializes a new fs view for WriteProfile#reload (apache#5640)

Co-authored-by: zhangyuang <zhangyuang@corp.netease.com>

* [HUDI-4142] Claim RFC-54 for new table APIs (apache#5665)

* [HUDI-3933] Add UT cases to cover different key gen (apache#5638)

* [MINOR] Removing redundant semicolons and line breaks (apache#5662)

* [HUDI-4134] Fix Method naming consistency issues in FSUtils (apache#5655)

* [HUDI-4084] Add support to test async table services with integ test suite framework (apache#5557)

* Add support to test async table services with integ test suite framework

* Make await time for validation configurable

* [HUDI-4138] Fix the concurrency modification of hoodie table config for flink (apache#5660)

* Remove the metadata cleaning strategy for flink, that means the multi-modal index may be affected
* Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary
* Remove the modification of read code path in HoodieTableConfig

* [HUDI-2473] Fixing compaction write operation in commit metadata (apache#5203)

* [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (apache#5669)

* [HUDI-4135] remove netty and netty-all (apache#5663)

* [HUDI-2207] Support independent flink hudi clustering function

* [HUDI-4132] Fixing determining target table schema for delta sync with empty batch (apache#5648)

* [MINOR] Fix a potential NPE and some finer points of hudi cli (apache#5656)

* [HUDI-4146] Claim RFC-55 for Improve Hive/Meta sync class design and hierachies (apache#5682)

* [HUDI-3193] Decouple hudi-aws from hudi-client-common (apache#5666)

Move HoodieMetricsCloudWatchConfig to hudi-client-common

* [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (part2) (apache#5676)

* [HUDI-4040] Bulk insert Support CustomColumnsSortPartitioner with Row (apache#5502)

* Along the lines of RDDCustomColumnsSortPartitioner but for Row

* [HUDI-4023] Decouple hudi-spark from hudi-utilities-slim-bundle (apache#5641)

* [HUDI-4124] Add valid check in Spark Datasource configs (apache#5637)



Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>

* [HUDI-3963][RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency (apache#5567)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4162] Fixed some constant mapping issues. (apache#5700)

Co-authored-by: y00617041 <yangxuan42@huawei.com>

* [HUDI-4161] Make sure partition values are taken from partition path (apache#5699)

* [MINOR] Fix the issue when handling conf hoodie.datasource.write.operation=bulk_insert in sql mode (apache#5679)



Co-authored-by: Rex An <bonean131@gmail.com>

* [HUDI-4151] flink split_reader supports rocksdb (apache#5675)

* [HUDI-4151] flink split_reader supports rocksdb

* [HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (apache#5697)

* [MINOR] Fix Hive and meta sync config for sql statement (apache#5316)

* [HUDI-4166] Added SimpleClient plugin for integ test (apache#5710)

* [HUDI-3551] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme (apache#4952)

* [HUDI-3551] Fix testStorageSchemes for oci storage (apache#5711)

* [HUDI-4086] Use CustomizedThreadFactory in async compaction and clustering (apache#5563)

Co-authored-by: 苏承祥 <sucx@tuya.com>

* [HUDI-4163] Catch general exception instead of IOException while fetching rollback plan during rollback (apache#5703)

If the avro file is corrupted, an InvalidAvroMagicException throws.

* [HUDI-4149] Drop-Table fails when underlying table directory is broken (apache#5672)

* [HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (apache#5597)

* added --sync-tool-classes config option in multitable delta streamer

* added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context

* [HUDI-4174] Add hive conf dir option for flink sink (apache#5725)

* [HUDI-4011] Add hudi-aws-bundle (apache#5674)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-3670] free temp views in sql transformers (apache#5080)

* [HUDI-4167] Remove the timeline refresh with initializing hoodie table (apache#5716)

The timeline refresh on table initialization invokes the fs view #sync, which has two actions now:

1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata
2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view

But looking at the construction, the meta client is freshly instantiated, so the timeline is already the latest;
the table is also freshly constructed, so the fs view has no local state. That means the #sync is entirely unnecessary.

In this patch, the metadata lifecycle and the data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata
is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view:

1. if the fs view is local, the visibility is based on the client table metadata client's latest commit
2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline

From the client's perspective, there is no need to care about the refresh action anymore, regardless of whether the metadata table is enabled.
That makes the client logic clearer and less error-prone.

Removing the timeline refresh has another benefit: it avoids unnecessary #refresh calls on the remote fs view. If all the clients send requests to #sync the
remote fs view, the server would encounter conflicts and the clients would get a response error.

* [HUDI-4179] Cluster with sort cloumns invalid (apache#5739)

* [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (apache#5743)

* [HUDI-4187] Fix partition order in aws glue sync (apache#5731)

* [HUDI-4168] Add Call Procedure for marker deletion (apache#5738)

* Add Call Procedure for marker deletion

* [HUDI-4190] Include hbase-protocol for shading in the bundles (apache#5750)

* [HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (apache#5755)

SeekTo top cells avoid NullPointerException

* [HUDI-4188] Fix flaky ITTestDataSTreamWrite.testWriteCopyOnWrite (apache#5749)

* [HUDI-4195] Bulk insert should use right keygen for non-partitioned table (apache#5759)

* [HUDI-4101] When BucketIndexPartitioner take partition path for dispersion may cause the fileID of the task to not be loaded correctly (apache#5763)

Co-authored-by: john.wick <john.wick@vipshop.com>

* [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (apache#5733)

As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large dataset (~1Tb) with pretty fat commits, where Hudi's commit metadata could reach 100s of Mbs.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy;

* [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (apache#5664)

Bulk insert row writer code path had a gap wrt hive style partitioning and default partition when virtual keys are enabled with SimpleKeyGen.  This patch fixes the issue.

* [HUDI-4197] Fix Async indexer to support building FILES partition (apache#5766)

- When the async indexer is invoked only with the "FILES" partition, it fails. Fixing it to work with the async indexer. Also, if the metadata table itself is not initialized and someone is looking to build indexes via AsyncIndexer, they are expected to index the "FILES" partition first, followed by other partitions. In general, we have a limitation of building only one index at a time with AsyncIndexer, and hence have added guards to ensure these conditions are met.

* [HUDI-4171] Fixing Non partitioned with virtual keys in read path (apache#5747)

- When Non partitioned key gen is used with virtual keys, read path could break since partition path may not exist.

* [MINOR] Mark AWSGlueCatalogSyncClient experimental (apache#5775)

* [MINOR][RFC-53] Fix typos (apache#5764)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4200] Fixing sorting of keys fetched from metadata table (apache#5773)

- Keys fetched from the metadata table, especially via the base file reader, are not sorted, which may result in an NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). This patch fixes that. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- Commit where the sorting was mistakenly reverted [HUDI-3760] Adding capability to fetch Metadata Records by prefix  apache#5208
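
To illustrate why unsorted keys cause extra seeks, here is a minimal sketch. It is not Hudi's reader code: a sorted in-memory list stands in for an HFile, an integer cursor stands in for a forward-only scanner, and the names `ForwardOnlyLookup` and `countRewinds` are hypothetical. It counts how often the cursor has to return to the start of the store for a given key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ForwardOnlyLookup {
    // Counts how many times a forward-only cursor must restart from the
    // beginning of the sorted store while looking up the given keys in order.
    static int countRewinds(List<String> sortedStore, List<String> keys) {
        int cursor = 0;   // index of the next entry the scanner would read
        int rewinds = 0;
        for (String key : keys) {
            // The last consumed entry is at or past the key: a forward-only
            // scanner cannot go back, so it restarts from the beginning.
            if (cursor > 0 && sortedStore.get(cursor - 1).compareTo(key) >= 0) {
                cursor = 0;
                rewinds++;
            }
            // Move forward until the current entry is no longer below the key.
            while (cursor < sortedStore.size()
                    && sortedStore.get(cursor).compareTo(key) < 0) {
                cursor++;
            }
            // Consume the entry if it matches the key.
            if (cursor < sortedStore.size() && sortedStore.get(cursor).equals(key)) {
                cursor++;
            }
        }
        return rewinds;
    }

    public static void main(String[] args) {
        List<String> store = Arrays.asList("a", "b", "c", "d");
        List<String> keys = new ArrayList<>(Arrays.asList("c", "a", "d", "b"));
        System.out.println("rewinds, unsorted keys: " + countRewinds(store, keys));
        Collections.sort(keys); // sorting the keys first avoids every restart
        System.out.println("rewinds, sorted keys:   " + countRewinds(store, keys));
    }
}
```

With distinct keys sorted up front, the cursor only ever moves forward, which is the property the fix restores for base file lookups.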

* [HUDI-4198] Fix hive config for AWSGlueClientFactory (apache#5768)

* HiveConf needs to load fs conf to allow instantiation via AWSGlueClientFactory

* Resolve metastore uri config before loading fs conf

* Skip hiveql due to CI issue

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (apache#5737)

There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment; instead, we're using a custom Resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.

* [MINOR][DOCS] Update the README.md file in hudi-examples (apache#5803)

* [MINOR] FlinkStateBackendConverter add more  exception message (apache#5809)

* [MINOR] FlinkStateBackendConverter add more  exception message

* [HUDI-4213] Infer keygen clazz for Spark SQL (apache#5815)

* [HUDI-4139]improvement for flink write operator name to identify tables easily (apache#5744)


Co-authored-by: yanenze <yanenze@keytop.com.cn>

* [HUDI-3889] Do not validate table config if save mode is set to Overwrite (apache#5619)


Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata (apache#5829)

* [HUDI-4223] Fix NullPointerException from getLogRecordScanner when reading metadata table (apache#5840)

When explicitly specifying the metadata table path for reading in spark, the "hoodie.metadata.enable" is overwritten to true for proper read behavior.

* [HUDI-4205] Fix NullPointerException in HFile reader creation (apache#5841)

Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers

* [HUDI-4224] Fix CI issues (apache#5842)

- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [MINOR]  fix AvroSchemaConverter duplicate branch in 'switch' (apache#5813)

* Strip extra spaces when creating new configuration (apache#5849)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3682] testReaderFilterRowKeys fails in TestHoodieOrcReaderWriter (apache#5790)

TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in the current UT does not populate the meta field and the schema does not contain meta fields.

This fix writes data with schema which contains meta fields and calls writeAvroWithMetadata for writing.

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-3863] Add UT for drop partition column in deltastreamer testsuite (apache#5727)

* [HUDI-4006] failOnDataLoss on delta-streamer kafka sources (apache#5718)

- Add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss
- When failOnDataLoss=false (current behaviour, the default), log a warning instead of seeking to earliest silently
- When failOnDataLoss is set, fail explicitly
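
The behaviour described above can be sketched roughly as follows. This is an illustration only, not the DeltaStreamer source code: the class `FailOnDataLossSketch`, the method `resolveStartOffset`, and its offset parameters are hypothetical. Only the config key is taken from the commit message.

```java
import java.util.Properties;

final class FailOnDataLossSketch {
    // Config key as named in the commit message.
    static final String KEY = "hoodie.deltastreamer.source.kafka.enable.failOnDataLoss";

    // Hypothetical helper: picks the offset to resume from when the
    // checkpointed offset may already have been expired by the broker.
    static long resolveStartOffset(Properties props, long checkpointed, long earliestAvailable) {
        // Defaults to false, matching the "current behaviour" described above.
        boolean failOnDataLoss = Boolean.parseBoolean(props.getProperty(KEY, "false"));
        if (checkpointed >= earliestAvailable) {
            return checkpointed; // nothing was lost, resume where we stopped
        }
        String message = "Offsets " + checkpointed + ".." + (earliestAvailable - 1)
                + " are no longer available";
        if (failOnDataLoss) {
            throw new IllegalStateException(message); // fail explicitly
        }
        System.err.println("WARN: " + message + ", seeking to earliest"); // warn, then continue
        return earliestAvailable;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolveStartOffset(props, 5L, 10L));
    }
}
```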

* [HUDI-4207] HoodieFlinkWriteClient.getOrCreateWriteHandle throws an e… (apache#5788)

Adding more logs to assist in debugging with HoodieFlinkWriteClient.getOrCreateWriteHandle throwing exception

* [MINOR] Fix typo of DisruptorExecutor in RFC 53 (apache#5860)

* [minor] Following HUDI-4207, remove the new wrapper #init method (apache#5865)

* [HUDI-4255] Make the flink merge and replace handle intermediate file visible (apache#5866)

* [HUDI-3499] Add Call Procedure for show rollbacks (apache#5848)

* Add Call Procedure for show rollbacks

* fix

* add ut for show_rollback_detail and exception handle

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4218] [HUDI-4218] Expose the real exception information when an exception occurs in the tableExists method (apache#5827)

* [HUDI-4217] improve repeat init object in ExpressionPayload (apache#5825)

* [HUDI-4214] improve repeat init write schema in ExpressionPayload (apache#5820)

* [HUDI-4214] improve repeat init write schema in ExpressionPayload

* [HUDI-4265] Deprecate useless targetTableName parameter in HoodieMultiTableDeltaStreamer (apache#5883)

* [HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (apache#5761)

* Support Create/Drop/Show/Refresh Index Syntax for Spark SQL

* [HUDI-3507] Support export command based on Call Produce Command (apache#5901)

* [HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (apache#5894)

* [MINOR] Add "spillable_map_path" in FlinkCompactionConfig. To avoid the disk space of "/tmp" full when compacting offline. (apache#5905)

* [HUDI-4277] support flink table source with computed column (apache#5897)

Co-authored-by: chenshizhi <chenshizhi@bilibili.com>

* fix remove redundant Variable (apache#5806)

* [HUDI-4259]  Flink create avro schema not conformance to standards (apache#5878)

* flink create avro schema not conformance to standards

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-4258] Fix when HoodieTable removes data file before the end of Flink job (apache#5876)

* [HUDI-4258] Fix when HoodieTable removes data file before the end of Flink job

* [MINOR] Update DOAP with 0.11.1 Release (apache#5908)

* [HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (apache#5723)

* [HUDI-4251] Fix the problem that the command 'commits sync' description does not match. (apache#5881)

* [HUDI-4177] Fix hudi-cli rollback with rollbackUsingMarkers method call (apache#5734)

* Fix hudi-cli rollback with rollbackUsingMarkers method call
* Add test for hudi-cli rollbackUsingMarkers

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4270] Bootstrap op data loading missing (apache#5888)

* [HUDI-3475] Initialize hudi table management module.

* udate

* Revert master (apache#5925)

* Revert "udate"

This reverts commit 092e35c.

* Revert "[HUDI-3475] Initialize hudi table management module."

This reverts commit 4640a3b.

* [HUDI-4279] Strength the remote fs view lagging check when latest commit refresh is enabled (apache#5917)

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [minor] following 4270, add unit tests for the keys lost case (apache#5918)

* [HUDI-3508] Add call procedure for FileSystemViewCommand (apache#5929)

* [HUDI-3508] Add call procedure for FileSystemView

* minor

Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>

* [HUDI-4299] Fix problem about hudi-example-java run failed on idea. (apache#5936)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (apache#5941)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

* Separate out incremental sync fsview test with clustering

* [HUDI-3509] Add call procedure for HoodieLogFileCommand (apache#5949)

Co-authored-by: zhanshaoxiong <jiimmyzhan@tencent.com>

* [HUDI-4273] Support inline schedule clustering for Flink stream (apache#5890)

* [HUDI-4273] Support inline schedule clustering for Flink stream

* delete deprecated clustering plan strategy and add clustering ITTest

* [HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (apache#5874)

* [HUDI-4260] Change KEYGEN_CLASS_NAME without default value (apache#5877)

* Change KEYGEN_CLASS_NAME without default value

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-3512] Add call procedure for StatsCommand (apache#5955)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)

* Revert "[TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)" (apache#5971)

This reverts commit e8fbd4d.

* [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (apache#5966)

* Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)

* [HUDI-4296] Fix the bug that TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky (apache#5973)

* [HUDI-3502] Support hdfs parquet import command based on Call Produce Command (apache#5956)

* [MINOR] Remove -T option from CI build (apache#5972)

* [HUDI-5246] Bumping mysql connector version due to security vulnerability (apache#5851)

* [HUDI-4309] Spark3.2 custom parser should not throw exception (apache#5947)

* [HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (apache#5959)

* [HUDI-4315] Do not throw exception in BaseSpark3Adapter#toTableIdentifier (apache#5957)

* [HUDI-3504] Support bootstrap command based on Call Produce Command (apache#5977)

* [HUDI-4311] Fix Flink lose data on some rollback scene (apache#5950)

* [HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (apache#5930)

* [HUDI-3506] Add call procedure for CommitsCommand (apache#5974)

* [HUDI-3506] Add call procedure for CommitsCommand

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4325] fix spark sql procedure cause ParseException with semicolon (apache#5982)

* [HUDI-4325] fix spark sql procedure cause ParseException with semicolon

* [HUDI-4333] fix HoodieFileIndex's listFiles method log print skipping percent NaN (apache#5990)

* [HUDI-4332] The current instant may be wrong under some extreme conditions in AppendWriteFunction. (apache#5988)

* [HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (apache#5970)

Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller
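
The general pattern of overriding a property only when the caller has not set it can be sketched as below. This is an assumption-laden illustration rather than the writer's real code: the class, the method, and the key string `example.parquet.write-legacy-format` are hypothetical placeholders, not Hudi's configuration names.

```java
import java.util.HashMap;
import java.util.Map;

final class RespectExplicitSetting {
    // Hypothetical placeholder key, used for illustration only.
    static final String LEGACY_FORMAT_KEY = "example.parquet.write-legacy-format";

    // Returns a copy of the caller's options in which the inferred value is
    // applied only if the caller did not specify the key explicitly.
    static Map<String, String> withInferredDefault(Map<String, String> callerOptions,
                                                   boolean inferredValue) {
        Map<String, String> merged = new HashMap<>(callerOptions);
        merged.putIfAbsent(LEGACY_FORMAT_KEY, String.valueOf(inferredValue));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> explicit = new HashMap<>();
        explicit.put(LEGACY_FORMAT_KEY, "false");
        // The caller's explicit "false" survives even though "true" was inferred.
        System.out.println(withInferredDefault(explicit, true).get(LEGACY_FORMAT_KEY));
    }
}
```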

* [HUDI-1176] Upgrade hudi to log4j2 (apache#5366)

* Move to log4j2

cr: https://code.amazon.com/reviews/CR-71010705

* Upgrade unit tests to log4j2

* update exclusion

Co-authored-by: Brandon Scheller <bschelle@amazon.com>

* [HUDI-4334] close SparkRDDWriteClient after usage in Create/Delete/RollbackSavepointsProcedure (apache#5994)

* [HUDI-1575] Claim RFC-56: Early Conflict Detection For Multi-writer (apache#6002)

Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>

* [MINOR] Make CLI 'commit rollback' using rollbackUsingMarkers false as default (apache#5174)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4331] Allow loading external config file from class loader (apache#5987)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4336] Fix records overwritten bug with binary primary key (apache#5996)

* [MINOR] Following apache#2070, Fix BindException when running tests on shared machines. (apache#5951)

* [HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED (apache#5999)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria… (apache#5907)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer

* add ut

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
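
For context on the `ByteBuffer#rewind` fix, a small stand-alone sketch (not the AvroDeserializer code; `RewindAfterGet` and `readAll` are hypothetical names) shows that a relative bulk `get` moves the buffer's position to its limit, so a second read of the same buffer sees nothing unless the position is reset.

```java
import java.nio.ByteBuffer;

final class RewindAfterGet {
    // Copies the remaining bytes out of the buffer, then resets the position
    // to 0 so the same buffer can be read again by a later caller.
    // Assumes the buffer's position was 0 on entry; rewind() always goes to 0.
    static byte[] readAll(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);  // relative bulk get: advances position to the limit
        buffer.rewind();    // without this, remaining() would now be 0
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.wrap(new byte[] {1, 2, 3});
        System.out.println(readAll(buffer).length);
        System.out.println(readAll(buffer).length); // same length thanks to rewind()
    }
}
```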

* [HUDI-3984] Remove mandatory check of partition path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add HDFS-specific logic for creating an immutable file: first create file.tmp, then rename it to the final file name
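
The write-then-rename idea can be sketched on a local filesystem with `java.nio.file`. This is an analogy under stated assumptions, not the Hadoop `FileSystem` code from the patch: `TmpThenRename` and `writeImmutable` are hypothetical names, and the target file is assumed not to exist yet.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TmpThenRename {
    // Writes the full content to "<name>.tmp" first and only then renames it
    // to the final name, so readers never observe a half-written target file.
    static void writeImmutable(Path target, byte[] content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, content);
        // ATOMIC_MOVE asks the filesystem to perform the rename as one step.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tmp-then-rename");
        Path target = dir.resolve("commit.meta");
        writeImmutable(target, "completed".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

A downstream reader polling for the target file then sees either no file or the complete one, which is the failure mode (empty or partial HoodieCommitMetaData) the commit title describes.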

* [HUDI-3953]Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <voonhou.su@shopee.com>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (apache#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provider UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] following 4152, refactor the clazz about plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* fix for updateTableParameters which is not excluding partition columns and updateTableProperties boolean check

* Fix - serde parameters getting overridden on table property update

* removing stale syncConfig

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Remove a few files that were removed in upstream master

* Fix build issues

Co-authored-by: KnightChess <981159963@qq.com>
Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com>
Co-authored-by: huberylee <shibei.lh@foxmail.com>
Co-authored-by: watermelon12138 <49849410+watermelon12138@users.noreply.github.com>
Co-authored-by: y00617041 <yangxuan42@huawei.com>
Co-authored-by: Ibson <pushengli@163.com>
Co-authored-by: pusheng.li01 <pusheng.li01@liulishuo.com>
Co-authored-by: LiChuang <64473732+CodeCooker17@users.noreply.github.com>
Co-authored-by: Gary Li <yanjia.gary.li@gmail.com>
Co-authored-by: 吴祥平 <408317717@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: xicm <36392121+xicm@users.noreply.github.com>
Co-authored-by: xicm <xicm@asiainfo.com>
Co-authored-by: Wangyh <763941163@qq.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
Co-authored-by: Todd Gao <todd.gao.2013@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: qianchutao <72595723+qianchutao@users.noreply.github.com>
Co-authored-by: guanziyue <30882822+guanziyue@users.noreply.github.com>
Co-authored-by: Jin Xing <jinxing.corey@gmail.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: BruceLin <brucekellan@gmail.com>
Co-authored-by: ForwardXu <forwardxu315@gmail.com>
Co-authored-by: aliceyyan <104287562+aliceyyan@users.noreply.github.com>
Co-authored-by: aliceyyan <aliceyyan@tencent.com>
Co-authored-by: Lanyuanxiaoyao <lanyuanxiaoyao@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com>
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Bo Cui <cuibo0108@163.com>
Co-authored-by: Xingcan Cui <xcui@wealthsimple.com>
Co-authored-by: wqwl611 <67826098+wqwl611@users.noreply.github.com>
Co-authored-by: wqwl611 <wqwl611@gmail.com>
Co-authored-by: 董可伦 <dongkelun01@inspur.com>
Co-authored-by: 陈浩 <bettermouse94@gmail.com>
Co-authored-by: Yuwei XIAO <ywxiaozero@gmail.com>
Co-authored-by: Shawy Geng <gengxiaoyu1996@gmail.com>
Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>
Co-authored-by: luokey <854194341@qq.com>
Co-authored-by: Zhaojing Yu <yuzhaojing@bytedance.com>
Co-authored-by: wangxianghu <wangxianghu@apache.org>
Co-authored-by: uday08bce <uday08bce@gmail.com>
Co-authored-by: YuangZhang <z_yuang@foxmail.com>
Co-authored-by: zhangyuang <zhangyuang@corp.netease.com>
Co-authored-by: felixYyu <felix2003@live.cn>
Co-authored-by: Heap <35054152+h1ap@users.noreply.github.com>
Co-authored-by: liujinhui <965147871@qq.com>
Co-authored-by: luoyajun <luoyajun1010@gmail.com>
Co-authored-by: 冯健 <fengjian428@gmail.com>
Co-authored-by: RexAn <anh131@126.com>
Co-authored-by: komao <masterwangzx@gmail.com>
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
Co-authored-by: Rex An <bonean131@gmail.com>
Co-authored-by: Carter Shanklin <cartershanklin@users.noreply.github.com>
Co-authored-by: 苏承祥 <scx_white@aliyun.com>
Co-authored-by: 苏承祥 <sucx@tuya.com>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <kumudkumartirupati@users.noreply.github.com>
Co-authored-by: Qi Ji <qjqqyy@users.noreply.github.com>
Co-authored-by: leesf <490081539@qq.com>
Co-authored-by: Nicolas Paris <nicolas.paris@riseup.net>
Co-authored-by: Saisai Shao <sai.sai.shao@gmail.com>
Co-authored-by: marchpure <marchpure@126.com>
Co-authored-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: john.wick <john.wick@vipshop.com>
Co-authored-by: liuzhuang2017 <95120044+liuzhuang2017@users.noreply.github.com>
Co-authored-by: sandyfog <154525105@qq.com>
Co-authored-by: yanenze <34880077+yanenze@users.noreply.github.com>
Co-authored-by: yanenze <yanenze@keytop.com.cn>
Co-authored-by: superche <73096722+hechao-ustc@users.noreply.github.com>
Co-authored-by: superche <superche@tencent.com>
Co-authored-by: 5herhom <35916131+5herhom@users.noreply.github.com>
Co-authored-by: Shizhi Chen <107476116+chenshzh@users.noreply.github.com>
Co-authored-by: chenshizhi <chenshizhi@bilibili.com>
Co-authored-by: Alexander Trushev <42293632+trushev@users.noreply.github.com>
Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com>
Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com>
Co-authored-by: Shawn Chang <yxchang@amazon.com>
Co-authored-by: jiz <31836510+microbearz@users.noreply.github.com>
Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>
Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>
Co-authored-by: xiarixiaoyao <mengtao0326@qq.com>
Co-authored-by: bschell <bdscheller@gmail.com>
Co-authored-by: Brandon Scheller <bschelle@amazon.com>
Co-authored-by: Teng <teng_huo@outlook.com>
Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>
Co-authored-by: wenningd <wenningding95@gmail.com>
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: miomiocat <284487410@qq.com>
Co-authored-by: JerryYue-M <272614347@qq.com>
Co-authored-by: jerryyue <jerryyue@didiglobal.com>
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: voonhous <voonhousu@gmail.com>
Co-authored-by: voonhou.su <voonhou.su@shopee.com>
Co-authored-by: shenjiayu17 <54424149+shenjiayu17@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Luning (Lucas) Wang <rsl4@foxmail.com>
Co-authored-by: Yann Byron <biyan900116@gmail.com>
Co-authored-by: Tim Brown <tim.brown126@gmail.com>
Co-authored-by: simonsssu <barley0806@gmail.com>
Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>