[HUDI-3963][RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency #5567
Conversation
Related PR is #5416.
rfc/rfc-53/rfc-53.md
Outdated
However, this lock model may become the bottleneck of application throughput when the data volume is much larger. What's worse, even if we increase the number of executors, it is still difficult to improve the throughput.
In other words, users may encounter throughput bottlenecks when writing data into Hudi in some scenarios, and increasing the physical hardware scale and tuning parameters cannot solve the problem.
`in some scenarios`: can we describe the scenarios in detail, such as the scenarios your company met?
Yeap, added more details in this RFC. Thanks a lot for your review :)
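For context on the lock model in question, here is a minimal self-contained sketch (an assumed illustration, not Hudi code) of the bounded `LinkedBlockingQueue` producer/consumer pattern the existing executor is built around; every `put`/`take` acquires the queue's internal locks, which is where the contention comes from as data volume grows.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a bounded queue pipeline where producer and consumer
// synchronize through the queue's internal locks on every put/take.
public class BoundedQueueSketch {
  public static void main(String[] args) throws InterruptedException {
    LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < 10_000; i++) {
          queue.put("record-" + i); // blocks on a lock/condition when the queue is full
        }
        queue.put("EOF"); // poison pill to stop the consumer
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();
    String record;
    while (!(record = queue.take()).equals("EOF")) { // blocks when the queue is empty
      // consume: e.g. hand the record to a write handle
    }
    producer.join();
  }
}
```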
rfc/rfc-53/rfc-53.md
Outdated
Default value is `BOUNDED_IN_MEMORY_EXECUTOR`, which uses a bounded in-memory queue (`LinkedBlockingQueue`). Users could also use `DISRUPTOR_EXECUTOR`, which uses the Disruptor as a lock-free message queue to gain better writing performance. Note that `DISRUPTOR_EXECUTOR` is still an experimental feature.
- `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer; must be a power of 2.
is there a recommended value for the buffer size?
Actually, the default/recommended value is 1024, and it has been added to the RFC.
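To make these knobs concrete, here is a hedged sketch of enabling the executor from a Spark write; `hoodie.write.buffer.size` and `hoodie.write.wait.strategy` come straight from the RFC text above, while the executor-type key name and the exact strategy value format are assumptions for illustration only.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class DisruptorWriteExample {
  // Sketch: write a DataFrame through the experimental Disruptor executor.
  public static void write(Dataset<Row> df, String basePath) {
    df.write()
      .format("hudi")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      // Assumed key name for picking the executor; not spelled out in the RFC text above.
      .option("hoodie.write.executor.type", "DISRUPTOR_EXECUTOR")
      .option("hoodie.write.buffer.size", "1024") // ring buffer size, must be a power of 2
      // One of the four strategies discussed below; exact value format assumed.
      .option("hoodie.write.wait.strategy", "BLOCKING_WAIT_STRATEGY")
      .mode(SaveMode.Append)
      .save(basePath);
  }
}
```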
rfc/rfc-53/rfc-53.md
Outdated
Users could also use `DISRUPTOR_EXECUTOR`, which uses the Disruptor as a lock-free message queue to gain better writing performance. Note that `DISRUPTOR_EXECUTOR` is still an experimental feature.
- `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer; must be a power of 2.
- `hoodie.write.wait.strategy`: Strategy employed for making the DisruptorExecutor wait for a cursor.
would you please describe the strategy in more detail and how it works?
Yeap, there are four kinds of strategies here:
- BlockingWaitStrategy
- SleepingWaitStrategy
- YieldingWaitStrategy
- BusySpinWaitStrategy
Also added the implementation details and suitable use cases.
Actually, these strategies are built into Disruptor, and we want to expose this parameter to users through `hoodie.write.wait.strategy`.
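For reference, the four strategies map onto the built-in LMAX Disruptor classes. Below is a hedged sketch of how such a config string could be resolved: the string names are illustrative, while the `com.lmax.disruptor` classes are the real built-in ones, roughly ordered from lowest CPU cost to lowest latency.

```java
import com.lmax.disruptor.BlockingWaitStrategy;
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.SleepingWaitStrategy;
import com.lmax.disruptor.WaitStrategy;
import com.lmax.disruptor.YieldingWaitStrategy;

public class WaitStrategyFactory {
  public static WaitStrategy build(String name) {
    switch (name) {
      case "BLOCKING_WAIT_STRATEGY":
        return new BlockingWaitStrategy();  // lock + condition variable; lowest CPU use, highest latency
      case "SLEEPING_WAIT_STRATEGY":
        return new SleepingWaitStrategy();  // spins, then yields, then parks briefly; a balanced default
      case "YIELDING_WAIT_STRATEGY":
        return new YieldingWaitStrategy();  // spins then Thread.yield(); low latency at higher CPU cost
      case "BUSY_SPIN_WAIT_STRATEGY":
        return new BusySpinWaitStrategy();  // pure busy spin; lowest latency, pins a core
      default:
        throw new IllegalArgumentException("Unknown wait strategy: " + name);
    }
  }
}
```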
rfc/rfc-53/rfc-53.md
Outdated
- `hoodie.write.wait.strategy`: Strategy employed for making the DisruptorExecutor wait for a cursor.

4. Limitation
For now, this disruptor executor only supports the Spark insert and Spark bulk insert operations. Other operations like Spark upsert are still ongoing.
Are other engines such as Flink still ongoing?
Yes, Flink-related writing is still ongoing, as this feature is still experimental.
But it will not take much effort to migrate once the Spark ingestion is ready :)
AFAIK, Flink does not use Spark's producer/consumer model to write the data. So I am wondering what the rollout plan is for the Flink engine.
By the way, have you tried removing the producer/consumer at all? The writing is actually a blocking single-producer/single-consumer case. I guess it could perform better in some cases, as we save all the overhead of the message queue.
Hi @YuweiXiao Thanks a lot for your attention.
It seems that Flink uses something like `FlinkLazyInsertIterable`, which holds a `BoundedInMemoryExecutor` to do the ingestion work, same as Spark. So maybe we can provide this new `DisruptorExecutor` as an alternative to `BoundedInMemoryExecutor`.
> By the way, have you tried removing the producer/consumer at all? The writing is actually a blocking single-producer/single-consumer case. I guess it could perform better in some cases, as we save all the overhead of the message queue.

Nice catch! That may apply when using bulk_insert with a single producer and a single consumer. Yes, I can give it a try as another option.
Oh ok, didn't realize `FlinkLazyInsertIterable` existed. But looking at its usage, I found that removing the producer/consumer may be suitable for the Flink engine. The producer is simply an iterator of `List<HoodieRecord>`, and the consumption is single-threaded too.
Can we also reflect this in the RFC?
Sure, will reflect it in this RFC ASAP. Thanks.
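As a rough sketch of the queue-free alternative discussed in this thread (hypothetical names, assuming the producer side really is just an iterator as noted above): the consumer drains the source directly on one thread, so no queue or handoff overhead is paid at all.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: single-producer/single-consumer without any queue in
// between; the "producer" is just the input iterator, consumed in-line.
class DirectConsumerSketch<R> {
  void consume(Iterator<List<R>> source) {
    while (source.hasNext()) {
      for (R record : source.next()) {
        write(record); // single-threaded consumption, no message-queue overhead
      }
    }
  }

  void write(R record) {
    // hand the record to the underlying write handle
  }
}
```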
Hi @leesf Thanks a lot for your review. Really appreciate it!
@hudi-bot run azure
## Proposers
@zhangyue19921010

## Approvers
You can add me (leesf) as an approver.
Sure. Really appreciate it. Thanks @leesf
Users could also use `DISRUPTOR_EXECUTOR`, which uses the Disruptor as a lock-free message queue to gain better writing performance. Note that `DISRUPTOR_EXECUTOR` is still an experimental feature.
- `hoodie.write.buffer.size`: The size of the Disruptor Executor ring buffer; must be a power of 2. The default/recommended value is 1024.
- `hoodie.write.wait.strategy`: Used for the Disruptor wait strategy. The wait strategy determines how a consumer will wait for events to be placed into the Disruptor by a producer.
Any other configs that need to be exposed to end users as well?
I believe these configs are enough :)
Hi @leesf. Comments are addressed. PTAL. Thanks :)
…oodie Writing Efficiency (apache#5567) Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Thanks a lot for your review @leesf
- Define `DisruptorPublisher` to register producers into Disruptor and control the produce behaviors including life cycle.
- Define `DisruptorMessageHandler` to register consumers into Disruptor and write consumption data from disruptor to hudi data file.
For example we will clear clear out the event after processing it to avoid to avoid unnecessary memory and GC pressure
- Define `HoodieDisruptorEvent` as the carrier of the hoodie message
`clear clear out` and `to avoid to avoid`: duplicated words, syntax error.
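To illustrate the handler design described in that hunk (class names mirror the RFC, bodies are a hedged sketch rather than the actual Hudi implementation): the event is a mutable slot on the ring buffer, and the handler nulls it out after consumption so processed records do not linger and add memory/GC pressure.

```java
import com.lmax.disruptor.EventHandler;

// Sketch: mutable event reused by the ring buffer.
class HoodieDisruptorEvent<R> {
  private R record;
  void set(R record) { this.record = record; }
  R get() { return record; }
  void clear() { this.record = null; } // drop the reference so it can be GC'd
}

// Sketch: consumer registered with the Disruptor; clears each event after use.
class DisruptorMessageHandler<R> implements EventHandler<HoodieDisruptorEvent<R>> {
  @Override
  public void onEvent(HoodieDisruptorEvent<R> event, long sequence, boolean endOfBatch) {
    try {
      consume(event.get()); // e.g. write the record into a Hudi data file
    } finally {
      event.clear(); // clear out the event after processing it
    }
  }

  private void consume(R record) {
    // hand off to the underlying write handle
  }
}
```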
…iting Efficiency (#5416) https://issues.apache.org/jira/browse/HUDI-3963 RFC design : #5567 Add Lock-Free executor to improve hoodie writing throughput and optimize execution efficiency. Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction. Existing BoundedInMemory is the default. Users can enable on a need basis. Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: sivabalan <n.siva.b@gmail.com> * Relocate apache http package (apache#6874) * [HUDI-4975] Fix datahub bundle dependency (apache#6896) * [HUDI-4999] Refactor FlinkOptions#allOptions and CatalogOptions#allOptions (apache#6901) * [MINOR] Update GitHub setting for merge button (apache#6922) Only allow squash and merge. Disable merge and rebase * [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool (apache#6885) * [MINOR] Fix name spelling for RunBootstrapProcedure * [HUDI-4754] Add compliance check in github actions (apache#6575) * [HUDI-4963] Extend InProcessLockProvider to support multiple table ingestion (apache#6847) Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> * [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities (apache#6886) * Implement Create/Drop/Show/Refresh Secondary Index (apache#5933) * [MINOR] Moved readme from .github to the workflows folder (apache#6932) * [HUDI-4952] Fixing reading from metadata table when there are no inflight commits (apache#6836) * Fixing reading from metadata table when there are no inflight commits * Fixing reading from metadata if not fully built out * addressing minor comments * fixing sql conf and options interplay * addressing minor refactoring * [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer (apache#6003) Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5006] Use the same wrapper for timestamp type metadata for parquet and log files (apache#6918) Before this patch, for timestamp type, we use LongWrapper for parquet and TimestampMicrosWrapper for avro log, they may keep different precision val here, for example, with timestamp(3), LongWrapper keeps the val as a millisecond long from EPOCH instant, while TimestampMicrosWrapper keeps the val as micro-seconds. For spark, it uses micro-seconds internally for timestamp type value, while flink uses the TimestampData internally, we better keeps the same precision for better compatibility here. 
* [HUDI-5016] Flink clustering does not reserve commit metadata (apache#6929) * [HUDI-3900] Fixing hdfs setup and tear down in tests to avoid flakiness (apache#6912) * [HUDI-5002] Remove deprecated API usage in SparkHoodieHBaseIndex#generateStatement (apache#6909) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5010] Fix flink hive catalog external config not work (apache#6923) * fix flink catalog external config not work * [HUDI-4948] Improve CDC Write (apache#6818) * improve cdc write to support multiple log files * update: use map to store the cdc stats * [HUDI-5030] Fix TestPartialUpdateAvroPayload.testUseLatestRecordMetaValue(apache#6948) * [HUDI-5033] Fix Broken Link In MultipleSparkJobExecutionStrategy (apache#6951) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5037] Upgrade org.apache.thrift:libthrift to 0.14.0 (apache#6941) * [MINOR] Fixing verbosity of docker set up (apache#6944) * [HUDI-5022] Make better error messages for pr compliance (apache#6934) * [HUDI-5003] Fix the type of InLineFileSystem`startOffset to long (apache#6916) * [HUDI-4855] Add missing table configs for bootstrap in Deltastreamer (apache#6694) * [MINOR] Handling null event time (apache#6876) * [MINOR] Update DOAP with 0.12.1 Release (apache#6988) * [MINOR] Increase maxParameters size in scalastyle (apache#6987) * [HUDI-3900] Closing resources in TestHoodieLogRecord (apache#6995) * [MINOR] Test case for hoodie.merge.allow.duplicate.on.inserts (apache#6949) * [HUDI-4982] Add validation job for spark bundles in GitHub Actions (apache#6954) * [HUDI-5041] Fix lock metric register confict error (apache#6968) Co-authored-by: hbg <bingeng.huang@shopee.com> * [HUDI-4998] Infer partition extractor class first from meta sync partition fields (apache#6899) * [HUDI-4781] Allow omit metadata fields for hive sync (apache#6471) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4997] Use jackson-v2 import instead of jackson-v1 (apache#6893) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-3900] Fixing tempDir usage in TestHoodieLogFormat (apache#6981) * [HUDI-4995] Relocate httpcomponents (apache#6906) * [MINOR] Update GitHub setting for branch protection (apache#7008) - require at least 1 approving review * [HUDI-4960] Upgrade jetty version for timeline server (apache#6844) Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5046] Support all the hive sync options for flink sql (apache#6985) * [MINOR] fix cdc flake ut (apache#7016) * [MINOR] Remove redundant space in PR compliance check (apache#7022) * [HUDI-5063] Enabling run time stats to be serialized with commit metadata (apache#7006) * [HUDI-5070] Adding lock provider to testCleaner tests since async cleaning is invoked (apache#7023) * [HUDI-5070] Move flaky cleaner tests to separate class (apache#7034) * [HUDI-4971] Remove direct use of kryo from `SerDeUtils` (apache#7014) Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> * [HUDI-5081] Tests clean up in hudi-utilities (apache#7033) * [HUDI-5027] Replace hardcoded hbase config keys with constant variables (apache#6946) * [MINOR] add commit_action output in show_commits (apache#7012) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception (apache#7001) Co-authored-by: liufangqi.chenfeng <liufangqi.chenfeng@BYTEDANCE.COM> * [MINOR] Skip loading last completed txn for single writer (apache#6660) Co-authored-by: sivabalan <n.siva.b@gmail.com> * 
[HUDI-4281] Using hudi to build a large number of tables in spark on hive causes OOM (apache#5903) * [HUDI-5042] Fix clustering schedule problem in flink when enable schedule clustering and disable async clustering (apache#6976) Co-authored-by: hbg <bingeng.huang@shopee.com> * [HUDI-4753] more accurate record size estimation for log writing and spillable map (apache#6632) * [HUDI-4201] Cli tool to get warned about empty non-completed instants from timeline (apache#6867) * [HUDI-5038] Increase default num_instants to fetch for incremental source (apache#6955) * [HUDI-5049] Supports dropPartition for Flink catalog (apache#6991) * for both dfs and hms catalogs * [HUDI-4809] glue support drop partitions (apache#7007) Co-authored-by: xxhua <xxhua@freewheel.tv> * [HUDI-5057] Fix msck repair hudi table (apache#6999) * [HUDI-4959] Fixing Avro's `Utf8` serialization in Kryo (apache#7024) * temp_view_support (apache#6990) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-4982] Add Utilities and Utilities Slim + Spark Bundle testing to GH Actions (apache#7005) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5085]When a flink job has multiple sink tables, the index loading status is abnormal (apache#7051) * [HUDI-5089] Refactor HoodieCommitMetadata deserialization (apache#7055) * [HUDI-5058] Fix flink catalog read spark table error : primary key col can not be nullable (apache#7009) * [HUDI-5087] Fix incorrect merging sequence for Column Stats Record in `HoodieMetadataPayload` (apache#7053) * [HUDI-5087]Fix incorrect maxValue getting from metatable [HUDI-5087]Fix incorrect maxValue getting from metatable * Fixed `HoodieMetadataPayload` merging seq; Added test * Fixing handling of deletes; Added tests for handling deletes; * Added tests for combining partition files-list record Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> * [HUDI-4946] fix merge into with no preCombineField having dup row by only insert (apache#6824) * [HUDI-5072] Extract `ExecutionStrategy#transform` duplicate code (apache#7030) * [HUDI-3287] Remove hudi-spark dependencies from hudi-kafka-connect-bundle (apache#6079) * [HUDI-5000] Support schema evolution for Hive/presto (apache#6989) Co-authored-by: z00484332 <zhaolong36@huawei.com> * [HUDI-4716] Avoid parquet-hadoop-bundle in hudi-hadoop-mr (apache#6930) * [HUDI-5035] Remove usage of deprecated HoodieTimer constructor (apache#6952) Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5083]Fixed a bug when schema evolution (apache#7045) * [HUDI-5102] source operator(monitor and reader) support user uid (apache#7085) * Update HoodieTableSource.java Co-authored-by: chenzhiming <chenzhm@chinatelecom.cn> * [HUDI-5057] Fix msck repair external hudi table (apache#7084) * [MINOR] Fix typos in Spark client related classes (apache#7083) * [HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796) Co-authored-by: jian.feng <jian.feng@shopee.com> * [MINOR] use default maven version since it already fix the warnings recently (apache#6863) Co-authored-by: jian.feng <jian.feng@shopee.com> * Revert "[HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796)" (apache#7090) This reverts commit e222693. 
* [MINOR] Fix doc of org.apache.hudi.sink.meta.CkpMetadata#bootstrap (apache#7048) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4799] improve analyzer exception tip when cannot resolve expression (apache#6625) * [HUDI-5096] Upgrade jcommander to 1.78 (apache#7068) - resolves security vulnerability - resolves NPE issues with HiveSyncTool args parsing Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (apache#7091) * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql * [HUDI-5107] Fix hadoop config in DirectWriteMarkers, HoodieFlinkEngineContext and StreamerUtil are not consistent issue (apache#7094) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [MINOR] Fix OverwriteWithLatestAvroPayload full class name (apache#7096) * [HUDI-5074] Warn if table for metastore sync has capitals in it (apache#7077) Co-authored-by: Jonathan Vexler <=> * [HUDI-5124] Fix HoodieInternalRowFileWriter#canWrite error return tag. (apache#7107) Co-authored-by: slfan1989 <louj1988@@> * [MINOR] update commons-codec:commons-codec 1.4 to 1.13 (apache#6959) * [HUDI-5148] Claim RFC-63 for Index on Function and Logical Partitioning (apache#7114) * [HUDI-5065] Call close on SparkRDDWriteClient in HoodieCleaner (apache#7101) Co-authored-by: Jonathan Vexler <=> * [HUDI-4624] Implement Closable for S3EventsSource (apache#7086) Co-authored-by: Jonathan Vexler <=> * [HUDI-5045] Adding support to configure index type with integ tests (apache#6982) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency (apache#5416) https://issues.apache.org/jira/browse/HUDI-3963 RFC design : apache#5567 Add Lock-Free executor to improve hoodie writing throughput and optimize execution efficiency. Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction. Existing BoundedInMemory is the default. Users can enable on a need basis. Co-authored-by: yuezhang <yuezhang@freewheel.tv> * [HUDI-5076] Fixing non serializable path used in engineContext with metadata table intialization (apache#7036) * [HUDI-5032] Add archive to cli (apache#7076) Adding archiving capability to cli. Co-authored-by: Jonathan Vexler <=> * [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task (apache#6733) * [HUDI-5147] Flink data skipping doesn't work when HepPlanner calls copy()… (apache#7113) * [HUDI-5147] Flink data skipping doesn't work when HepPlanner calls copy() on HoodieTableSource * [MINOR] Fixing broken test (apache#7123) * [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table (apache#6741) * [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table * Update HiveAvroSerializer.java otherwise payload string type combine field will cause cast exception * [HUDI-5126] Delete duplicate configuration items PAYLOAD_CLASS_NAME (apache#7103) * [HUDI-4989] Fixing deltastreamer init failures (apache#6862) Fixing handling missing hoodie.properties * [MINOR] Fix flaky test in ITTestHoodieDataSource (apache#7134) * [HUDI-4071] Remove default value for mandatory record key field (apache#6681) * [HUDI-5088]Fix bug:Failed to synchronize the hive metadata of the Flink table (apache#7056) * sync `_hoodie_operation` meta field if changelog mode is enabled. 
* [MINOR] Removing spark2 scala12 combinations from readme (apache#7112) * [HUDI-5153] Fix the write token name resolution of cdc log file (apache#7128) * [HUDI-5066] Support flink hoodie source metaclient cache (apache#7017) * [HUDI-5132] Add hadoop-mr bundle validation (apache#7157) * [HUDI-2673] Add kafka connect bundle to validation test (apache#7131) * [HUDI-5082] Improve the cdc log file name format (apache#7042) * [HUDI-5154] Improve hudi-spark-client Lambada writing (apache#7127) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5178] Add Call show_table_properties for spark sql (apache#7161) * [HUDI-5067] Merge the columns stats of multiple log blocks from the same log file (apache#7018) * [HUDI-5025] Rollback failed with log file not found when rollOver in rollback process (apache#6939) * fix rollback file not found * [HUDI-4526] Improve spillableMapBasePath when disk directory is full (apache#6284) * [minor] Refactor the code for CkpMetadata (apache#7166) * [HUDI-5111] Improve integration test coverage (apache#7092) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5187] Remove the preCondition check of BucketAssigner assign state (apache#7170) * [HUDI-5145] Avoid starting HDFS in hudi-utilities tests (apache#7171) * [MINOR] Performance improvement of flink ITs with reused miniCluster (apache#7151) * implement MiniCluster extension compatible with junit5 * Make local build work * Delete files removed in OSS * Fix bug in testing * Upgrade to version release-v0.10.0 Co-authored-by: 5herhom <543872547@qq.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> Co-authored-by: Jon Vexler <jon@onehouse.ai> Co-authored-by: simonsssu <barley0806@gmail.com> Co-authored-by: y0908105023 <283999377@qq.com> Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com> Co-authored-by: Volodymyr Burenin <vburenin@gmail.com> Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: 冯健 <fengjian428@gmail.com> Co-authored-by: jian.feng <jian.feng@shopee.com> Co-authored-by: FocusComputing <xiaoxingstack@gmail.com> Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> Co-authored-by: Paul Zhang <xzhangyao@126.com> Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: eric9204 <90449228+eric9204@users.noreply.github.com> Co-authored-by: dongsj <dongsj@asiainfo.com> Co-authored-by: Kyle Zhike Chen <zk.chan007@gmail.com> Co-authored-by: Yann Byron <biyan900116@gmail.com> Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com> Co-authored-by: dohongdayi <dohongdayi@126.com> Co-authored-by: shaoxiong.zhan <31836510+microbearz@users.noreply.github.com> Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com> Co-authored-by: Manu <36392121+xicm@users.noreply.github.com> Co-authored-by: Nicolas Paris <nicolas.paris@riseup.net> Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: RexAn <bonean131@gmail.com> Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com> Co-authored-by: Rahil Chertara <rchertar@amazon.com> Co-authored-by: Timothy Brown <tim@onehouse.ai> Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com> Co-authored-by: Shawn Chang <yxchang@amazon.com> Co-authored-by: ForwardXu <forwardxu315@gmail.com> Co-authored-by: wangxianghu <wangxianghu@apache.org> Co-authored-by: wulei <wulei.1023@bytedance.com> Co-authored-by: Xingjun Wang 
<wongxingjun@126.com> Co-authored-by: Prasanna Rajaperumal <prasanna.raj@live.com> Co-authored-by: xingjunwang <xingjunwang@tencent.com> Co-authored-by: liujinhui <965147871@qq.com> Co-authored-by: 苏承祥 <scx_white@aliyun.com> Co-authored-by: 苏承祥 <sucx@tuya.com> Co-authored-by: ChanKyeong Won <brightwon.dev@gmail.com> Co-authored-by: Zouxxyy <zouxxyy@qq.com> Co-authored-by: Nicholas Jiang <programgeek@163.com> Co-authored-by: KnightChess <981159963@qq.com> Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com> Co-authored-by: voonhous <voonhousu@gmail.com> Co-authored-by: TengHuo <teng_huo@outlook.com> Co-authored-by: hj2016 <hj3245459@163.com> Co-authored-by: huangjing02 <huangjing02@bilibili.com> Co-authored-by: jsbali <jsbali@uber.com> Co-authored-by: Leon Tsao <31072303+gnailJC@users.noreply.github.com> Co-authored-by: leon <leon@leondeMacBook-Pro.local> Co-authored-by: 申胜利 <48829688+shenshengli@users.noreply.github.com> Co-authored-by: aiden.dong <782112163@qq.com> Co-authored-by: dujunling <dujunling@bytedance.com> Co-authored-by: Pramod Biligiri <pramodbiligiri@gmail.com> Co-authored-by: Zouxxyy <zouxinyu.zxy@alibaba-inc.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Co-authored-by: Surya Prasanna <syalla@uber.com> Co-authored-by: Rajesh Mahindra <76502047+rmahindra123@users.noreply.github.com> Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: huberylee <shibei.lh@foxmail.com> Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com> Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: slfan1989 <55643692+slfan1989@users.noreply.github.com> Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: 吴祥平 <408317717@qq.com> Co-authored-by: wangzeyu <hameizi369@gmail.com> Co-authored-by: vvsd <40269480+vvsd@users.noreply.github.com> Co-authored-by: Zhaojing Yu <yuzhaojing@bytedance.com> Co-authored-by: Bingeng Huang <304979636@qq.com> Co-authored-by: hbg <bingeng.huang@shopee.com> Co-authored-by: that's cool <1059023054@qq.com> Co-authored-by: liufangqi.chenfeng <liufangqi.chenfeng@BYTEDANCE.COM> Co-authored-by: Yuwei XIAO <ywxiaozero@gmail.com> Co-authored-by: gavin <zhangrenhuaman@163.com> Co-authored-by: Jon Vexler <jbvexler@gmail.com> Co-authored-by: Xixi Hua <smilecrazy1h@gmail.com> Co-authored-by: xxhua <xxhua@freewheel.tv> Co-authored-by: YangXiao <919869387@qq.com> Co-authored-by: chao chen <59957056+waywtdcc@users.noreply.github.com> Co-authored-by: Zhangshunyu <zhangshunyu1990@126.com> Co-authored-by: Long Zhao <294514940@qq.com> Co-authored-by: z00484332 <zhaolong36@huawei.com> Co-authored-by: 矛始 <1032851561@qq.com> Co-authored-by: chenzhiming <chenzhm@chinatelecom.cn> Co-authored-by: lvhu-goodluck <81349721+lvhu-goodluck@users.noreply.github.com> Co-authored-by: alberic <cnuliuweiren@gmail.com> Co-authored-by: lxxyyds <114218541+lxxawfl@users.noreply.github.com> Co-authored-by: Alexander Trushev <42293632+trushev@users.noreply.github.com> Co-authored-by: xiarixiaoyao <mengtao0326@qq.com> Co-authored-by: windWheel <1817802738@qq.com> Co-authored-by: Alexander Trushev <trushev.alex@gmail.com> Co-authored-by: Shizhi Chen <107476116+chenshzh@users.noreply.github.com>
https://issues.apache.org/jira/browse/HUDI-3963
What is the purpose of the pull request
This pull request adds the RFC-53 design document (rfc/rfc-53/rfc-53.md) proposing a lock-free message queue based on the LMAX Disruptor to improve Hudi writing throughput and execution efficiency (HUDI-3963). The existing BoundedInMemory executor remains the default; the Disruptor executor is opt-in. The implementation itself is tracked in apache#5416.
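For readers unfamiliar with the Disruptor, the sketch below shows the core producer/consumer handoff on its pre-allocated ring buffer, which is the mechanism this RFC builds on. It is a minimal, self-contained Java example against the Disruptor 3.x DSL; the RecordEvent type and the handler are illustrative stand-ins, not Hudi code.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorHandoffSketch {

  // Illustrative event: one slot of the pre-allocated ring buffer.
  static class RecordEvent {
    String payload;
  }

  public static void main(String[] args) {
    // Buffer size must be a power of 2 (a Disruptor requirement).
    Disruptor<RecordEvent> disruptor = new Disruptor<>(
        RecordEvent::new, 1024, DaemonThreadFactory.INSTANCE);

    // Consumer side: analogous to the executor's write handler.
    disruptor.handleEventsWith(
        (RecordEvent event, long sequence, boolean endOfBatch) ->
            System.out.println("consumed: " + event.payload));

    RingBuffer<RecordEvent> ringBuffer = disruptor.start();

    // Producer side: claim a sequence, fill the slot, publish.
    // No lock is taken; coordination happens via sequence barriers.
    for (int i = 0; i < 8; i++) {
      long seq = ringBuffer.next();
      try {
        ringBuffer.get(seq).payload = "record-" + i;
      } finally {
        ringBuffer.publish(seq);
      }
    }
    disruptor.shutdown(); // drains outstanding events, then stops
  }
}
```

The key point is that the producer claims a sequence and publishes it without taking a lock, while consumers wait on published sequences according to a configurable wait strategy, which is what removes the queue lock contention this RFC targets.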
Brief change log
- Add rfc/rfc-53/rfc-53.md describing the motivation, design, and configuration of a Disruptor-based lock-free executor for Hudi writing
- Document how the new executor compares with the default BoundedInMemory executor and when users should opt in
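As a usage sketch, enabling the proposed executor from a Spark write could look like the following. The `hoodie.write.executor.type` and `hoodie.write.buffer.size` option names follow this RFC's proposal and should be treated as assumptions (final names and defaults may differ once the feature lands); the table path and toy schema are made up for the demo.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DisruptorExecutorWriteExample {
  public static void main(String[] args) {
    // Requires hudi-spark-bundle on the classpath.
    SparkSession spark = SparkSession.builder()
        .appName("disruptor-executor-demo")
        .master("local[2]")
        .getOrCreate();

    // Toy input frame with the usual Hudi demo columns.
    Dataset<Row> df = spark.range(0, 100).selectExpr(
        "cast(id as string) as uuid",
        "current_timestamp() as ts",
        "cast(id % 4 as string) as partitionpath");

    df.write().format("hudi")
        .option("hoodie.table.name", "disruptor_demo")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
        // Proposed RFC-53 options; opt in to the lock-free executor.
        .option("hoodie.write.executor.type", "DISRUPTOR_EXECUTOR")
        .option("hoodie.write.buffer.size", "1024") // must be a power of 2
        .mode(SaveMode.Overwrite)
        .save("/tmp/disruptor_demo");

    spark.stop();
  }
}
```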
Verify this pull request
This pull request is a documentation-only change (it adds the RFC-53 design document) without any test coverage.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.