
[RFC] Claim RFC-62 Diagnostic Reporter #6599

Merged · 1 commit · Sep 5, 2022

Conversation

@zhangyue19921010 (Contributor) commented Sep 5, 2022

Change Logs

As Hudi develops, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements. Subsequently, some of them ask the community for help with questions such as: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail? And so on.

For the volunteers in the Hudi community, dealing with such issues usually starts with asking users to provide a list of information, including engine context, job configs, data pattern, Spark UI, etc. Users then need to spend extra effort reviewing their own jobs, collecting metrics one by one according to the list, and feeding the results back to the volunteers. Moreover, unexpected errors may occur while users manually collect this information.

Obviously, this imposes relatively high communication costs on both volunteers and users.

On the other hand, advanced users also need a way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.

To expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter tool. The tool can be turned on as the final stage of an ingestion job, after the commit, where it collects common troubleshooting information, including engine runtime information (taking Spark as the example here), and generates a diagnostic report JSON file.

Alternatively, users can trigger the diagnostic reporter through hudi-cli to generate the same report JSON file.
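For illustration only, a report produced by such a tool might be assembled along these lines. Every field name below is a guess for the sake of the sketch, not the schema the RFC will define:

```python
import json

def build_diagnostic_report(commit_time, job_confs, runtime_info):
    """Assemble a diagnostic report dict.

    All keys here ("commitTime", "engine", ...) are hypothetical
    placeholders for whatever the RFC's report schema ends up being.
    """
    return {
        "commitTime": commit_time,
        "engine": "spark",
        "jobConfigs": job_confs,
        "runtime": runtime_info,
    }

report = build_diagnostic_report(
    "20220905120000",
    {"hoodie.datasource.write.operation": "upsert"},
    {"executors": 8, "sparkVersion": "3.2.1"},
)
print(json.dumps(report, indent=2))
```

The point of the JSON format is that users can attach a single machine-readable file to a support thread instead of gathering screenshots and configs by hand.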

Impact

No impact.

**Risk level: none**

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan (Contributor) commented:

Can you fill in the PR template, please?

@hudi-bot commented Sep 5, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@yihua yihua changed the title [RFC] claim RFC-62 Diagnostic Reporter [RFC] Claim RFC-62 Diagnostic Reporter Sep 5, 2022
@yihua yihua merged commit af78567 into apache:master Sep 5, 2022
@zhangyue19921010 (Contributor, Author) commented:

> Can you fill in the PR template, please?

Sure! Added.

yuzhaojing pushed a commit that referenced this pull request Sep 23, 2022
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
TengHuo pushed a commit to TengHuo/hudi that referenced this pull request Nov 28, 2022
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
neverdizzy pushed a commit to neverdizzy/hudi that referenced this pull request Dec 13, 2022
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
(cherry picked from commit af78567)
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
vinishjail97 added a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
* [HUDI-3984] Remove mandatory check of partiton path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file
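The create-tmp-then-rename pattern this fix describes can be sketched on a local filesystem. This is a Python stand-in for the HDFS `FileSystem` calls; `write_immutable` and the file names are illustrative, not Hudi's actual API:

```python
import os
import tempfile

def write_immutable(path: str, data: bytes) -> None:
    """Write to <path>.tmp first, then rename into place, so concurrent
    readers never observe an empty or partially written file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit disk before the rename
    os.replace(tmp, path)  # atomic rename on a POSIX filesystem

# Minimal demo: readers of `target` see either nothing or the full content.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "commit.meta")
    write_immutable(target, b'{"state": "COMPLETED"}')
    assert open(target, "rb").read() == b'{"state": "COMPLETED"}'
```

The rename step is what makes the file appear atomically; downstream readers polling the path can no longer race with a half-written commit metadata file.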

* [HUDI-3953]Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <voonhou.su@shopee.com>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (apache#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provider UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] following 4152, refactor the clazz about plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* fix for updateTableParameters which is not excluding partition columns and updateTableProperties boolean check

* Fix - serde parameters getting overrided on table property update

* removing stale syncConfig

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer from these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Fix file group count issue with metadata partitions (apache#5892)

* [HUDI-4098] Support HMS for flink HudiCatalog (apache#6082)

* [HUDI-4098]Support HMS for flink HudiCatalog

* [HUDI-4409] Improve LockManager wait logic when catch exception (apache#6122)

* [HUDI-4065] Add FileBasedLockProvider (apache#6071)

* [HUDI-4416] Default database path for hoodie hive catalog (apache#6136)

* [HUDI-4372] Enable matadata table by default for flink (apache#6066)

* [HUDI-4401] Skip HBase version check (apache#6114)

* Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

* [HUDI-4427] Add a computed column IT test (apache#6150)

* [HUDI-4146][RFC-55] Update config changes proposal (apache#6162)

* [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (apache#5428)

Currently, all Hudi Relations bear performance gap relative to Spark's HadoopFsRelation 
and the reason to that is SchemaPruning optimization rule (pruning nested schemas) 
that is unfortunately predicated on usage of HadoopFsRelation, meaning that it's 
not applied in cases when any other relation is used.

This change is porting this rule to Hudi relations (MOR, Incremental, etc) 
by the virtue of leveraging HoodieSparkSessionExtensions mechanism 
injecting modified version of the original SchemaPruning rule 
that is adopted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation

* [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (apache#4915)

Currently when doing Hudi queries w/ Spark, it won't 
load the external configurations. Say if customers enabled 
metadata listing in their global config file, then this would 
let them actually query w/o metadata feature enabled. 
This PR fixes this issue and allows loading global 
configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (apache#5470)

* [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (apache#6161)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (apache#6113)

Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.

* [HUDI-4204] Fixing NPE with row writer path and with OCC (apache#5850)

* [HUDI-4247] Upgrading protocol buffers version for presto bundle (apache#5852)

* [MINOR] Fix result missing information issue in commits_compare Procedure (apache#6165)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4404] Fix insert into dynamic partition write misalignment (apache#6124)

* [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (apache#6175)

- Fixes broken ITTestHoodieDemo#testParquetDemo

* [MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)

* [HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (apache#5523)

This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)

* [MINOR] Disable Flink compactor IT test (apache#6189)

* Revert "[MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)" (apache#6192)

This reverts commit d5c904e.

* [HUDI-3979] Optimize out mandatory columns when no merging is performed (apache#5430)

For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless query is referencing these). Avoiding reading these allows to potentially save substantial resources wasted for reading it out.

* [HUDI-4303] Use Hive sentinel value as partition default to avoid type caste issues (apache#5954)

* Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)" (apache#6160)

This reverts commit 046044c.

* [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (apache#6155)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4437] Fix test conflicts by clearing file system cache (apache#6123)

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4436] Invalidate cached table in Spark after write (apache#6159)

Co-authored-by: Ryan Pifer <rmpifer@umich.edu>

* [MINOR] Fix Call Procedure code style (apache#6186)

* Fix Call Procedure code style.
Co-authored-by: superche <superche@tencent.com>

* [MINOR] Bump CI timeout to 150m (apache#6198)

* [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (apache#6163)

Co-authored-by: Ryan Pifer <rmpifer@umich.edu>

* [HUDI-4071] Make NONE sort mode as default for bulk insert (apache#6195)

* [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations  (apache#5708)

* [HUDI-4448] Remove the latest commit refresh for timeline server (apache#6179)

* [HUDI-4450] Revert the checkpoint abort notification (apache#6181)

* [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (apache#6164)

Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4348] fix merge into sql data quality in concurrent scene (apache#6020)

* [HUDI-3510] Add sync validate procedure (apache#6200)

* [HUDI-3510] Add sync validate procedure

Co-authored-by: simonssu <simonssu@tencent.com>

* [MINOR] Fix typos in Spark client related classes (apache#6204)

* [HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness  (apache#6201)

* [MINOR] Only log stdout output for non-zero exit from commands in IT (apache#6199)

* [HUDI-4458] Add a converter cache for flink ColumnStatsIndices (apache#6205)

* [HUDI-4071] Match ROLLBACK_USING_MARKERS_ENABLE in sql as datasource (apache#6206)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4455] Improve test classes for TestHiveSyncTool (apache#6202)

Improve HiveTestService, HiveTestUtil, and related classes.

* [HUDI-4456] Clean up test resources (apache#6203)

* [HUDI-3884] Support archival beyond savepoint commits (apache#5837)


Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping  (apache#5746)

We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.

* [HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common

* [HUDI-4474] Infer metasync configs (apache#6217)

- infer repeated sync configs from original configs
  - `META_SYNC_BASE_FILE_FORMAT`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
  - `META_SYNC_ASSUME_DATE_PARTITION`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
  - `META_SYNC_DECODE_PARTITION`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`

As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes

* [HUDI-4210] Create custom hbase index to solve data skew issue on hbase regions (apache#5797)

* [HUDI-3730] Keep metasync configs backward compatible (apache#6221)

* [HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (apache#6214)

* [HUDI-4186] Support Hudi with Spark 3.3.0 (apache#5943)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4126] Disable file splits for Bootstrap real time queries (via InputFormat) (apache#6219)


Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4490] Make AWSDmsAvroPayload class backwards compatible (apache#6229)

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

* [HUDI-4484] Add default lock config options for flink metadata table (apache#6222)

* [HUDI-4494] keep the fields' order when data is written out of order (apache#6233)

* [MINOR] Minor changes around Spark 3.3 support (apache#6231)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (apache#6213)

* [HUDI-4495] Fix handling of S3 paths incompatible with java URI standards (apache#6237)

* [HUDI-4499] Tweak default retry times for flink metadata table lock (apache#6238)

* [HUDI-4221] Optimzing getAllPartitionPaths  (apache#6234)

- Levering spark par for dir processing

* Moving to 0.13.0-SNAPSHOT on master branch.

* [HUDI-4504] Disable metadata table by default for flink (apache#6241)

* [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (apache#6242)

To avoid unnecessary exception throws

* [HUDI-4507] Improve file name extraction logic in metadata utils (apache#6250)

* [MINOR] Fix convertPathWithScheme tests (apache#6251)

* [MINOR] Add license header (apache#6247)

Add license header to TestConfigUtils

* [HUDI-4025] Add Presto and Trino query node to validate queries (apache#5578)

* Add Presto and Trino query nodes to hudi-integ-test
* Add yamls for query validation
* Add presto-jdbc and trino-jdbc to integ-test-bundle

* [HUDI-4518] Free lock if allocated but not acquired (apache#6272)

If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue apache#5702
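The fix can be sketched as a retry loop that explicitly frees an allocated-but-not-acquired lock before the next attempt. All names below (`LockState`, `try_acquire_with_retry`) are hypothetical stand-ins, not Hudi's actual `LockManager` API:

```python
import enum

class LockState(enum.Enum):
    ALLOCATED = 1
    ACQUIRED = 2
    RELEASED = 3

class Lock:
    def __init__(self):
        self.state = LockState.ALLOCATED

    def release(self):
        self.state = LockState.RELEASED

def try_acquire_with_retry(acquire_fn, retries=3):
    """Retry acquisition; if a lock object was allocated but never
    transitioned to ACQUIRED, release it so the retry can proceed
    instead of failing on the stale allocation."""
    for _ in range(retries):
        lock = acquire_fn()
        if lock is not None:
            if lock.state == LockState.ACQUIRED:
                return lock
            lock.release()  # free the partial allocation before retrying
    raise TimeoutError("could not acquire lock")
```

Without the `release()` call, the second iteration would find the provider still holding the first, never-acquired allocation, which is the failure mode the PR addresses.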

* [HUDI-4510] Repair config "hive_sync.metastore.uris" in flink sql hive schema sync is not effective (apache#6257)

* [HUDI-3848] Fixing minor bug in listing based rollback request generation (apache#6244)

* [HUDI-4512][HUDI-4513] Fix bundle name for spark3 profile (apache#6261)

* [HUDI-4501] Throwing exception when restore is attempted with hoodie.arhive.beyond.savepoint is enabled (apache#6239)

* [HUDI-4516] fix Task not serializable error when run HoodieCleaner after one failure (apache#6265)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* remove test resources (apache#6147)

Co-authored-by: root <root@TCN1004532-1.tcent.cn>

* [HUDI-4477] Adjust partition number of flink sink task (apache#6218)

Co-authored-by: lewinma <lewinma@tencent.com>

* [HUDI-4298] Mor table reading for base and log files lost sequence of events (apache#6286)

* [HUDI-4298] Mor table reading for base and log files lost sequence of events

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4525] Fixing Spark 3.3 `AvroSerializer` implementation (apache#6279)

* [HUDI-4447] fix no partitioned path extractor error when sync meta (apache#6263)

* [HUDI-4520] Support qualified table 'db.table' in call procedures (apache#6274)

* [HUDI-4531] Wrong partition path for flink hive catalog when the partition fields are not in the last (apache#6292)

* [HUDI-4487] support to create ro/rt table by spark sql (apache#6262)

* [HUDI-4533] Fix RunCleanProcedure's ArrayIndexOutOfBoundsException (apache#6293)

* [HUDI-4536] ClusteringOperator causes the NullPointerException when writing with BulkInsertWriterHelper in clustering (apache#6298)

* [HUDI-4385] Support online compaction in the flink batch mode write (apache#6093)

* [HUDI-4385] Support online compaction in the flink batch mode write

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4530] fix default payloadclass in mor is different with cow (apache#6288)

* [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (apache#6306)

* [HUDI-4544] support retain hour cleaning policy for flink (apache#6300)

* [HUDI-4547] Fix SortOperatorGen sort indices (apache#6309)

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4470] Remove spark dataPrefetch disabled prop in DefaultSource

* [HUDI-4540] Cover different table types in functional tests of Spark structured streaming (apache#6317)

* [HUDI-4514] optimize CTAS to adapt to saveAsTable api in different modes (apache#6295)

* [HUDI-4474] Fix inferring props for meta sync (apache#6310)

- HoodieConfig#setDefaults looks up declared fields, so 
  should pass static class for reflection, otherwise, subclasses 
  of HoodieSyncConfig won't set defaults properly.
- Pass all write client configs of deltastreamer to meta sync
- Make org.apache.hudi.hive.MultiPartKeysValueExtractor 
  default for deltastreamer, to align with SQL and flink
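The declared-fields pitfall described above has a direct analogue in Python: a class's own `__dict__` (like Java's `Class.getDeclaredFields()`) lists only attributes declared directly on that class, not inherited ones. A minimal sketch, with made-up class and option names:

```python
class BaseConfig:
    BASE_PATH = "hoodie.base.path"

class SubConfig(BaseConfig):
    TABLE_NAME = "hoodie.table.name"

def declared_keys(cls):
    """Analogue of getDeclaredFields(): attributes declared on cls
    itself, excluding anything inherited from its base classes."""
    return [k for k in vars(cls) if not k.startswith("_")]

# declared_keys(SubConfig) contains TABLE_NAME but NOT BASE_PATH --
# which is why defaults must be looked up against the static class
# that actually declares each option, not the runtime subclass.
```

This mirrors why `HoodieConfig#setDefaults` must be handed the declaring class: reflecting over a subclass silently skips every default declared on its parents.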

* [HUDI-4550] Fallback to listing based rollback for completed instant (apache#6313)

Ideally, rollback is not triggered for completed instants. 
However, if it gets triggered due to some extraneous condition 
or forced while rollback strategy still configured to be marker-based, 
then fallback to listing-based rollback instead of failing.

- CTOR changes in rollback plan and action executors.
- Change in condition to determine whether to use marker-based rollback.
- Added UT to cover the scenario.

* [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value (apache#6248)

- Added FourToFiveUpgradeHandler to detect hudi tables with "default" partition and throwing exception.
- Added a new write config ("hoodie.skip.default.partition.validation") when enabled, will bypass the above validation. If users have a hudi table where "default" partition was created intentionally and not as sentinel, they can enable this config to get past the validation.

* [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis (apache#6307)

* [HUDI-4534] Fixing upgrade to reload Metaclient for deltastreamer writes (apache#6296)

* [HUDI-4517] If no marker type file, fallback to timeline based marker (apache#6266)

- If MARKERS.type file is not present, the logic assumes that the direct markers are stored, which causes the read failure in certain cases even where timeline server based marker is enabled. This PR handles the failure by falling back to timeline based marker in such cases.

* [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)

- Adding request retry to RemoteHoodieTableFileSystemView. Users can enable using the new configs added.

* [HUDI-4464] Clear warnings in Azure CI (apache#6210)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* [MINOR] Update PR description template (apache#6323)

* [HUDI-4508] Repair the exception when reading optimized query for mor in hive and presto/trino (apache#6254)

In MOR table, file slice may just have log file but no base file, 
before the file slice is compacted. In this case, read-optimized 
query will match the condition !baseFileOpt.isPresent() in HoodieCopyOnWriteTableInputFormat.createFileStatusUnchecked() 
and throw IllegalStateException.

Instead of throwing exception, 
it is more suitable to query nothing in the file slice.

Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4548] Unpack the column max/min to string instead of Utf8 for Mor table (apache#6311)

* [HUDI-4447] fix SQL metasync when perform delete table operation (apache#6180)

* [HUDI-4424] Add new compactoin trigger stratgy: NUM_COMMITS_AFTER_REQ… (apache#6144)

* [MINOR] improve flink dummySink's parallelism (apache#6325)

* [HUDI-4568] Shade dropwizard metrics-core in hudi-aws-bundle (apache#6327)

* [HUDI-4572] Fix 'Not a valid schema field: ts' error in HoodieFlinkCompactor if precombine field is not ts (apache#6331)

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4570] Fix hive sync path error due to reuse of storage descriptors. (apache#6329)

* [HUDI-4571] Fix partition extractor infer function when partition field mismatch (apache#6333)

Infer META_SYNC_PARTITION_FIELDS and 
META_SYNC_PARTITION_EXTRACTOR_CLASS 
from hoodie.table.partition.fields first. 
If not set, then from hoodie.datasource.write.partitionpath.field.

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4570] Add test for updating multiple partitions in hive sync (apache#6340)

* [MINOR] Fix wrong key to determine sync sql cascade (apache#6339)

* [HUDI-4581] Claim RFC-58 for data skipping integration with query engines (apache#6346)

* [HUDI-4577] Adding test coverage for `DELETE FROM`, Spark Quickstart guide (apache#6318)

* [HUDI-4556] Improve functional test coverage of column stats index (apache#6319)

* [HUDI-4558] lost 'hoodie.table.keygenerator.class' in hoodie.properties (apache#6320)

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4543] Support natural order when table schema contains a field named 'ts' (apache#6246)

* be able to disable precombine field when table schema contains a field named ts

Co-authored-by: jian yonghua <jianyonghua@163.com>

* [HUDI-4569][RFC-58] Claim RFC-58 for adding a new feature named 'Multiple event_time Fields Latest Verification in a Single Table' for Hudi (apache#6328)

Co-authored-by: XinyaoTian <leontian1024@gmail.com>

* [HUDI-3503] Support more feature to call procedure CleanCommand (apache#6353)

* [HUDI-4590] Add hudi-aws dependency to hudi-flink-bundle. (apache#6356)

* [MINOR] fix potential npe in spark writer (apache#6363)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* fix bug in cli show fsview all (apache#6314)

* [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency (apache#6228)

* [HUDI-4611] Fix the duplicate creation of config in HoodieFlinkStreamer (apache#6369)

Co-authored-by: linfey <linfey2021@gmail.com>

* [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table (apache#6141)

* Spark support MOR read archived commits for incremental query

* [MINOR] fix progress field calculate logic in HoodieLogRecordReader (apache#6291)

* [HUDI-4608] Fix upgrade command in Hudi CLI (apache#6374)

* [HUDI-4609] Improve usability of upgrade/downgrade commands in Hudi CLI (apache#6377)

* [HUDI-4574] Fixed timeline based marker thread safety issue (apache#6383)

* fixed timeline based markers thread safety issue
* add document for TimelineBasedMarkers thread safety issues

* [HUDI-4621] Add validation that bucket index fields should be subset of primary keys (apache#6396)

* check bucket index fields

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4354] Add --force-empty-sync flag to deltastreamer (apache#6027)

* [HUDI-4601] Read error from MOR table after compaction with timestamp partitioning (apache#6365)

* read error from mor after compaction

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [MINOR] Update DOAP with 0.12.0 Release (apache#6413)

* [HUDI-4529] Tweak some default config options for flink (apache#6287)

* [HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)

* [HUDI-4551] Tweak the default parallelism of flink pipeline to execution env  parallelism (apache#6312)

* [MINOR] Improve code style of CLI Command classes (apache#6427)

* [HUDI-3625] Claim RFC-60 for Federated Storage Layer (apache#6440)

* [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar (apache#6386)

- Adding PulsarSource to DeltaStreamer to support ingesting from Apache Pulsar.
- Current implementation of PulsarSource is relying on "pulsar-spark-connector" to ingest using Spark instead of building similar pipeline from scratch.

* [HUDI-3579] Add timeline commands in hudi-cli (apache#5139)

* [HUDI-4638] Rename payload clazz and preCombine field options for flink sql (apache#6434)

* Revert "[HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)" (apache#6449)

This reverts commit 9055b2f.

* [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set (apache#6443)

* [HUDI-4644] Change default flink profile to 1.15.x (apache#6445)

* [HUDI-4678] Claim RFC-61 for Snapshot view management (apache#6461)

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC (apache#6459)

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4683] Use enum class value for default value in flink options (apache#6453)

* [HUDI-4584] Cleaning up Spark utilities (apache#6351)

Cleans up Spark utilities and removes duplication

* [HUDI-4686] Flip option 'write.ignore.failed' to default false (apache#6467)

Also fix the flaky test

* [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (apache#6267)

* [HUDI-4637] Release thread in RateLimiter doesn't been terminated (apache#6433)

* [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (apache#6481)

* HUDI-4687 add show_invalid_parquet procedure (apache#6480)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [HUDI-4584] Fixing `SQLConf` not being propagated to executor (apache#6352)

Fixes `HoodieSparkUtils.createRDD` to make sure `SQLConf` is properly propagated to the executor (required by `AvroSerializer`)

* [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies (apache#6170)

* [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false (apache#6450)

* [HUDI-4713] Fix flaky ITTestHoodieDataSource#testAppendWrite (apache#6490)

* [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (apache#6494)

* Revert "[HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)" (apache#6501)

This reverts commit 660177b.

* [Stacked on 6386] Fixing `DebeziumSource` to properly commit offsets; (apache#6416)

* [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer (apache#6111)

* [HUDI-4703] use the historical schema to response time travel query (apache#6499)

* [HUDI-4703] use the historical schema to response time travel query

* [HUDI-4549]  Remove avro from hudi-hive-sync-bundle and hudi-aws-bundle (apache#6472)

* Remove avro shading from hudi-hive-sync-bundle
   and hudi-aws-bundle.

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4482] remove guava and use caffeine instead for cache (apache#6240)

* [HUDI-4483] Fix checkstyle in integ-test module (apache#6523)

* [HUDI-4340] fix DateTimeParseException on unparsable text by adding a method parseDateFromInstantTimeSafely for parsing timestamps when outputting metrics (apache#6000)

* [DOCS] Add docs about javax.security.auth.login.LoginException when starting Hudi Sink Connector (apache#6255)

* [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive) (apache#6533)

* [HUDI-4730] Fix batch job cannot clean old commits files (apache#6515)

* [HUDI-4370] Fix batch job cannot clean old commits files

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4740] Add metadata fields for hive catalog #createTable (apache#6541)

* [HUDI-4695] Fixing flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime (apache#6534)

* [HUDI-4193] change protoc version to unblock hudi compilation on m1 mac (apache#6535)

* [HUDI-4438] Fix flaky TestCopyOnWriteActionExecutor#testPartitionMetafileFormat (apache#6546)

* [MINOR] Fix typo in HoodieArchivalConfig (apache#6542)

* [HUDI-4582] Support batch synchronization of partitions to HMS to avoid timeout (apache#6347)


Co-authored-by: xxhua <xxhua@freewheel.tv>

* [HUDI-4742] Fix AWS Glue partition's location is wrong when updatePartition (apache#6545)

Co-authored-by: xxhua <xxhua@freewheel.tv>

* [HUDI-4418] Add support for ProtoKafkaSource (apache#6135)

- Adds PROTO to Source.SourceType enum.
- Handles PROTO type in SourceFormatAdapter by converting to Avro from proto Message objects. 
   Conversion to Row goes Proto -> Avro -> Row currently.
- Added ProtoClassBasedSchemaProvider to generate schemas for a proto class that is currently on the classpath.
- Added ProtoKafkaSource which parses byte[] into a class that is on the path.
- Added ProtoConversionUtil which exposes methods for creating schemas and 
   translating from Proto messages to Avro GenericRecords.
- Added KafkaSource which provides a base class for the other Kafka sources to use.

* [HUDI-4642] Adding support to hudi-cli to repair deprecated partition (apache#6438)

* [HUDI-4751] Fix owner instants for transaction manager api callers (apache#6549)

* [HUDI-4739] Wrong value returned when key's length equals 1 (apache#6539)

* extracts key fields

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4528] Add diff tool to compare commit metadata (apache#6485)

* Add diff tool to compare commit metadata
* Add partition level info to commits and compaction command
* Partition support for compaction archived timeline
* Add diff command test

* [HUDI-4648] Support rename partition through CLI (apache#6569)

* [HUDI-4775] Fixing incremental source for MOR table (apache#6587)

* Fixing incremental source for MOR table

* Remove unused import

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [HUDI-4694] Print testcase running time for CI jobs (apache#6586)

* [RFC] Claim RFC-62 for Diagnostic Reporter (apache#6599)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [minor] following HUDI-4739, fix the extraction for simple record keys (apache#6594)

* [HUDI-4619] Add a remote request retry mechanism for 'Remotehoodietablefilesystemview'. (apache#6393)

* [HUDI-4720] Fix HoodieInternalRow returning wrong number of fields when source does not contain meta fields (apache#6500)

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>

* [HUDI-4389] Make HoodieStreamingSink idempotent (apache#6098)

* Support checkpoint and idempotent writes in HoodieStreamingSink

- Use batchId as the checkpoint key and add to commit metadata
- Support multi-writer for checkpoint data model

* Walk back previous commits until checkpoint is found

* Handle delete operation and fix test

* [MINOR] Remove redundant braces (apache#6604)

* [HUDI-4618] Separate log word for CommitUtils class (apache#6392)

* [HUDI-4776] Fix merge into use unresolved assignment (apache#6589)

* [HUDI-4795] Fix KryoException when bulk inserting into a non-bucket-index hudi table

Co-authored-by: hbg <bingeng.huang@shopee.com>

* [HUDI-4615] Return checkpoint as null for empty data from events queue.  (apache#6387)


Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4782] Support TIMESTAMP_LTZ type for flink (apache#6607)

* [HUDI-4731] Shutdown CloudWatch reporter when query completes (apache#6468)

* [HUDI-4793] Fixing ScalaTest tests to properly respect Log4j2 configs (apache#6617)

* [HUDI-4766] Strengthen flink clustering job (apache#6566)

* Allow rollbacks if required during clustering
* Allow size to be defined in Long instead of Integer
* Fix bug where clustering will produce files of 120MB in the same filegroup
* Added clean task
* Fix scheduling config to be consistent with that with compaction
* Fix filter mode getting ignored issue
* Add --instant-time parameter
* Prevent no execute() calls exception from being thrown (clustering & compaction)

* Apply upstream changes

* Fix compilation issues

* Fix checkstyle

Signed-off-by: LinMingQiang <1356469429@qq.com>
Signed-off-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: miomiocat <284487410@qq.com>
Co-authored-by: RexAn <bonean131@gmail.com>
Co-authored-by: JerryYue-M <272614347@qq.com>
Co-authored-by: jerryyue <jerryyue@didiglobal.com>
Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com>
Co-authored-by: superche <73096722+hechao-ustc@users.noreply.github.com>
Co-authored-by: superche <superche@tencent.com>
Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: voonhous <voonhousu@gmail.com>
Co-authored-by: voonhou.su <voonhou.su@shopee.com>
Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com>
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: xi chaomin <36392121+xicm@users.noreply.github.com>
Co-authored-by: xicm <xicm@asiainfo.com>
Co-authored-by: ForwardXu <forwardxu315@gmail.com>
Co-authored-by: 董可伦 <dongkelun01@inspur.com>
Co-authored-by: shenjiayu17 <54424149+shenjiayu17@users.noreply.github.com>
Co-authored-by: Lanyuanxiaoyao <lanyuanxiaoyao@gmail.com>
Co-authored-by: KnightChess <981159963@qq.com>
Co-authored-by: 苏承祥 <scx_white@aliyun.com>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <kumudkumartirupati@users.noreply.github.com>
Co-authored-by: xiarixiaoyao <mengtao0326@qq.com>
Co-authored-by: liujinhui <965147871@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 冯健 <fengjian428@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: Luning (Lucas) Wang <rsl4@foxmail.com>
Co-authored-by: Yann Byron <biyan900116@gmail.com>
Co-authored-by: Tim Brown <tim.brown126@gmail.com>
Co-authored-by: simonsssu <barley0806@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: Bo Cui <cuibo0108@163.com>
Co-authored-by: Rahil Chertara <rchertar@amazon.com>
Co-authored-by: wenningd <wenningding95@gmail.com>
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com>
Co-authored-by: Ryan Pifer <rmpifer@umich.edu>
Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: simonssu <simonssu@tencent.com>
Co-authored-by: Vander <30547463+vanderzh@users.noreply.github.com>
Co-authored-by: Tim Brown <tim@onehouse.ai>
Co-authored-by: Dongwook Kwon <dongwook@amazon.com>
Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com>
Co-authored-by: Shawn Chang <yxchang@amazon.com>
Co-authored-by: 5herhom <35916131+5herhom@users.noreply.github.com>
Co-authored-by: 吴祥平 <408317717@qq.com>
Co-authored-by: root <root@TCN1004532-1.tcent.cn>
Co-authored-by: F7753 <mabiaocas@gmail.com>
Co-authored-by: lewinma <lewinma@tencent.com>
Co-authored-by: shaoxiong.zhan <31836510+microbearz@users.noreply.github.com>
Co-authored-by: Nicholas Jiang <programgeek@163.com>
Co-authored-by: Yonghua Jian_deepnova <47289660@qq.com>
Co-authored-by: leesf <490081539@qq.com>
Co-authored-by: 5herhom <543872547@qq.com>
Co-authored-by: RexXiong <lvshuang.tb@gmail.com>
Co-authored-by: BruceLin <brucekellan@gmail.com>
Co-authored-by: Pratyaksh Sharma <pratyaksh13@gmail.com>
Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com>
Co-authored-by: 吴文池 <wuwenchi@deepexi.com>
Co-authored-by: jian yonghua <jianyonghua@163.com>
Co-authored-by: Xinyao Tian (Richard) <31195026+XinyaoTian@users.noreply.github.com>
Co-authored-by: XinyaoTian <leontian1024@gmail.com>
Co-authored-by: zhanshaoxiong <shaoxiong0001@gmail.com>
Co-authored-by: vamshigv <107005799+vamshigv@users.noreply.github.com>
Co-authored-by: feiyang_deepnova <736320652@qq.com>
Co-authored-by: linfey <linfey2021@gmail.com>
Co-authored-by: novisfff <62633257+novisfff@users.noreply.github.com>
Co-authored-by: Qi Ji <qjqqyy@users.noreply.github.com>
Co-authored-by: hehuiyuan <471627698@qq.com>
Co-authored-by: Zouxxyy <zouxxyy@qq.com>
Co-authored-by: Teng <teng_huo@outlook.com>
Co-authored-by: leandro-rouberte <37634317+leandro-rouberte@users.noreply.github.com>
Co-authored-by: Jon Vexler <jbvexler@gmail.com>
Co-authored-by: smilecrazy <smilecrazy1h@gmail.com>
Co-authored-by: xxhua <xxhua@freewheel.tv>
Co-authored-by: komao <masterwangzx@gmail.com>
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
Co-authored-by: felixYyu <felix2003@live.cn>
Co-authored-by: Bingeng Huang <304979636@qq.com>
Co-authored-by: hbg <bingeng.huang@shopee.com>
Co-authored-by: Vinish Reddy <vinishreddygunner17@gmail.com>
Co-authored-by: junyuc25 <10862251+junyuc25@users.noreply.github.com>
Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
vinishjail97 added a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
* [HUDI-4354] Add --force-empty-sync flag to deltastreamer (apache#6027)

* [HUDI-4601] Read error from MOR table after compaction with timestamp partitioning (apache#6365)

* read error from mor after compaction

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [MINOR] Update DOAP with 0.12.0 Release (apache#6413)

* [HUDI-4529] Tweak some default config options for flink (apache#6287)

* [HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)

* [HUDI-4551] Tweak the default parallelism of flink pipeline to execution env  parallelism (apache#6312)

* [MINOR] Improve code style of CLI Command classes (apache#6427)

* [HUDI-3625] Claim RFC-60 for Federated Storage Layer (apache#6440)

* [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar (apache#6386)

- Adding PulsarSource to DeltaStreamer to support ingesting from Apache Pulsar.
- Current implementation of PulsarSource is relying on "pulsar-spark-connector" to ingest using Spark instead of building similar pipeline from scratch.

* [HUDI-3579] Add timeline commands in hudi-cli (apache#5139)

* [HUDI-4638] Rename payload clazz and preCombine field options for flink sql (apache#6434)

* Revert "[HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)" (apache#6449)

This reverts commit 9055b2f.

* [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set (apache#6443)

* [HUDI-4644] Change default flink profile to 1.15.x (apache#6445)

* [HUDI-4797] fix merge into table for source table with different column order (apache#6620)

Co-authored-by: zhanshaoxiong <shaoxiong0001@gmail.com>

* [MINOR] Typo fix for kryo in flink-bundle (apache#6639)

* [HUDI-4811] Fix the checkstyle of hudi flink (apache#6633)

* [HUDI-4465] Optimizing file-listing sequence of Metadata Table (apache#6016)

Optimizes file-listing sequence of the Metadata Table to make sure it's on par or better than FS-based file-listing

Change log:

- Cleaned up avoidable instantiations of Hadoop's Path
- Replaced new Path w/ createUnsafePath where possible
- Cached TimestampFormatter, DateFormatter for timezone
- Avoid loading defaults in Hadoop conf when init-ing HFile reader
- Avoid re-instantiating BaseTableMetadata twice w/in BaseHoodieTableFileIndex
- Avoid looking up FileSystem for every partition when listing partitioned table, instead do it just once

* [HUDI-4807] Use base table instant for metadata initialization (apache#6629)

* [HUDI-3453] Fix HoodieBackedTableMetadata concurrent reading issue (apache#5091)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [HUDI-4518] Add unit test for reentrant lock in diff lockProvider (apache#6624)

* [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath (apache#6631)

Downgrading all of the log4j2 deps to "provided" scope, since these are not API modules (as advertised), but rather fully-fledged implementations adding dependency on other modules (like log4j2 in the case of "log4j-1.2-api")

* [HUDI-4826] Update RemoteHoodieTableFileSystemView to allow base path in UTF-8 (apache#6544)

* [HUDI-4763] Allow hoodie read client to choose index (apache#6506)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [DOCS] Fix Slack invite link in README.md (apache#6648)

* [HUDI-3558] Consistent bucket index: bucket resizing (split&merge) & concurrent write during resizing (apache#4958)

RFC-42 implementation
- Implement bucket resizing for consistent hashing index.
- Support concurrent write during bucket resizing.

This change added tests and can be verified as follows:
- The test of the consistent bucket index is enhanced to include the case of bucket resizing.
- Tests of different bucket resizing cases.
- Tests of concurrent resizing, and concurrent writes during resizing.

* [MINOR] Add dev setup and spark 3.3 profile to readme (apache#6656)

* [HUDI-4831] Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invoke correct api (apache#6637)

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

* [HUDI-4806] Use Avro version from the root pom for Flink bundle (apache#6628)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4833] Add Postgres Schema Name to Postgres Debezium Source (apache#6616)

* [HUDI-4825] Remove redundant fields in serialized commit metadata in JSON (apache#6646)

* [MINOR] Insert should call validateInsertSchema in HoodieFlinkWriteClient (apache#5919)

Co-authored-by: 徐帅 <xushuai@MacBook-Pro-6.local>

* [HUDI-3879] Suppress exceptions that are not fatal in HoodieMetadataTableValidator (apache#5344)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning (apache#5478)

- The last completed commit timestamp is used to calculate how many commits have been completed since the last clean. We might need to save this with the clean plan so that the next time clean is triggered, we can start calculating from that.

* [HUDI-3994] - Added support for initializing DeltaStreamer without a defined Spark Master (apache#5630)

This enables the usage of DeltaStreamer in environments such
as AWS Glue or other serverless environments where the Spark master is
inherited and we do not have access to it.

Co-authored-by: Angel Conde Manjon <acmanjon@amazon.com>

* [HUDI-4628] Hudi-flink support GLOBAL_BLOOM, GLOBAL_SIMPLE, BUCKET index types (apache#6406)

Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>

* [HUDI-4814] Schedules new clustering plan based on latest clustering instant (apache#6574)

* Keep a clustering running at the same time
* Simplify filtering logic

Co-authored-by: dongsj <dongsj@asiainfo.com>

* [HUDI-4817] Delete markers after full-record bootstrap operation (apache#6667)

* [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module (apache#6550)

As part of adding support for Spark 3.3 in Hudi 0.12, a lot of the logic 
from Spark 3.2 module has been simply copied over.

This PR is rectifying that by:
1. Creating new module "hudi-spark3.2plus-common" 
    (that is shared across Spark 3.2 and Spark 3.3)
2. Moving shared components under "hudi-spark3.2plus-common"

* [HUDI-4752] Add dedup support for MOR table in cli (apache#6608)

* [HUDI-4837] Stop sleeping where it is not necessary after the success (apache#6270)

Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4843] Delete the useless timer in BaseRollbackActionExecutor (apache#6671)

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4780] hoodie.logfile.max.size does not take effect, causing the log file to be too large (apache#6602)

* hoodie.logfile.max.size does not take effect, causing the log file to be too large

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-4844] Skip partition value resolving when the field does not exist for MergeOnReadInputFormat#getReader (apache#6678)

* [MINOR] Fix the Spark job status description for metadata-only bootstrap operation (apache#6666)

* [HUDI-3403] Ensure keygen props are set for bootstrap (apache#6645)

* [HUDI-4193] Upgrade Protobuf to 3.21.5 (apache#5784)

* [HUDI-4785] Fix partition discovery in bootstrap operation (apache#6673)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type (apache#6486)

InternalSchemaChangeApplier#applyAddChange forgets to remove the parent name when calling ColumnAddChange#addColumns

* [HUDI-4851] Fixing CSI not handling `InSet` operator properly (apache#6685)

* [HUDI-4796] MetricsReporter stop bug (apache#6619)

* [HUDI-3861] update tblp 'path' when rename table (apache#5320)

* [HUDI-4853] Get field by name for OverwriteNonDefaultsWithLatestAvroPayload to avoid schema mismatch (apache#6689)

* [HUDI-4813] Fix inferred keygen not working on the sparksql side (apache#6634)

* [HUDI-4813] Fix inferred keygen not working on the sparksql side

Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>

* [HUDI-4856] Missing option for HoodieCatalogFactory (apache#6693)

* [HUDI-4864] Fix AWSDmsAvroPayload#combineAndGetUpdateValue when using MOR snapshot query after delete operations with test (apache#6688)

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

* [HUDI-4841] Fix sort idempotency issue (apache#6669)

* [HUDI-4865] Optimize HoodieAvroUtils#isMetadataField to use O(1) complexity (apache#6702)

* [HUDI-4736] Fix inflight clean action preventing clean service to continue when multiple cleans are not allowed (apache#6536)

* [HUDI-4842] Support compaction strategy based on delta log file num (apache#6670)

Co-authored-by: 苏承祥 <sucx@tuya.com>

* [HUDI-4282] Repair IOException in CHDFS when checking block corruption in HoodieLogFileReader (apache#6031)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4757] Create pyspark examples (apache#6672)

* [HUDI-3959] Rename class name for spark rdd reader (apache#5409)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4828] Fix the extraction of record keys which may be cut out (apache#6650)

Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4873] Report number of messages to be processed via metrics (apache#6271)

Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4870] Improve compaction config description (apache#6706)

* [HUDI-3304] Support partial update payload (apache#4676)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo… (apache#6630)

* [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in log file issue

Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>

* [HUDI-4485] Bump spring shell to 2.1.1 in CLI (apache#6489)

Bumped spring shell to 2.1.1 and updated the default 
value for the `show fsview all` `pathRegex` parameter.

* [minor] following 3304, some code refactoring (apache#6713)

* [HUDI-4832] Fix drop partition meta sync (apache#6662)

* [HUDI-4810] Fix log4j imports to use bridge API  (apache#6710)


Co-authored-by: dongsj <dongsj@asiainfo.com>

* [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not working correctly (apache#6717)


Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>

* [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool (apache#5920)

- This pull request fixes [SUPPORT] Hudi spark datasource error after migrating from 0.8 to 0.11 (apache#5861)
- The issue is caused by the table SerDeInfo going missing after changing the table to a spark data source table.

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [MINOR] fix indent to make build pass (apache#6721)

* [HUDI-3478] Implement CDC Write in Spark (apache#6697)

* [HUDI-4326] Fix hive sync serde properties (apache#6722)

* [HUDI-4875] Fix NoSuchTableException when dropping temporary view after applied HoodieSparkSessionExtension in Spark 3.2 (apache#6709)

* [DOCS] Improve the quick start guide for Kafka Connect Sink (apache#6708)

* [HUDI-4729] Fix file group pending compaction cannot be queried when query _ro table (apache#6516)

File groups in pending compaction cannot be queried 
when querying the _ro table with spark. This commit fixes that.

Co-authored-by: zhanshaoxiong <shaoxiong0001@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [HUDI-3983] Fix ClassNotFoundException when using hudi-spark-bundle to write table with hbase index (apache#6715)

* [HUDI-4758] Add validations to java spark examples (apache#6615)

* [HUDI-4792] Batch clean files to delete (apache#6580)

This patch uses a batch call to get the file groups to delete during cleaning, instead of one call per partition.
This limits the number of calls to the view and should fix issues with the metadata table in the context of many partitions.
Fixes issue apache#6373

Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4363] Support Clustering row writer to improve performance (apache#6046)

* [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data (apache#6734)

* [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator (apache#6739)


Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-3901] Correct the description of hoodie.index.type (apache#6749)

* [MINOR] Add .mvn directory to gitignore (apache#6746)

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

* add support for unraveling proto schemas

* fix some compile issues

* [HUDI-4901] Add avro.version to Flink profiles (apache#6757)

* Add avro.version to Flink profiles

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4559] Support hiveSync command based on Call Produce Command (apache#6322)

* [HUDI-4883] Supporting delete savepoint for MOR (apache#6744)

Users can delete unnecessary savepoints 
and unblock archival for MOR tables.

* [HUDI-4897] Refactor the merge handle in CDC mode (apache#6740)

* [HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031)

* Revert "[HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031)" (apache#6768)

This reverts commit 092375f.

* [HUDI-3523] Introduce AddPrimitiveColumnSchemaPostProcessor to support add new primitive column to the end of a schema (apache#6769)

* [HUDI-4903] Fix TestHoodieLogFormat's minor typo (apache#6762)

* [MINOR] Drastically reducing concurrency level (to avoid CI flakiness) (apache#6754)

* Update HoodieIndex.java

Fix a typo

* [HUDI-4906] Fix the local tests for hudi-flink (apache#6763)

* [HUDI-4899] Fixing compatibility w/ Spark 3.2.2 (apache#6755)

* [HUDI-4892] Fix hudi-spark3-bundle (apache#6735)

* [MINOR] Fix a few typos in HoodieIndex (apache#6784)

Co-authored-by: xingjunwang <xingjunwang@tencent.com>

* [HUDI-4412] Fix multi writer INSERT_OVERWRITE NPE bug (apache#6130)

There are two minor issues fixed here:

1. When the insert_overwrite operation is performed, the 
    clusteringPlan in the requestedReplaceMetadata will be 
    null. Calling getFileIdsFromRequestedReplaceMetadata will cause an NPE.

2. For the insert_overwrite operation, when inflightCommitMetadata != null, 
    getOperationType should be obtained from getHoodieInflightReplaceMetadata;
    the original code hits a null pointer.

* [MINOR] retain avro's namespace (apache#6783)

* [MINOR] Simple logging fix in LockManager (apache#6765)

Co-authored-by: 苏承祥 <sucx@tuya.com>

* [HUDI-4433] hudi-cli repair deduplicate not working with non-partitioned dataset (apache#6349)

When using the repair deduplicate command with hudi-cli,
there is no way to run it on a non-partitioned dataset,
so the CLI parameter is modified accordingly.

Co-authored-by: Xingjun Wang <wongxingjun@126.com>

* [RFC-51][HUDI-3478] Update RFC: CDC support (apache#6256)

* [HUDI-4915] improve avro serializer/deserializer (apache#6788)

* [HUDI-3478] Implement CDC Read in Spark (apache#6727)

* naming and style updates

* [HUDI-4830] Fix testNoGlobalConfFileConfigured when add hudi-defaults.conf in default dir (apache#6652)

* make test data random, reuse code

* [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering (apache#6561)

- In clustering, data file creations were triggered twice: the write status is not cached, and a validation step calls isEmpty on the JavaRDD, which re-triggers the action. This patch fixes the double de-referencing.

* [HUDI-4914] Managed memory weight should be set when sort clustering is enabled (apache#6792)

* [HUDI-4910] Fix unknown variable or type "Cast" (apache#6778)

* [HUDI-4918] Fix NullPointerException when trying to show a non-existing key from env (apache#6794)

* [HUDI-4718] Add Kerberos kinit command support. (apache#6719)

* add test for 2 different recursion depths, fix schema cache key

* add unsigned long support

* better handle other types

* rebase on 4904

* get all tests working

* fix oneof expected schema, update tests after rebase

* [HUDI-4902] Set default partitioner for SIMPLE BUCKET index (apache#6759)

* [MINOR] Update PR template with documentation update (apache#6748)

* revert scala binary change

* try a different method to avoid avro version

* [HUDI-4904] Add support for unraveling proto schemas in ProtoClassBasedSchemaProvider (apache#6761)

If a user provides a recursive proto schema, writing to parquet will fail. We need to allow the user to specify how many levels of recursion they want before truncating the remaining data.

Main changes to existing code:

1. ProtoClassBasedSchemaProvider tracks the number of times a message descriptor is seen within a branch of the schema traversal.
2. Once that count exceeds the user-provided limit, the field is set to a preset record containing two fields: 1) the remaining data serialized as a proto byte array, and 2) the descriptor's full name, for context about what is in that byte array.
3. Converting from a proto to Avro now accounts for this truncation of the input.

* delete unused file

* [HUDI-4907] Prevent single commit multi instant issue (apache#6766)


Co-authored-by: TengHuo <teng_huo@outlook.com>
Co-authored-by: yuzhao.cyz <yuzhao.cyz@gmail.com>

* [HUDI-4923] Fix flaky TestHoodieReadClient.testReadFilterExistAfterBulkInsertPrepped (apache#6801)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4848] Fixing repair deprecated partition tool (apache#6731)

* [HUDI-4913] Fix HoodieSnapshotExporter for writing to a different S3 bucket or FS (apache#6785)

* address PR feedback, update decimal precision

* fix isNullable issue, check if class is Int64value

* checkstyle fix

* change wrapper descriptor set initialization

* add in testing for unsigned long to BigInteger conversion

* [HUDI-4453] Fix schema to include partition columns in bootstrap operation (apache#6676)

Turn off the type inference of the partition column to be consistent with 
existing behavior. Add notes around partition column type inference.

* [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data (apache#4015)


Co-authored-by: huangjing02 <huangjing02@bilibili.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4924] Auto-tune dedup parallelism (apache#6802)

* [HUDI-4687] Avoid setAccessible which breaks strong encapsulation (apache#6657)

Use JOL GraphLayout for estimating deep size.

* [MINOR] fixing validate async operations to poll completed clean instances (apache#6814)

* [HUDI-4734] Deltastreamer table config change validation (apache#6753)


Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4934] Revert batch clean files (apache#6813)

* Revert "[HUDI-4792] Batch clean files to delete (apache#6580)"

This reverts commit cbf9b83.

* [HUDI-4722] Added locking metrics for Hudi (apache#6502)

* [HUDI-4936] Fix `as.of.instant` not recognized as hoodie config (apache#5616)


Co-authored-by: leon <leon@leondeMacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4861] Relaxing `MERGE INTO` constraints to permit limited casting operations w/in matched-on conditions (apache#6820)

* [HUDI-4885] Adding org.apache.avro to hudi-hive-sync bundle (apache#6729)

* [HUDI-4951] Fix incorrect use of Long.getLong() (apache#6828)

* [MINOR] Use base path URI in ITTestDataStreamWrite (apache#6826)

* [HUDI-4308] READ_OPTIMIZED read mode will temporary loss of data when compaction (apache#6664)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4237] Fixing empty partition-values being sync'd to HMS (apache#6821)

Co-authored-by: dujunling <dujunling@bytedance.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (apache#6355)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4850] Add incremental source from GCS to Hudi (apache#6665)

Adds an incremental source from GCS based on a similar design 
as https://hudi.apache.org/blog/2021/08/23/s3-events-source

* [HUDI-4957] Shade JOL in bundles to fix NoClassDefFoundError:GraphLayout (apache#6839)

* [HUDI-4718] Add Kerberos kdestroy command support (apache#6810)

* [HUDI-4916] Implement change log feed for Flink (apache#6840)

* [HUDI-4769] Option read.streaming.skip_compaction skips delta commit (apache#6848)

* [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row (apache#6805)

* [HUDI-4966] Add a partition extractor to handle partition values with slashes (apache#6851)

* [MINOR] Fix testUpdateRejectForClustering (apache#6852)

* [HUDI-4962] Move cloud dependencies to cloud modules (apache#6846)

* [HOTFIX] Fix source release validate script (apache#6865)

* [HUDI-4980] Calculate avg record size using commit only (apache#6864)

Calculate average record size for the Spark upsert partitioner
based on commit instants only. Previously it was based on both
commit and replacecommit; the latter may be created by
clustering, which has inaccurately smaller average record
sizes, potentially resulting in OOM due to size underestimation.

* shade protobuf dependency

* Revert "[HUDI-4915] improve avro serializer/deserializer (apache#6788)" (apache#6809)

This reverts commit 79b3e2b.

* [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create (apache#6857)

* Enhancing README for multi-writer tests (apache#6870)

* [MINOR] Fix deploy script for flink 1.15 (apache#6872)

* [HUDI-4992] Fixing invalid min/max record key stats in Parquet metadata (apache#6883)

* Revert "shade protobuf dependency"

This reverts commit f03f961.

* [HUDI-4972] Fixes to make unit tests work on m1 mac (apache#6751)

* [HUDI-2786] Docker demo on mac aarch64 (apache#6859)

* [HUDI-4971] Fix shading kryo-shaded with reusing configs (apache#6873)

* [HUDI-3900] [UBER] Support log compaction action for MOR tables (apache#5958)

- Adding log compaction support to MOR tables. Subsequent log blocks can now be compacted into larger log blocks without needing a full compaction (merging with the base file).
- A new timeline action is introduced for this purpose.

Co-authored-by: sivabalan <n.siva.b@gmail.com>

* Relocate apache http package (apache#6874)

* [HUDI-4975] Fix datahub bundle dependency (apache#6896)

* [HUDI-4999] Refactor FlinkOptions#allOptions and CatalogOptions#allOptions (apache#6901)

* [MINOR] Update GitHub setting for merge button (apache#6922)

Only allow squash and merge. Disable merge and rebase

* [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool (apache#6885)

* [MINOR] Fix name spelling for RunBootstrapProcedure

* [HUDI-4754] Add compliance check in github actions (apache#6575)

* [HUDI-4963] Extend InProcessLockProvider to support multiple table ingestion (apache#6847)


Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>

* [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities (apache#6886)

* Implement Create/Drop/Show/Refresh Secondary Index (apache#5933)

* remove oss pr compliance

* different approach for shutdown all metrics instances

* remove flink testing, update metrics shutdown

Co-authored-by: Qi Ji <qjqqyy@users.noreply.github.com>
Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com>
Co-authored-by: 吴文池 <wuwenchi@deepexi.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com>
Co-authored-by: Nicholas Jiang <programgeek@163.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: 董可伦 <dongkelun01@inspur.com>
Co-authored-by: 冯健 <fengjian428@gmail.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: hehuiyuan <471627698@qq.com>
Co-authored-by: Zouxxyy <zouxxyy@qq.com>
Co-authored-by: Manu <36392121+xicm@users.noreply.github.com>
Co-authored-by: shaoxiong.zhan <31836510+microbearz@users.noreply.github.com>
Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
Co-authored-by: Yann Byron <biyan900116@gmail.com>
Co-authored-by: KnightChess <981159963@qq.com>
Co-authored-by: Teng <teng_huo@outlook.com>
Co-authored-by: leandro-rouberte <37634317+leandro-rouberte@users.noreply.github.com>
Co-authored-by: Jon Vexler <jbvexler@gmail.com>
Co-authored-by: smilecrazy <smilecrazy1h@gmail.com>
Co-authored-by: xxhua <xxhua@freewheel.tv>
Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com>
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: komao <masterwangzx@gmail.com>
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
Co-authored-by: felixYyu <felix2003@live.cn>
Co-authored-by: Bingeng Huang <304979636@qq.com>
Co-authored-by: hbg <bingeng.huang@shopee.com>
Co-authored-by: Vinish Reddy <vinishreddygunner17@gmail.com>
Co-authored-by: junyuc25 <10862251+junyuc25@users.noreply.github.com>
Co-authored-by: voonhous <voonhousu@gmail.com>
Co-authored-by: Xingcan Cui <xcui@wealthsimple.com>
Co-authored-by: Yuwei XIAO <ywxiaozero@gmail.com>
Co-authored-by: wangp-nhlab <95683046+wangp-nhlab@users.noreply.github.com>
Co-authored-by: Nicolas Paris <nicolas.paris@riseup.net>
Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com>
Co-authored-by: Rahil Chertara <rchertar@amazon.com>
Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com>
Co-authored-by: Shawn Chang <yxchang@amazon.com>
Co-authored-by: Abhishek Modi <modi@makenotion.com>
Co-authored-by: shuai.xu <chiggics@gmail.com>
Co-authored-by: 徐帅 <xushuai@MacBook-Pro-6.local>
Co-authored-by: Angel Conde <neuw84@gmail.com>
Co-authored-by: Angel Conde Manjon <acmanjon@amazon.com>
Co-authored-by: FocusComputing <xiaoxingstack@gmail.com>
Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>
Co-authored-by: eric9204 <90449228+eric9204@users.noreply.github.com>
Co-authored-by: dongsj <dongsj@asiainfo.com>
Co-authored-by: Volodymyr Burenin <vburenin@gmail.com>
Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: luokey <loukey.j@gmail.com>
Co-authored-by: Sylwester Lachiewicz <slachiewicz@apache.org>
Co-authored-by: 苏承祥 <scx_white@aliyun.com>
Co-authored-by: 苏承祥 <sucx@tuya.com>
Co-authored-by: 5herhom <543872547@qq.com>
Co-authored-by: Jon Vexler <jon@onehouse.ai>
Co-authored-by: simonsssu <barley0806@gmail.com>
Co-authored-by: y0908105023 <283999377@qq.com>
Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com>
Co-authored-by: Paul Zhang <xzhangyao@126.com>
Co-authored-by: Kyle Zhike Chen <zk.chan007@gmail.com>
Co-authored-by: dohongdayi <dohongdayi@126.com>
Co-authored-by: RexAn <bonean131@gmail.com>
Co-authored-by: ForwardXu <forwardxu315@gmail.com>
Co-authored-by: wangxianghu <wangxianghu@apache.org>
Co-authored-by: wulei <wulei.1023@bytedance.com>
Co-authored-by: Xingjun Wang <wongxingjun@126.com>
Co-authored-by: Prasanna Rajaperumal <prasanna.raj@live.com>
Co-authored-by: xingjunwang <xingjunwang@tencent.com>
Co-authored-by: liujinhui <965147871@qq.com>
Co-authored-by: ChanKyeong Won <brightwon.dev@gmail.com>
Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com>
Co-authored-by: hj2016 <hj3245459@163.com>
Co-authored-by: huangjing02 <huangjing02@bilibili.com>
Co-authored-by: jsbali <jsbali@uber.com>
Co-authored-by: Leon Tsao <31072303+gnailJC@users.noreply.github.com>
Co-authored-by: leon <leon@leondeMacBook-Pro.local>
Co-authored-by: 申胜利 <48829688+shenshengli@users.noreply.github.com>
Co-authored-by: aiden.dong <782112163@qq.com>
Co-authored-by: dujunling <dujunling@bytedance.com>
Co-authored-by: Pramod Biligiri <pramodbiligiri@gmail.com>
Co-authored-by: Zouxxyy <zouxinyu.zxy@alibaba-inc.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Co-authored-by: Surya Prasanna <syalla@uber.com>
Co-authored-by: Rajesh Mahindra <76502047+rmahindra123@users.noreply.github.com>
Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: huberylee <shibei.lh@foxmail.com>