forked from apache/hudi
Disable bootstrap precombine #1
Open
a49a wants to merge 2,284 commits into DTStack:master from a49a:disable-bootstrap-precombine
Conversation
…x type (apache#6406) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>
…instant (apache#6574) * Keep a clustering running at the same time * Simplify filtering logic Co-authored-by: dongsj <dongsj@asiainfo.com>
…e#6550) As part of adding support for Spark 3.3 in Hudi 0.12, a lot of the logic from the Spark 3.2 module was simply copied over. This PR rectifies that by: 1. Creating a new module "hudi-spark3.2plus-common" (shared across Spark 3.2 and Spark 3.3) 2. Moving the shared components under "hudi-spark3.2plus-common"
…apache#6270) Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…pache#6671) Co-authored-by: 吴文池 <wuwenchi@deepexi.com>
…the log file to be too large (apache#6602) * hoodie.logfile.max.size does not take effect, causing the log file to grow too large Co-authored-by: 854194341@qq.com <loukey_7821>
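As a hedged illustration of the write-path config this commit fixes (the table name, schema, path, and 256 MB cap below are placeholder assumptions, not from the commit):

```scala
// Spark shell sketch; `spark` is the session provided by spark-shell.
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Illustrative upsert batch; schema and values are assumptions.
val df = Seq((1, "a", 1000L)).toDF("id", "name", "ts")

df.write.format("hudi").
  option("hoodie.table.name", "demo_mor").                          // assumed name
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.logfile.max.size", (256L * 1024 * 1024).toString). // assumed 256 MB cap
  mode(SaveMode.Append).
  save("/tmp/hudi/demo_mor")                                        // assumed path
```

With the fix, log files should roll over once they reach the configured cap instead of growing unbounded.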
…ists for MergeOnReadInputFormat#getReader (apache#6678)
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…dd nest type (apache#6486) InternalSchemaChangeApplier#applyAddChange forgets to remove the parent name when calling ColumnAddChange#addColumns
…ayload to avoid schema mismatch (apache#6689)
…6634) * [HUDI-4813] Fix keygen inference not working on the Spark SQL side Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com>
… MOR snapshot query after delete operations with test (apache#6688) Co-authored-by: Rahil Chertara <rchertar@amazon.com>
…tinue when multiple cleans are not allowed (apache#6536)
…pache#6670) Co-authored-by: 苏承祥 <sucx@tuya.com>
… HoodieLogFileReader (apache#6031) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…pache#6650) Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…ache#6271) Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…nsert (apache#7396) This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
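For illustration, a minimal sketch of a bulk insert with NONE sort mode after this change (table name, schema, and path are placeholder assumptions):

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Illustrative rows; real pipelines would bring their own DataFrame.
val df = Seq((1, "a", 1000L), (2, "b", 2000L)).toDF("id", "name", "ts")

df.write.format("hudi").
  option("hoodie.table.name", "demo_bulk").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.bulkinsert.sort.mode", "NONE"). // input partitioning kept as-is, no coalesce
  mode(SaveMode.Append).
  save("/tmp/hudi/demo_bulk")
```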
If someone has enabled schema on read by mistake and never actually renamed or dropped a column, it should be feasible to disable schema on read. This patch fixes that: essentially, on both the read and write paths, if the "hoodie.schema.on.read.enable" config is not set, Hudi falls back to the regular code path. It might fail, or users might miss data, if they have performed any irrevocable changes like renames; but for the rest, this should work.
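A minimal sketch of the fallback described above, assuming a Spark shell session and a placeholder table path:

```scala
// With "hoodie.schema.on.read.enable" left unset, reads and writes take the
// regular (non schema-on-read) code path per this patch.
val df = spark.read.format("hudi").
  load("/tmp/hudi/demo_table") // assumed path

// Setting it to "false" explicitly should behave the same way (a reasonable
// reading of the fallback, not verified against the patch):
val df2 = spark.read.format("hudi").
  option("hoodie.schema.on.read.enable", "false").
  load("/tmp/hudi/demo_table")
```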
Before the patch, when there was a partial failover within the write tasks, the write task's current instant was initialized as the latest inflight instant; the write task then waited for a new instant to write with, so it hung and failed over continuously. For a task recovered from failover (with an attempt number greater than 0), the latest inflight instant can actually be reused, and the intermediate data files can be cleaned up with MARKER files post commit.
Make the spark3.3 profile upgrade from Spark 3.3.0 to 3.3.1 (HUDI-4871). Make the spark3.2 profile upgrade from Spark 3.2.1 to 3.2.3 (HUDI-4411).
…ig (apache#7069) Revert to FSUtils.getAllPartitionPaths to load partitions properly. Details in apache#6016 (comment). Only for 0.12.2, to keep behavior consistent across patch releases.
* [HUDI-5007] Prevent Hudi from reading the entire timeline when performing a LATEST streaming read (apache#6920) (cherry picked from commit 6baf733) * [HUDI-5228] Flink table service job fs view conf overwrites the one of the writing job (apache#7214) (cherry picked from commit dc5cc08) Co-authored-by: voonhous <voonhousu@gmail.com>
…ata (apache#7320) (apache#7462) Co-authored-by: just-JL <jiliang1993@gmail.com>
…pache#7464) * [HUDI-5366] Closing metadata writer from within writeClient (apache#7437) * Closing metadata writer from within writeClient * Close metadata writer in flink client Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> * Fixing build failure * Fixing flink metadata writer usages Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
…ark (apache#7399) (apache#7465) (cherry picked from commit 86d1e39)
… to HoodieROTablePathFilter (apache#7088) * Add the feature flag back to disable HoodieFileIndex and fall back to HoodieROTablePathFilter * Turn off hoodie.file.index.enable by default to test CI * Add tests for Spark datasource with the fallback to HoodieROTablePathFilter
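A hedged sketch of the fallback flag in use (the path is a placeholder assumption):

```scala
// Disabling the file index falls back to HoodieROTablePathFilter for listing.
val df = spark.read.format("hudi").
  option("hoodie.file.index.enable", "false").
  load("/tmp/hudi/demo_table") // assumed path
```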
…ableFileIndex (apache#7488) Currently, on the reader or query engine side, direct file listing on the file system is used by default, as indicated by HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS (=false). Without an explicit hoodie.metadata.enable config, metadata-table-based file listing is disabled. However, BaseHoodieTableFileIndex, the common File Index implementation used by the Trino Hive connector, does not respect this default. This leads to a query-latency regression in the Trino Hive connector, due to the way the connector is integrated with the Input Format and the File Index with the metadata table enabled.

This PR fixes BaseHoodieTableFileIndex to respect the behavior defined by HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS, i.e., metadata-table-based file listing is disabled by default. It is only enabled when hoodie.metadata.enable is set to true and the files partition of the metadata table is ready for read based on the Hudi table config.

Impact: this mitigates the query-latency regression in the Trino Hive connector and fixes the read-side file-listing behavior. Tested that, with this PR, HoodieParquetInputFormat no longer reads the metadata table for file listing by default. Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
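To make the fixed default concrete, a hedged read-side sketch (the path is a placeholder assumption): metadata-table-based listing stays off unless explicitly opted in.

```scala
// Default after this PR: direct file-system listing (metadata table off for readers).
val direct = spark.read.format("hudi").
  load("/tmp/hudi/demo_table")

// Explicit opt-in to metadata-table-based file listing.
val viaMetadata = spark.read.format("hudi").
  option("hoodie.metadata.enable", "true").
  load("/tmp/hudi/demo_table")
```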
apache#7493) - This PR falls back to the original code path using the fs view cache, as in 0.10.1 or earlier, instead of creating a file index. - Query engines using the initial InputFormat-based integration will not use the file index; instead they fetch file statuses directly from the fs view cache.
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.