[HUDI-5346][HUDI-5320] Fixing Create Table as Select (CTAS) performance gaps #7370
Conversation
Force-pushed from 319586a to b1c1b23, then from b1c1b23 to 6d9c8ae.
@@ -30,6 +30,10 @@ import org.apache.spark.util.MutablePair
 */
object HoodieUnsafeUtils {

  // TODO scala-doc
Let's add doc or remove the comment.
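If we keep it, the doc could look something like this (a hedged sketch; the exact method this hunk adds isn't visible, so the createDataFrameFrom helper seen later in this diff is assumed, with behavior inferred from its call site):

/**
 * Creates a DataFrame from the provided LogicalPlan
 * (behavior inferred from the call site below; assumed, not verbatim)
 *
 * @param spark active SparkSession
 * @param plan  logical plan to wrap into a DataFrame
 */
def createDataFrameFrom(spark: SparkSession, plan: LogicalPlan): DataFrame = ???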
@@ -85,7 +83,8 @@ case class CreateHoodieTableAsSelectCommand(
  val newTable = table.copy(
    identifier = tableIdentWithDB,
    storage = newStorage,
    schema = reOrderedQuery.schema,
    // TODO add meta-fields
Will this be taken up in the PR stacked on top of this one?
Just left a TODO to follow up later; not planned for this PR.
val metaFieldsStubs = metaFields.map(f => Alias(Literal(UTF8String.EMPTY_UTF8, dataType = StringType), f.name)())
val prependedQuery = Project(metaFieldsStubs ++ query.output, query)

HoodieUnsafeUtils.createDataFrameFrom(df.sparkSession, prependedQuery)
+1
Force-pushed from 6f10d56 to 13e942f.
@@ -92,11 +69,44 @@ object HoodieDatasetBulkInsertHelper extends Logging {

val updatedSchema = StructType(metaFields ++ schema.fields)

val updatedDF = if (populateMetaFields && config.shouldCombineBeforeInsert) {
  val dedupedRdd = dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField, SparkHoodieIndexFactory.isGlobalIndex(config))
val updatedDF = if (populateMetaFields) {
This code doesn't change -- it's simply moved around to avoid dereferencing the Dataset into an RDD when meta-fields are disabled (in that case we can add them as a simple Projection).
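To illustrate the restructuring, a hedged sketch condensed from this diff (the prependedRdd construction is elided and the createDataFrameFromRDD helper name is an assumption, not verbatim code):

val updatedDF = if (populateMetaFields) {
  // Meta-fields are enabled: values have to be stamped per record, so we
  // drop down to the RDD and (optionally) dedupe before re-wrapping
  val dedupedRdd =
    if (config.shouldCombineBeforeInsert)
      dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField,
        SparkHoodieIndexFactory.isGlobalIndex(config))
    else
      prependedRdd
  HoodieUnsafeUtils.createDataFrameFromRDD(df.sparkSession, dedupedRdd, updatedSchema) // assumed helper
} else {
  // Meta-fields disabled: no per-record values needed, so stay in the
  // logical-plan world and prepend empty stubs as a simple Projection
  val metaFieldsStubs = metaFields.map(f =>
    Alias(Literal(UTF8String.EMPTY_UTF8, dataType = StringType), f.name)())
  val prependedQuery = Project(metaFieldsStubs ++ query.output, query)
  HoodieUnsafeUtils.createDataFrameFrom(df.sparkSession, prependedQuery)
}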
storage = newStorage,
schema = reOrderedQuery.schema,
properties = table.properties.--(needFilterProps)
val updatedStorageFormat = table.storage.copy(
Simplifying existing code
LGTM with a few minor comments
val commitSeqNo = UTF8String.EMPTY_UTF8
val filename = UTF8String.EMPTY_UTF8

// TODO use mutable row, avoid re-allocating
nit: Create a JIRA ticket for this?
These are minor things that usually don't warrant a full-blown ticket (leaving these mostly for myself to update later)
If it's for you, could you add TODO(<name_handle>) so that we know you plan to take it up later?
Frankly, I'm not a big fan of putting names in TODOs, as it becomes a graveyard of shame plastering someone's name on all these TODOs. This TODO is up for grabs by anybody working in this area, but most of the time I pick up my own TODOs from before and address them in follow-ups. Does that make sense?
Makes sense. In general, we should limit TODOs to tiny stuff, as Jira tickets have better traceability and assignment.
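For reference, one shape the "use mutable row, avoid re-allocating" TODO above could take: reuse a single mutable meta-fields row plus Spark's JoinedRow per partition instead of allocating a fresh row per record. A hedged sketch (names and wiring are illustrative, not this PR's code):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow}
import org.apache.spark.unsafe.types.UTF8String

def prependMetaFields(iter: Iterator[InternalRow]): Iterator[InternalRow] = {
  // Allocated once per partition and mutated in place for every record
  val metaRow = new GenericInternalRow(5) // Hudi has 5 meta-fields
  val joined = new JoinedRow()
  iter.map { row =>
    // Stub values here; a real implementation would compute record key /
    // partition path per record via metaRow.update(i, ...)
    metaRow.update(0, UTF8String.EMPTY_UTF8) // _hoodie_commit_time
    joined(metaRow, row) // reused wrapper; must be consumed before next()
  }
}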
@@ -121,6 +120,7 @@ object HoodieSparkSqlWriter {
  }
  val tableType = HoodieTableType.valueOf(hoodieConfig.getString(TABLE_TYPE))
  var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
  // TODO clean up
nit: remove this?
See my comment above -- this one in particular we should clean up, as this conditional doesn't really make sense.
// NOTE: Users might be specifying write-configuration (inadvertently) as options or table properties
//       in CTAS, therefore we need to make sure that these are appropriately propagated to the
//       write operation
Can the write config be specified in a different way, with options, to avoid mixing write configs with table configs?
Either way you specify them, they'd end up in tableProperties.
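For example (table names here are illustrative; hoodie.metadata.enable is the metadata-table switch mentioned above), both of these CTAS spellings land in the command's table properties, which this PR now forwards to the writer:

spark.sql(
  """CREATE TABLE target USING hudi
    |TBLPROPERTIES ('hoodie.metadata.enable' = 'false')
    |AS SELECT * FROM source""".stripMargin)

spark.sql(
  """CREATE TABLE target USING hudi
    |OPTIONS ('hoodie.metadata.enable' = 'false')
    |AS SELECT * FROM source""".stripMargin)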
Makes sense to me now.
Force-pushed from 6c5c0a0 to ac07146.
Change Logs
This PR addresses some of the performance traps detected while stress-testing Spark SQL's Create Table as Select command:

- Avoids reordering of the columns w/in CTAS (there's no need for it, InsertIntoTableCommand will be resolving columns anyway)
- Fixes the validation sequence w/in InsertIntoTableCommand to first resolve the columns and then run validation (currently it's done the other way around)
- Propagates properties specified in CTAS to HoodieSparkSqlWriter (for example, currently there's no way to disable the metadata table when using CTAS precisely b/c these properties are not propagated)

Additionally, the following improvements to HoodieBulkInsertHelper were made:

- If meta-fields are disabled, we no longer dereference the incoming Dataset into an RDD and instead simply add stubbed-out meta-fields through an additional Projection
Impact
Should marginally improve performance of both Bulk Insert (row-writing) and CTAS
Risk level (write none, low, medium or high below)
Low
Documentation Update
N/A
Contributor's checklist