[HUDI-5346][HUDI-5320] Fixing Create Table as Select (CTAS) performance gaps #7370

Merged (12 commits) on Dec 9, 2022

Contributor

@alexeykudinkin alexeykudinkin commented Dec 2, 2022

Change Logs

This PR addresses several performance traps detected while stress-testing Spark SQL's Create Table as Select (CTAS) command:

  • Avoids reordering the columns within CTAS (there is no need for it, since InsertIntoTableCommand resolves the columns anyway)
  • Fixes the validation sequence within InsertIntoTableCommand to resolve the columns first and only then run validation (currently it is done the other way around)
  • Propagates properties specified in CTAS to HoodieSparkSqlWriter (for example, there is currently no way to disable MT when using CTAS, precisely because these properties are not propagated)
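
To illustrate why the order of the two steps matters, here is a minimal, Spark-free sketch. Every name in it (`Column`, `resolveColumns`, `validate`, `resolveThenValidate`) is hypothetical and is not Hudi's API; it only assumes that resolution aligns the query output to the table's columns by name, and that validation compares columns positionally.

```scala
object ResolveThenValidate {
  // Hypothetical stand-in for a column: a name plus a type label.
  final case class Column(name: String, dataType: String)

  // Step 1: align the query output to the table's column order, matching by name.
  def resolveColumns(table: Seq[Column], query: Seq[Column]): Either[String, Seq[Column]] = {
    val byName = query.map(c => c.name.toLowerCase -> c).toMap
    val missing = table.filterNot(c => byName.contains(c.name.toLowerCase))
    if (missing.nonEmpty) {
      val names = missing.map(_.name).mkString(", ")
      Left("Missing columns: " + names)
    } else {
      Right(table.map(c => byName(c.name.toLowerCase)))
    }
  }

  // Step 2: positional validation, which is only meaningful once columns are aligned.
  def validate(table: Seq[Column], resolved: Seq[Column]): Either[String, Seq[Column]] = {
    val mismatched = table.zip(resolved).collect {
      case (t, r) if t.dataType != r.dataType =>
        t.name + ": expected " + t.dataType + ", got " + r.dataType
    }
    if (mismatched.isEmpty) Right(resolved) else Left(mismatched.mkString("; "))
  }

  // Resolve first, then validate the resolved columns.
  def resolveThenValidate(table: Seq[Column], query: Seq[Column]): Either[String, Seq[Column]] =
    resolveColumns(table, query).flatMap(resolved => validate(table, resolved))

  def main(args: Array[String]): Unit = {
    val table = Seq(Column("id", "int"), Column("name", "string"))
    val query = Seq(Column("name", "string"), Column("id", "int")) // same columns, different order
    println(validate(table, query))            // validating before resolving
    println(resolveThenValidate(table, query)) // resolving before validating
  }
}
```

Under these assumptions, a query whose columns are merely in a different order fails when validated first, but passes once resolution runs first, so the caller does not need to reorder the columns itself.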

Additionally, the following improvement was made to HoodieBulkInsertHelper:

  • If meta-fields are disabled, we no longer dereference the incoming Dataset into an RDD; instead we simply add stubbed-out meta-fields through an additional Projection

Impact

Should marginally improve performance of both Bulk Insert (row-writing) and CTAS

Risk level (write none, low, medium or high below)

Low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added pr:wip Work in Progress/PRs priority:blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 5, 2022
@alexeykudinkin alexeykudinkin removed the pr:wip Work in Progress/PRs label Dec 7, 2022
@alexeykudinkin alexeykudinkin changed the title from "[WIP] Fixing Create Table as Select (CTAS) performance gaps" to "[MINOR] Fixing Create Table as Select (CTAS) performance gaps" on Dec 7, 2022
@@ -30,6 +30,10 @@ import org.apache.spark.util.MutablePair
*/
object HoodieUnsafeUtils {

// TODO scala-doc
Member

Let's add doc or remove the comment.

@@ -85,7 +83,8 @@ case class CreateHoodieTableAsSelectCommand(
val newTable = table.copy(
identifier = tableIdentWithDB,
storage = newStorage,
schema = reOrderedQuery.schema,
// TODO add meta-fields
Member

Will this be taken up in the PR stacked on top of this one?

Contributor Author

Just left a TODO to follow up on later; it is not planned for this PR

val metaFieldsStubs = metaFields.map(f => Alias(Literal(UTF8String.EMPTY_UTF8, dataType = StringType), f.name)())
val prependedQuery = Project(metaFieldsStubs ++ query.output, query)

HoodieUnsafeUtils.createDataFrameFrom(df.sparkSession, prependedQuery)
Member

+1
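
For readers less familiar with Catalyst internals, the projection in the fragment above has a rough analogue in Spark's public DataFrame API. The sketch below is an approximation under stated assumptions, not the PR's implementation: the helper name `prependStubbedMetaFields` is hypothetical, it uses `lit`/`col`/`select` instead of `Alias`/`Literal`/`Project` and `HoodieUnsafeUtils`, and it needs Spark on the classpath to run.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object MetaFieldStubs {
  // Names of Hudi's meta-field columns.
  val metaFieldNames: Seq[String] = Seq(
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name")

  // Hypothetical helper: prepend empty-string meta columns with a plain projection,
  // so the DataFrame never has to be converted to an RDD.
  // Assumes simple top-level column names (no dots or special characters).
  def prependStubbedMetaFields(df: DataFrame): DataFrame = {
    val stubs = metaFieldNames.map(name => lit("").as(name))
    val dataColumns = df.columns.toSeq.map(name => col(name))
    df.select(stubs ++ dataColumns: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("meta-field-stubs")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    prependStubbedMetaFields(df).printSchema()
    spark.stop()
  }
}
```

Because the stubs are ordinary literal expressions, the result remains a single logical plan that Spark can optimize as a whole.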

@alexeykudinkin alexeykudinkin requested a review from yihua December 7, 2022 18:41
@alexeykudinkin alexeykudinkin changed the title from "[MINOR] Fixing Create Table as Select (CTAS) performance gaps" to "[HUDI-5346] Fixing Create Table as Select (CTAS) performance gaps" on Dec 7, 2022
@@ -92,11 +69,44 @@ object HoodieDatasetBulkInsertHelper extends Logging {

val updatedSchema = StructType(metaFields ++ schema.fields)

val updatedDF = if (populateMetaFields && config.shouldCombineBeforeInsert) {
val dedupedRdd = dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField, SparkHoodieIndexFactory.isGlobalIndex(config))
val updatedDF = if (populateMetaFields) {
Contributor Author

This code doesn't change; it was simply moved around to avoid dereferencing the Dataset into an RDD when meta-fields are disabled (in that case we can add them with a simple Projection)

storage = newStorage,
schema = reOrderedQuery.schema,
properties = table.properties.--(needFilterProps)
val updatedStorageFormat = table.storage.copy(
Contributor Author

Simplifying existing code

Contributor

@yihua yihua left a comment

LGTM with a few minor comments

val commitSeqNo = UTF8String.EMPTY_UTF8
val filename = UTF8String.EMPTY_UTF8

// TODO use mutable row, avoid re-allocating
Contributor

nit: Create a JIRA ticket for this?

Contributor Author

These are minor things that usually don't warrant a full-blown ticket (leaving these mostly for myself to update later)

Contributor

If it's for you, could you add TODO(<name_handle>) so that we know you plan to take it up later?

Contributor Author

Frankly, I'm not a big fan of names in TODOs: it becomes a graveyard of shame, plastering someone's name on all these TODOs. This TODO is up for grabs by anybody working in this area, but most of the time I pick up my own earlier TODOs and address them in follow-ups. Does that make sense?

Contributor

Makes sense. In general, we should limit TODOs to tiny stuff, as Jira tickets have better traceability and assignment.
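
For context on the TODO that started this thread ("use mutable row, avoid re-allocating"): it presumably refers to the common pattern of reusing one buffer across records instead of allocating a new one per record. The following is a minimal, Spark-free sketch of that pattern; a plain `Array[Any]` is a hypothetical stand-in for a mutable row, and the helper names are made up for illustration.

```scala
object ReusedRowBuffer {
  // Baseline: allocates a fresh array for every record.
  def prependAllocating(rows: Iterator[Array[Any]], metaCount: Int): Iterator[Array[Any]] =
    rows.map(row => Array.fill[Any](metaCount)("") ++ row)

  // Reuses a single buffer for all records. Each returned element is the same
  // array instance, so it is only valid until next() is called again.
  def prependReusing(rows: Iterator[Array[Any]], metaCount: Int, dataCount: Int): Iterator[Array[Any]] = {
    // The first metaCount slots hold the stubbed (empty) meta-fields and are never rewritten.
    val buffer = Array.fill[Any](metaCount + dataCount)("")
    rows.map { row =>
      Array.copy(row, 0, buffer, metaCount, dataCount)
      buffer
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array[Any](1, "a"), Array[Any](2, "b"))
    prependReusing(rows.iterator, metaCount = 2, dataCount = 2)
      .foreach(row => println(row.mkString("[", ", ", "]")))
  }
}
```

The trade-off is the usual one: reuse avoids per-record allocation, but any consumer that buffers rows must copy them first.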

@@ -121,6 +120,7 @@ object HoodieSparkSqlWriter {
}
val tableType = HoodieTableType.valueOf(hoodieConfig.getString(TABLE_TYPE))
var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
// TODO clean up
Contributor

nit: remove this?

Contributor Author

@alexeykudinkin alexeykudinkin Dec 8, 2022

See my comment above -- this one in particular we should clean up as this conditional doesn't really make sense

Comment on lines +95 to +97
// NOTE: Users might be specifying write-configuration (inadvertently) as options or table properties
// in CTAS, therefore we need to make sure that these are appropriately propagated to the
// write operation
Contributor

Can the write config be specified in a different way with options to avoid mixing write configs with table configs?

Contributor Author

Whichever way you specify them, they end up in tableProperties

Contributor

Makes sense to me now.
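
As a rough illustration of the propagation discussed in this thread: since options and table properties both end up in tableProperties, write configuration can be forwarded by merging the relevant entries into the writer's options. The sketch below is Spark-free, and several things in it are assumptions made for illustration rather than the PR's actual logic: the helper name `writeOptionsFrom`, the use of the `hoodie.` key prefix to recognize write configs, and the rule that user-specified values override defaults.

```scala
object CtasWriteOptions {
  // Assumption: write configs are recognized by the "hoodie." key prefix.
  def writeOptionsFrom(tableProperties: Map[String, String],
                       defaults: Map[String, String]): Map[String, String] = {
    val userWriteConfigs = tableProperties.filter { case (key, _) => key.startsWith("hoodie.") }
    // On key collisions the right-hand map wins, so user values override defaults.
    defaults ++ userWriteConfigs
  }

  def main(args: Array[String]): Unit = {
    // Properties as they might arrive from a CTAS statement.
    val tableProperties = Map(
      "primaryKey" -> "id",
      "hoodie.metadata.enable" -> "false")
    // Defaults the write path would otherwise apply (illustrative values).
    val defaults = Map(
      "hoodie.metadata.enable" -> "true",
      "hoodie.datasource.write.operation" -> "bulk_insert")

    writeOptionsFrom(tableProperties, defaults).toSeq.sorted.foreach(println)
  }
}
```

With this merge order, a `hoodie.metadata.enable` value given in the CTAS statement reaches the writer instead of being silently replaced by the default.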

@alexeykudinkin alexeykudinkin changed the title from "[HUDI-5346] Fixing Create Table as Select (CTAS) performance gaps" to "[HUDI-5346][HUDI-5320] Fixing Create Table as Select (CTAS) performance gaps" on Dec 8, 2022
@hudi-bot

hudi-bot commented Dec 8, 2022

CI report:


@alexeykudinkin
Contributor Author

CI is green:

https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=13561&view=results

@alexeykudinkin alexeykudinkin merged commit 8de5357 into apache:master Dec 9, 2022
nsivabalan pushed a commit that referenced this pull request Dec 13, 2022
alexeykudinkin added a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
alexeykudinkin added a commit that referenced this pull request Dec 14, 2022
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
Labels
priority:blocker release-0.12.2 Patches targetted for 0.12.2
5 participants