[HUDI-5346][HUDI-5320] Fixing Create Table as Select (CTAS) performance gaps #7370
Conversation
Force-pushed from 319586a to b1c1b23, then from b1c1b23 to 6d9c8ae.
@@ -30,6 +30,10 @@ import org.apache.spark.util.MutablePair
 */
object HoodieUnsafeUtils {

  // TODO scala-doc
Let's add doc or remove the comment.
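If we keep it, the doc could look something like this (a hedged sketch; the exact method this hunk adds isn't visible, so the createDataFrameFrom helper seen later in this diff is assumed, with behavior inferred from its call site):

/**
 * Creates a DataFrame from the provided LogicalPlan
 * (behavior inferred from the call site below; assumed, not verbatim)
 *
 * @param spark active SparkSession
 * @param plan  logical plan to wrap into a DataFrame
 */
def createDataFrameFrom(spark: SparkSession, plan: LogicalPlan): DataFrame = ???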
@@ -85,7 +83,8 @@ case class CreateHoodieTableAsSelectCommand(
  val newTable = table.copy(
    identifier = tableIdentWithDB,
    storage = newStorage,
    schema = reOrderedQuery.schema,
    // TODO add meta-fields
Will this be taken up in the PR stacked on top of this one?
Just left a TODO to follow up later; not planned for this PR.
val metaFieldsStubs = metaFields.map(f => Alias(Literal(UTF8String.EMPTY_UTF8, dataType = StringType), f.name)())
val prependedQuery = Project(metaFieldsStubs ++ query.output, query)

HoodieUnsafeUtils.createDataFrameFrom(df.sparkSession, prependedQuery)
+1
Force-pushed from 6f10d56 to 13e942f.
@@ -92,11 +69,44 @@ object HoodieDatasetBulkInsertHelper extends Logging {

val updatedSchema = StructType(metaFields ++ schema.fields)

val updatedDF = if (populateMetaFields && config.shouldCombineBeforeInsert) {
  val dedupedRdd = dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField, SparkHoodieIndexFactory.isGlobalIndex(config))
val updatedDF = if (populateMetaFields) {
This code doesn't change -- it's simply moved around to avoid dereferencing the Dataset into an RDD when meta-fields are disabled (in that case we can add them as a simple Projection).
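To illustrate the restructuring, a hedged sketch condensed from this diff (the prependedRdd construction is elided and the createDataFrameFromRDD helper name is an assumption, not verbatim code):

val updatedDF = if (populateMetaFields) {
  // Meta-fields are enabled: values have to be stamped per record, so we
  // drop down to the RDD and (optionally) dedupe before re-wrapping
  val dedupedRdd =
    if (config.shouldCombineBeforeInsert)
      dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField,
        SparkHoodieIndexFactory.isGlobalIndex(config))
    else
      prependedRdd
  HoodieUnsafeUtils.createDataFrameFromRDD(df.sparkSession, dedupedRdd, updatedSchema) // assumed helper
} else {
  // Meta-fields disabled: no per-record values needed, so stay in the
  // logical-plan world and prepend empty stubs as a simple Projection
  val metaFieldsStubs = metaFields.map(f =>
    Alias(Literal(UTF8String.EMPTY_UTF8, dataType = StringType), f.name)())
  val prependedQuery = Project(metaFieldsStubs ++ query.output, query)
  HoodieUnsafeUtils.createDataFrameFrom(df.sparkSession, prependedQuery)
}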
storage = newStorage,
schema = reOrderedQuery.schema,
properties = table.properties.--(needFilterProps)
val updatedStorageFormat = table.storage.copy(
Simplifying existing code
LGTM with a few minor comments
val commitSeqNo = UTF8String.EMPTY_UTF8
val filename = UTF8String.EMPTY_UTF8

// TODO use mutable row, avoid re-allocating
nit: Create a JIRA ticket for this?
These are minor things that usually don't warrant a full-blown ticket (leaving these mostly for myself to update later)
If it's for you, could you add TODO(<name_handle>) so that we know you plan to take it up later?
Frankly, I'm not a big fan of putting names in TODOs, as it becomes a graveyard of shame plastering someone's name on all these TODOs. This TODO is up for grabs by anybody working in this area, but most of the time I pick up my own TODOs from before and address them in follow-ups. Does that make sense?
Makes sense. In general, we should limit TODOs to tiny stuff, as Jira tickets have better traceability and assignment.
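For reference, one shape the "use mutable row, avoid re-allocating" TODO above could take: reuse a single mutable meta-fields row plus Spark's JoinedRow per partition instead of allocating a fresh row per record. A hedged sketch (names and wiring are illustrative, not this PR's code):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow}
import org.apache.spark.unsafe.types.UTF8String

def prependMetaFields(iter: Iterator[InternalRow]): Iterator[InternalRow] = {
  // Allocated once per partition and mutated in place for every record
  val metaRow = new GenericInternalRow(5) // Hudi has 5 meta-fields
  val joined = new JoinedRow()
  iter.map { row =>
    // Stub values here; a real implementation would compute record key /
    // partition path per record via metaRow.update(i, ...)
    metaRow.update(0, UTF8String.EMPTY_UTF8) // _hoodie_commit_time
    joined(metaRow, row) // reused wrapper; must be consumed before next()
  }
}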
@@ -121,6 +120,7 @@ object HoodieSparkSqlWriter {
  }
  val tableType = HoodieTableType.valueOf(hoodieConfig.getString(TABLE_TYPE))
  var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
  // TODO clean up
nit: remove this?
See my comment above -- this one in particular we should clean up, as this conditional doesn't really make sense.
// NOTE: Users might be specifying write-configuration (inadvertently) as options or table properties
//       in CTAS, therefore we need to make sure that these are appropriately propagated to the
//       write operation
Can the write config be specified in a different way, with options, to avoid mixing write configs with table configs?
Either way you specify them, they'd end up in tableProperties.
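For example (table names here are illustrative; hoodie.metadata.enable is the metadata-table switch mentioned above), both of these CTAS spellings land in the command's table properties, which this PR now forwards to the writer:

spark.sql(
  """CREATE TABLE target USING hudi
    |TBLPROPERTIES ('hoodie.metadata.enable' = 'false')
    |AS SELECT * FROM source""".stripMargin)

spark.sql(
  """CREATE TABLE target USING hudi
    |OPTIONS ('hoodie.metadata.enable' = 'false')
    |AS SELECT * FROM source""".stripMargin)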
Makes sense to me now.
Force-pushed from 6c5c0a0 to ac07146.
Change Logs
This PR addresses some of the performance traps detected while stress-testing Spark SQL's Create Table as Select command:

- Avoids reordering of the columns w/in CTAS (there's no need for it, InsertIntoTableCommand will be resolving columns anyway)
- Fixes the validation sequence w/in InsertIntoTableCommand to first resolve the columns and then run validation (currently it's done the other way around)
- Propagates properties specified in CTAS to HoodieSparkSqlWriter (for example, currently there's no way to disable the metadata table when using CTAS precisely b/c these properties are not propagated)

Additionally, the following improvements to HoodieBulkInsertHelper were made:

- If meta-fields are disabled, we no longer dereference the incoming Dataset into an RDD and instead simply add stubbed-out meta-fields through an additional Projection
Impact
Should marginally improve performance of both Bulk Insert (row-writing) and CTAS
Risk level (write none, low, medium or high below)
Low
Documentation Update
N/A
Contributor's checklist