Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26159] Codegen for LocalTableScanExec and RDDScanExec #23127

Closed
wants to merge 9 commits into from

Conversation

juliuszsompolski
Copy link
Contributor

@juliuszsompolski juliuszsompolski commented Nov 23, 2018

What changes were proposed in this pull request?

Implement codegen for LocalTableScanExec and ExistingRDDExec. Refactor to share code between LocalTableScanExec, ExistingRDDExec, InputAdapter and RowDataSourceScanExec.

The difference in doProduce between these four was that ExistingRDDExec and RowDataSourceScanExec triggered adding an UnsafeProjection, while InputAdapter and LocalTableScanExec did not.

In the new trait InputRDDCodegen I added a flag createUnsafeProjection which the operators set accordingly.

Note: LocalTableScanExec explicitly creates its input as UnsafeRows, so it was obvious why it doesn't need an UnsafeProjection. But if an InputAdapter may take input that is InternalRows but not UnsafeRows, then I think it doesn't need an unsafe projection just because any other operator that is its parent would do that. That assumes that that any parent operator would always result in some UnsafeProjection being eventually added, and hence the output of the WholeStageCodegen unit would be UnsafeRows. If these assumptions hold, I think createUnsafeProjection could be set to (parent == null).

Note: Do not codegen LocalTableScanExec when it's the only operator. LocalTableScanExec has optimized driver-only executeCollect and executeTake code paths that are used to return Command results without starting Spark Jobs. They can no longer be used if the LocalTableScanExec gets optimized.

How was this patch tested?

Covered and used in existing tests.

@juliuszsompolski
Copy link
Contributor Author

cc @hvanhovell @rednaxelafx

@juliuszsompolski juliuszsompolski changed the title [SPARK-26159] Codegen for LocalTableScanExec and ExistingRDDExec [SPARK-26159] Codegen for LocalTableScanExec and RDDScanExec Nov 23, 2018
@SparkQA
Copy link

SparkQA commented Nov 23, 2018

Test build #99218 has finished for PR 23127 at commit 23c2d91.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait InputRDDCodegen extends CodegenSupport
  • case class InputAdapter(child: SparkPlan) extends UnaryExecNode with InputRDDCodegen

@hvanhovell
Copy link
Contributor

Looks good. One more higher level question that can also be addressed in a follow-up.

/**
* Helper default should stop check code.
*/
def shouldStopCheckCode: String = if (needStopCheck) {
Copy link
Contributor

@cloud-fan cloud-fan Nov 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use it in more places. This can be done in folllowup.

@cloud-fan
Copy link
Contributor

LGTM

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Nov 26, 2018

Test build #99258 has finished for PR 23127 at commit 23c2d91.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait InputRDDCodegen extends CodegenSupport
  • case class InputAdapter(child: SparkPlan) extends UnaryExecNode with InputRDDCodegen

@SparkQA
Copy link

SparkQA commented Nov 26, 2018

Test build #99259 has finished for PR 23127 at commit 23c2d91.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait InputRDDCodegen extends CodegenSupport
  • case class InputAdapter(child: SparkPlan) extends UnaryExecNode with InputRDDCodegen

@juliuszsompolski
Copy link
Contributor Author

juliuszsompolski commented Nov 26, 2018

@cloud-fan @rednaxelafx
Actually, the input to a codegen stage can be an internal row so I can't make the inputRDD be RDD[UnsafeRow], but the output needs to be UnsafeRow.
Doing it like InputAdapter did actually make it just pass-through output the internal row.
For InputAdapter, there always is some parent operator to consume it, and create an unsafe projection in whatever it does, and then the output UnsafeRows.
But for an RDDScanExec or RowDataSourceScanExec could be alone in a WholeStageCodegenExec, and then just doing ${consume(ctx, null, row)} made it pass-through output the InternalRow from input.

WDYT about how I patched it up?

@SparkQA
Copy link

SparkQA commented Nov 26, 2018

Test build #99275 has finished for PR 23127 at commit 170073e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2018

Test build #99286 has finished for PR 23127 at commit 5d94c42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

there are still 2 golden file test failures because of the plan change...

@juliuszsompolski
Copy link
Contributor Author

@cloud-fan Thanks. Actually, I had to revert earlier updates because the plan no longer changes for LocalTableScanExec that is alone in a wholestagecodegen.

@juliuszsompolski
Copy link
Contributor Author

Talked with @hvanhovell offline and set LocalTableScanExec and InputAdapter to not create an unsafe projection, and RDDScanExec and RowDataSourceScanExec to always do so, which replicates previous behaviour.

@SparkQA
Copy link

SparkQA commented Nov 27, 2018

Test build #99320 has finished for PR 23127 at commit c345e2f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 27, 2018

Test build #99321 has finished for PR 23127 at commit bdb71d7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski
Copy link
Contributor Author

jenkins retest this please

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - @cloud-fan can you take another look?

@SparkQA
Copy link

SparkQA commented Nov 27, 2018

Test build #99329 has finished for PR 23127 at commit bdb71d7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member

MaxGekk commented Nov 27, 2018

jenkins retest this please

@juliuszsompolski It won't help, you need to fix python tests it seems

Failed example:
    df.explain()
Differences (ndiff with -expected +actual):
      == Physical Plan ==
    - Scan ExistingRDD[age#0,name#1]
    + *(1) Scan ExistingRDD[age#0,name#1]
    ? +++++

@SparkQA
Copy link

SparkQA commented Nov 27, 2018

Test build #99339 has finished for PR 23127 at commit 3668f97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 8c68718 Nov 28, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Implement codegen for `LocalTableScanExec` and `ExistingRDDExec`. Refactor to share code between `LocalTableScanExec`, `ExistingRDDExec`, `InputAdapter` and `RowDataSourceScanExec`.

The difference in `doProduce` between these four was that `ExistingRDDExec` and `RowDataSourceScanExec` triggered adding an `UnsafeProjection`, while `InputAdapter` and `LocalTableScanExec` did not.

In the new trait `InputRDDCodegen` I added a flag `createUnsafeProjection` which the operators set accordingly.

Note: `LocalTableScanExec` explicitly creates its input as `UnsafeRows`, so it was obvious why it doesn't need an `UnsafeProjection`. But if an `InputAdapter` may take input that is `InternalRows` but not `UnsafeRows`, then I think it doesn't need an unsafe projection just because any other operator that is its parent would do that. That assumes that that any parent operator would always result in some `UnsafeProjection` being eventually added, and hence the output of the `WholeStageCodegen` unit would be `UnsafeRows`. If these assumptions hold, I think `createUnsafeProjection` could be set to `(parent == null)`.

Note: Do not codegen `LocalTableScanExec` when it's the only operator. `LocalTableScanExec` has optimized driver-only `executeCollect` and `executeTake` code paths that are used to return `Command` results without starting Spark Jobs. They can no longer be used if the `LocalTableScanExec` gets optimized.

## How was this patch tested?

Covered and used in existing tests.

Closes apache#23127 from juliuszsompolski/SPARK-26159.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants