[SPARK-26450][SQL] Avoid rebuilding map of schema for every column in projection #23392
Conversation
Test build #100481 has finished for PR 23392 at commit
retest this please
Test build #100483 has finished for PR 23392 at commit
Getting java.lang.NoClassDefFoundError: javax/jdo/JDOException when trying to instantiate HiveMetaStoreClient during HiveClientSuites. Common error... so I will try again.
retest this please
Test build #100484 has finished for PR 23392 at commit
retest this please
Test build #100490 has finished for PR 23392 at commit
retest this please
Test build #100503 has finished for PR 23392 at commit
retest this please
Test build #100507 has finished for PR 23392 at commit
- protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] =
-   in.map(BindReferences.bindReference(_, inputSchema))
+ protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
+   lazy val inputSchemaAttrSeq: AttributeSeq = inputSchema
Why the lazy val? Are you optimizing for the case where in is empty?
Yes, that is the reason. For example, the query df.count, where df is a DataFrame from a CSV datasource, calls GenerateUnsafeProjection.bind with an empty list of expressions. However, the map inside the AttributeSeq object is not built until someone accesses exprIdToOrdinal, so maybe it is overkill.
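A minimal, self-contained sketch of the idea under discussion (toy types and names, not Spark's actual classes): the lazy val defers wrapping the schema, and the lookup map itself is built at most once, and only if an expression is actually bound.

```scala
// Toy illustration only -- Attribute, AttributeSeq, and bind are simplified
// stand-ins for the Spark internals discussed above.
object LazyBindSketch {
  final case class Attribute(name: String)

  final class AttributeSeq(attrs: Seq[Attribute]) {
    // Built on first access only, mirroring the lazy exprIdToOrdinal map.
    lazy val nameToOrdinal: Map[String, Int] =
      attrs.iterator.map(_.name).zipWithIndex.toMap
  }

  def bind(in: Seq[String], inputSchema: Seq[Attribute]): Seq[Int] = {
    // The wrapper is only constructed if `in` is non-empty.
    lazy val schemaAttrSeq = new AttributeSeq(inputSchema)
    in.map(name => schemaAttrSeq.nameToOrdinal(name))
  }

  def main(args: Array[String]): Unit = {
    val schema = Seq(Attribute("id"), Attribute("value"))
    println(bind(Seq("value", "id"), schema)) // List(1, 0)
    println(bind(Nil, schema))                // List() -- nothing is built
  }
}
```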
What about changing the signature of bind instead? This would also help ensure that we don't miss this fix in other parts of the code, IMHO.
@mgaido91 Do you mean change it to this?:
bind(in: Seq[Expression], inputSchema: AttributeSeq): Seq[Expression]
yes
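For readers following along, a hedged sketch of why this signature change is painless at call sites (same toy types as the earlier sketch; in Spark, AttributeSeq is, to my understanding, an implicit class over Seq[Attribute], so existing callers passing a plain Seq still compile): the conversion, and therefore the map construction, happens once per bind call instead of once per expression.

```scala
// Toy sketch (hypothetical names): the implicit class gives Seq[Attribute]
// an automatic conversion to AttributeSeq at the call site.
object SignatureSketch {
  final case class Attribute(name: String)

  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
    lazy val nameToOrdinal: Map[String, Int] =
      attrs.iterator.map(_.name).zipWithIndex.toMap
  }

  // The parameter is now AttributeSeq; callers holding a Seq[Attribute]
  // are converted implicitly, exactly once per call.
  def bind(in: Seq[String], inputSchema: AttributeSeq): Seq[Int] =
    in.map(inputSchema.nameToOrdinal)

  def main(args: Array[String]): Unit = {
    val schema = Seq(Attribute("a"), Attribute("b"))
    println(bind(Seq("b", "a"), schema)) // List(1, 0)
  }
}
```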
+1 on something that eliminates this issue wholesale.
@bersprockets looks pretty good. Are there any other places where we should apply this? If there are, we should consider introducing a helper function.
@hvanhovell There are other places that use this pattern: InterpretedProjection, for instance. However, I don't know if any are in a "hot" path.
@hvanhovell For a helper function, if needed, I was thinking object BindReferences would be a good place for it (named toBoundExprs).
@@ -89,7 +89,8 @@ package object expressions {
   * A helper function to bind given expressions to an input schema.
   */
  def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
ditto
I am working to replace the dozen or so cases of this pattern.
Test build #100632 has finished for PR 23392 at commit
exprs.map(BindReferences.bindReference(_, inputSchema))
def toBoundExprs[A <: Expression](
    exprs: Seq[A],
    inputSchema: Seq[Attribute]): Seq[A] = {
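One thing worth spelling out about the new signature (a toy sketch with hypothetical types; the actual reference binding is elided): the type parameter A <: Expression means the helper returns the caller's concrete expression subtype, e.g. binding a Seq[SortOrder] yields a Seq[SortOrder] rather than a widened Seq[Expression].

```scala
// Toy types only -- Expression and SortOrder stand in for Catalyst's.
object GenericReturnSketch {
  trait Expression
  final case class SortOrder(child: String) extends Expression

  // A <: Expression preserves the element type through the transformation;
  // the binding step itself is elided in this sketch.
  def toBoundExprs[A <: Expression](exprs: Seq[A]): Seq[A] =
    exprs.map(e => e)

  def main(args: Array[String]): Unit = {
    val bound: Seq[SortOrder] = toBoundExprs(Seq(SortOrder("id")))
    println(bound) // List(SortOrder(id))
  }
}
```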
Why not just change this to AttributeSeq?
Sometimes this function is called with a zero-length exprs (df.count, for example). I am attempting to avoid constructing the AttributeSeq in that case, because AttributeSeq's constructor eagerly builds a data structure based on the attributes (private val qualified3Part).
How expensive is it to build those structures on empty data? We could consider caching an empty attribute seq and using that when there are no attributes.
I was thinking that just because exprs is empty, that would not necessarily mean inputSchema is empty. But experiments seem to indicate that inputSchema is also empty. So building the AttributeSeq would be extremely low-cost. I guess there is no reason to lazily build it.
@@ -393,7 +393,8 @@ case class SortMergeJoinExec(
      input: Seq[Attribute]): Seq[ExprCode] = {
    ctx.INPUT_ROW = row
    ctx.currentVars = null
-   keys.map(BindReferences.bindReference(_, input).genCode(ctx))
+   val inputAttributeSeq: AttributeSeq = input
+   keys.map(BindReferences.bindReference(_, inputAttributeSeq).genCode(ctx))
Why not toBoundExprs(keys, input).map(_.genCode(ctx))?
It should be...
LGTM, just one style comment
@@ -88,7 +88,9 @@ package object expressions {
  /**
   * A helper function to bind given expressions to an input schema.
   */
- def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
+ def toBoundExprs[A <: Expression](
+     exprs: Seq[A],
nit: indent
Test build #100645 has finished for PR 23392 at commit
Test build #100661 has finished for PR 23392 at commit
retest this please
@mgaido91 @hvanhovell I am looking at a small oddity with one of the benchmark cases (Orc 60 cols, 50M rows) that got introduced sometime after the initial commit. So if anyone felt inclined to merge this, please hold off for now. If there are more review comments, that's good.
Test build #100662 has finished for PR 23392 at commit
@bersprockets did you audit the entire codebase for this pattern? From a cursory search I could also see that …
@@ -88,7 +88,9 @@ package object expressions {
  /**
   * A helper function to bind given expressions to an input schema.
   */
- def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
+ def toBoundExprs[A <: Expression](
I would like to minimize the chance that future changes suffer from the same issue. In order to do that, we should provide the API in a logical place; it does not make a whole lot of sense to me that I need to look in package.scala to find a more performant version of BindReferences.bindReference(..) for a seq. Can we move this function to BindReferences and name it bindReferences?
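Pulling the thread together, the helper being asked for would look roughly like this inside object BindReferences (a sketch assembled from the diffs in this thread, not a verbatim quote of the merged code; the surrounding object and doc comment are assumed):

```scala
object BindReferences {
  // ... the existing bindReference[A <: Expression](expression: A,
  // input: AttributeSeq): A lives here already ...

  /** A helper function to bind given expressions to an input schema. */
  def bindReferences[A <: Expression](
      expressions: Seq[A],
      input: AttributeSeq): Seq[A] = {
    // `input` is converted (and its lookup map built) once at the call site;
    // every expression in the sequence then reuses the same AttributeSeq.
    expressions.map(BindReferences.bindReference(_, input))
  }
}
```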
Ahh.. good catch. My search did not include a parameter name, e.g. the x in x => BindReferences.bindReference(x, list).
- val aggResults = functions.map(_.evaluateExpression).map { e =>
-   BindReferences.bindReference(e, aggregateBufferAttributes).genCode(ctx)
- }
+ val aggResults = bindReferences(functions.map(_.evaluateExpression),
nit: move functions.map(_.evaluateExpression) to the next line
- val aggResults = declFunctions.map(_.evaluateExpression).map { e =>
-   BindReferences.bindReference(e, aggregateBufferAttributes).genCode(ctx)
- }
+ val aggResults = bindReferences(declFunctions.map(_.evaluateExpression),
nit: ditto
- val resultVars = resultExpressions.map { e =>
-   BindReferences.bindReference(e, inputAttrs).genCode(ctx)
- }
+ val resultVars = bindReferences[Expression](resultExpressions,
ditto
- val resultVars = resultExpressions.map { e =>
-   BindReferences.bindReference(e, inputAttrs).genCode(ctx)
- }
+ val resultVars = bindReferences[Expression](resultExpressions,
ditto
- val eval = resultExpressions.map{ e =>
-   BindReferences.bindReference(e, groupingAttributes).genCode(ctx)
- }
+ val eval = bindReferences[Expression](resultExpressions,
ditto
- .map(BindReferences.bindReference(_, outputSpec.outputColumns))
+ val orderingExpr = bindReferences(
+   requiredOrdering.map(SortOrder(_, Ascending)),
+   outputSpec.outputColumns)
nit: on previous line
- BindReferences.bindReference(e.input, inputAttrs)
- }
+ val boundExpressions = bindReferences(
+   Seq.fill(ordinal)(NoOp) ++ expressions.toSeq.map(_.input),
Seq.fill(ordinal)(NoOp) ++ bindReferences(...)
Note: There are two remaining places that use the pattern collectionLikeThing.map(x => BindReferences.bindReference(x, list)). Those places are HiveTableScanExec (…).
Test build #100810 has finished for PR 23392 at commit
Test build #100813 has finished for PR 23392 at commit
Test build #101148 has finished for PR 23392 at commit
LGTM
Merging to master. Thanks!
Closes apache#23392 from bersprockets/norebuild. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
What changes were proposed in this pull request?
When creating some unsafe projections, Spark rebuilds the map of schema attributes once for each expression in the projection. Some file format readers create one unsafe projection per input file, others create one per task. ProjectExec also creates one unsafe projection per task. As a result, for wide queries on wide tables, Spark might build the map of schema attributes hundreds of thousands of times.
This PR changes two functions to reuse the same AttributeSeq instance when creating BoundReference objects for each expression in the projection. This avoids the repeated rebuilding of the map of schema attributes.
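To make the mechanism concrete, here is a hedged before/after sketch of the binding pattern the description refers to (identifier names are illustrative; the real call sites appear in the review discussion above):

```scala
// Before: each bindReference call implicitly wraps inputSchema in a fresh
// AttributeSeq, rebuilding the attribute lookup map once per expression.
exprs.map(BindReferences.bindReference(_, inputSchema))

// After: convert once, then reuse the same AttributeSeq (and its map)
// for every expression in the projection.
val inputSchemaAttrSeq: AttributeSeq = inputSchema
exprs.map(BindReferences.bindReference(_, inputSchemaAttrSeq))
```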
Benchmarks
The time saved by this PR depends on the size of the schema, the size of the projection, the number of input files (or number of file splits), the number of tasks, and the file format. I chose a couple of example cases.
In the following tests, I ran the query

    select * from table where id1 = 1

Matching rows are about 0.2% of the table.
Orc table 6000 columns, 500K rows, 34 input files

baseline | pr | improvement
----|----|----
1.772306 min | 1.487267 min | 16.082943%
Orc table 6000 columns, 500K rows, 17 input files

baseline | pr | improvement
----|----|----
1.656400 min | 1.423550 min | 14.057595%
Orc table 60 columns, 50M rows, 34 input files

baseline | pr | improvement
----|----|----
0.299878 min | 0.290339 min | 3.180926%
Parquet table 6000 columns, 500K rows, 34 input files

baseline | pr | improvement
----|----|----
1.478306 min | 1.373728 min | 7.074165%
Note: The Parquet reader does not create an unsafe projection. However, the filter operation in the query causes the planner to add a ProjectExec, which does create an unsafe projection for each task. So these results have nothing to do with Parquet itself.
Parquet table 60 columns, 50M rows, 34 input files

baseline | pr | improvement
----|----|----
0.245006 min | 0.242200 min | 1.145099%
CSV table 6000 columns, 500K rows, 34 input files

baseline | pr | improvement
----|----|----
2.390117 min | 2.182778 min | 8.674844%
CSV table 60 columns, 50M rows, 34 input files

baseline | pr | improvement
----|----|----
1.520911 min | 1.510211 min | 0.703526%
How was this patch tested?
SQL unit tests
Python core and SQL test