[SPARK-1368][SQL] Optimized HiveTableScan #758

liancheng · 2014-05-13T16:52:41Z

This PR introduces two major updates:

- Replaced FP-style code with a while loop and a reusable GenericMutableRow object in the critical path of HiveTableScan.
- Used ColumnProjectionUtils to help optimize RCFile and ORC column pruning.
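
To make the first item concrete, here is a minimal, self-contained sketch of the general technique. It is not the actual HiveTableScan code: `MutableRowSketch`, `ReusableRow`, `scanFunctional`, and `scanWithReuse` are hypothetical names, and `ReusableRow` is only a stand-in for GenericMutableRow.

```scala
object MutableRowSketch {
  // Hypothetical stand-in for GenericMutableRow: a fixed-width row updated in place.
  final class ReusableRow(width: Int) {
    private val values = new Array[Any](width)
    def update(ordinal: Int, value: Any): Unit = values(ordinal) = value
    def apply(ordinal: Int): Any = values(ordinal)
  }

  // FP style: allocates a fresh output array for every record.
  def scanFunctional(records: Iterator[Array[String]]): Iterator[Array[Any]] =
    records.map(fields => fields.map(f => f: Any))

  // Imperative style: one row object is allocated up front and refilled
  // for every record with a plain while loop.
  def scanWithReuse(records: Iterator[Array[String]], width: Int): Iterator[ReusableRow] = {
    val row = new ReusableRow(width)
    records.map { fields =>
      var i = 0
      while (i < width) {
        row(i) = fields(i)
        i += 1
      }
      row
    }
  }
}
```

The trade-off in this sketch is that `scanWithReuse` hands back the same row instance every time, so a consumer has to read or copy the values before advancing the iterator.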

My quick micro benchmark suggests that these two optimizations make scanning about 2x faster for the CSV table and about 2.5x faster for the RCFile table:

Original:

[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms

Optimized:

[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
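
For reference, the 2x and 2.5x figures can be reproduced from the runs listed above. The snippet below is only a small arithmetic sketch; `SpeedupSketch` and its helpers are hypothetical names, and taking the mean of the three runs is an assumption about how to summarize them.

```scala
object SpeedupSketch {
  // Timings in milliseconds, copied from the runs listed above.
  val originalCsv = Seq(27676L, 27703L, 27511L)
  val optimizedCsv = Seq(13820L, 14158L, 13606L)
  val originalRcfile = Seq(26415L, 26029L, 25962L)
  val optimizedRcfile = Seq(10402L, 10691L, 10346L)

  def mean(xs: Seq[Long]): Double = xs.sum.toDouble / xs.size

  // Speedup = mean original time / mean optimized time.
  def speedup(original: Seq[Long], optimized: Seq[Long]): Double =
    mean(original) / mean(optimized)

  def main(args: Array[String]): Unit = {
    println(f"CSV speedup:    ${speedup(originalCsv, optimizedCsv)}%.2fx")
    println(f"RCFile speedup: ${speedup(originalRcfile, optimizedRcfile)}%.2fx")
  }
}
```

With these numbers the ratios come out at roughly 1.99 for CSV and 2.49 for RCFile.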

The micro benchmark loads a 609MB CSV file (structurally similar to the src test table) into a normal Hive table with LazySimpleSerDe and into an RCFile table, then scans each table.

Preparation code:

package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}

Benchmark code:

package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}

@marmbrus Please help review, thanks!

AmplabJenkins · 2014-05-13T16:52:57Z

Merged build triggered.

AmplabJenkins · 2014-05-13T16:53:05Z

Merged build started.

marmbrus · 2014-05-13T18:03:57Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala

+
+          case _ =>
+            buffered.map {
+              (_, Array.empty[String])


I think this is allocating a new array every time.
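
One way to address this, sketched under assumptions: hoist the empty array into a single shared value so the closure only references it. `EmptyArraySketch`, `NoPartitionKeys`, and `withoutPartitionKeys` are hypothetical names, not identifiers from hiveOperators.scala.

```scala
object EmptyArraySketch {
  // Allocated once and shared by every row that has no partition keys.
  private val NoPartitionKeys: Array[String] = Array.empty[String]

  // Pairs each row with the shared empty array instead of evaluating
  // Array.empty[String] inside the closure, which runs once per row.
  def withoutPartitionKeys[A](rows: Iterator[A]): Iterator[(A, Array[String])] =
    rows.map(row => (row, NoPartitionKeys))
}
```

Sharing is safe here because a zero-length array has no elements that a consumer could mutate.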

AmplabJenkins · 2014-05-13T18:04:10Z

Merged build finished.

AmplabJenkins · 2014-05-13T18:04:11Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14941/

marmbrus · 2014-05-13T18:05:23Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala

+            }
+        }
+
+        rowsAndPartitionKeys.map { case (deserializedRow, partitionKeys) =>


I'm curious if there is a cost to pattern matching here instead of using _1 and _2?

liancheng · 2014-05-14

Yes, there is a Product2[A, B].unapply function call cost. Removing it gives a 0.3% speedup.
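
For readers comparing the two forms, here is a small sketch with hypothetical names (`TupleAccessSketch`, `describeWithPattern`, `describeWithAccessors`). It only shows that both styles compute the same result; it does not measure anything, and the 0.3% figure is the one reported in this thread.

```scala
object TupleAccessSketch {
  // Pattern-matching form: concise, destructures the pair in a case clause.
  def describeWithPattern(rows: Iterator[(String, Array[String])]): Iterator[String] =
    rows.map { case (row, partitionKeys) => row + ":" + partitionKeys.mkString(",") }

  // Accessor form: reads the two fields directly through _1 and _2.
  def describeWithAccessors(rows: Iterator[(String, Array[String])]): Iterator[String] =
    rows.map(pair => pair._1 + ":" + pair._2.mkString(","))
}
```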

marmbrus · 2014-05-13T18:16:06Z

Nice speed up! :)

I looked at the test failure. Looks like this TODO is finally coming back to bite us. Instead of looking for any Sort, we should walk the tree until we find either a Sort or an operation that doesn't preserve ordering (join, aggregate, etc.).

Once we fix that I'd propose merging this right away and then addressing the other possible suggestions in a followup PR.

AmplabJenkins · 2014-05-15T06:32:57Z

Merged build triggered.

AmplabJenkins · 2014-05-15T06:33:04Z

Merged build started.

AmplabJenkins · 2014-05-15T07:48:58Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-15T07:48:59Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15017/

liancheng · 2014-05-15T08:18:55Z

@marmbrus I worked around the test failure by adding a SortedOperation pattern that conservatively matches some definitely sorted operations (erring toward false negatives rather than false positives). This may slow down the test suite a bit. Since most test outputs are empty or very small, this shouldn't be an issue right now.

Two new optimizations applied:

- Using mutable pairs
- Avoiding pattern matching function calls (Array.unapplySeq)

New micro benchmark data:

Original:

[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms

Optimized:

[info] CSV: 12357 ms, RCFile: 9283 ms
[info] CSV: 12291 ms, RCFile: 9298 ms
[info] CSV: 12325 ms, RCFile: 9242 ms

As for Hive data unwrapping, I couldn't find a "static" method to eliminate right now. Any hints?

marmbrus · 2014-05-15T17:52:42Z

> @marmbrus I worked around the test failure by adding a SortedOperation pattern that conservatively matches some definitely sorted operations (false negative rather than false positive). This may slow down the test suite a bit. Since most test output are empty or very small, this shouldn't be an issue right now.

I think false negatives are the wrong direction to go here. A false negative means that we think the query is not ordered when it should be and thus are disregarding the order when we should in fact be checking it.

Maybe it would be better to recursively walk the tree looking explicitly for nodes that do not preserve order (aggregation, join, base relations) and then return false. Sorts would return true. Thoughts?
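
A rough sketch of that recursive walk, using a deliberately simplified plan type. The node classes below are hypothetical and are not Catalyst's; treating Project and Filter as order-preserving is an assumption made for the example.

```scala
object OrderingSketch {
  // Hypothetical, simplified plan nodes; these are not Catalyst's classes.
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Sort(child: Plan) extends Plan
  final case class Project(child: Plan) extends Plan
  final case class Filter(child: Plan) extends Plan
  final case class Aggregate(child: Plan) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan

  // Walks down from the root: a Sort answers true, an operator that does not
  // preserve order answers false, and order-preserving operators defer to
  // their child.
  def isSorted(plan: Plan): Boolean = plan match {
    case Sort(_)        => true
    case Project(child) => isSorted(child)
    case Filter(child)  => isSorted(child)
    case Aggregate(_)   => false
    case Join(_, _)     => false
    case Relation(_)    => false
  }
}
```

With this shape, a Sort buried under an aggregate or a join no longer makes the whole query count as ordered, and the sealed trait lets the compiler flag any operator the walk forgets to classify.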

> New micro benchmark data:

Sweet, looks like we shaved off a little bit more, so these optimizations were worth it! It would be good to make notes on which changes lead to what kind of speed up here. That way, we can better focus our efforts when we optimize in the future.

> As for Hive data unwrapping, I couldn't find a "static" method to eliminate right now. Any hints?

My thought was that you will create an Array of Any => Any functions that can be applied to each column. This way you only match on the datatype once, at the beginning, and then simply index into this array instead of matching for each data item.
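
Something along these lines, as a minimal sketch. `UnwrapSketch`, `ColumnType`, and `unwrapperFor` are hypothetical; the real code would choose each function from the Hive object inspector rather than from a toy column type.

```scala
object UnwrapSketch {
  // Hypothetical column types standing in for Hive object inspectors.
  sealed trait ColumnType
  case object IntColumn extends ColumnType
  case object StringColumn extends ColumnType

  // The match on the data type happens here, once per column.
  def unwrapperFor(columnType: ColumnType): Any => Any = columnType match {
    case IntColumn    => (raw: Any) => raw.toString.toInt
    case StringColumn => (raw: Any) => raw.toString
  }

  def scan(schema: Seq[ColumnType], rows: Iterator[Array[Any]]): Iterator[Array[Any]] = {
    // Built before any row is scanned.
    val unwrappers: Array[Any => Any] = schema.map(unwrapperFor).toArray
    rows.map { raw =>
      val out = new Array[Any](unwrappers.length)
      var i = 0
      while (i < unwrappers.length) {
        // Per value: index into the array and apply, with no type match.
        out(i) = unwrappers(i)(raw(i))
        i += 1
      }
      out
    }
  }
}
```

The per-row loop then contains only an array lookup and a function call for each column.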

AmplabJenkins · 2014-05-18T12:27:57Z

Merged build triggered.

AmplabJenkins · 2014-05-18T12:28:07Z

Merged build started.

AmplabJenkins · 2014-05-18T13:46:43Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-18T13:46:44Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15069/

liancheng · 2014-05-19T21:42:38Z

@marmbrus Updated HiveComparisonTest and removed SortedOperation, how about this version?

AmplabJenkins · 2014-05-28T13:27:58Z

Merged build triggered.

AmplabJenkins · 2014-05-28T13:28:07Z

Merged build started.

AmplabJenkins · 2014-05-28T14:41:34Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-28T14:41:34Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15251/

@marmbrus

JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #758 from liancheng/fastHiveTableScan and squashes the following commits:

4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan

(cherry picked from commit 8f7141f)
Signed-off-by: Michael Armbrust <michael@databricks.com>

marmbrus · 2014-05-29T22:27:42Z

First merge as a committer :)

Thanks for doing this!

This is a follow up of PR #758. The `unwrapHiveData` function is now composed statically before actual rows are scanned according to the field object inspector to avoid dynamic dispatching cost.

According to the same micro benchmark used in PR #758, this simple change brings slight performance boost: 2.5% for CSV table and 1% for RCFile table.

```
Optimized version:

CSV: 6870 ms, RCFile: 5687 ms
CSV: 6832 ms, RCFile: 5800 ms
CSV: 6822 ms, RCFile: 5679 ms
CSV: 6704 ms, RCFile: 5758 ms
CSV: 6819 ms, RCFile: 5725 ms

Original version:

CSV: 7042 ms, RCFile: 5667 ms
CSV: 6883 ms, RCFile: 5703 ms
CSV: 7115 ms, RCFile: 5665 ms
CSV: 7020 ms, RCFile: 5981 ms
CSV: 6871 ms, RCFile: 5906 ms
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #935 from liancheng/staticUnwrapping and squashes the following commits:

c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.

(cherry picked from commit 862283e)
Signed-off-by: Michael Armbrust <michael@databricks.com>

liancheng added 5 commits May 28, 2014 21:25

eb62fd3 [SPARK-1368] Optimized HiveTableScan
6d1c642 Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
bf0e7dc Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
cf640d8 More HiveTableScan optimisations: - Using mutable pairs - Avoiding pattern matching (Array.unapply function calls)
4241a19 Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest

asfgit closed this in 8f7141f May 29, 2014

liancheng mentioned this pull request Jun 1, 2014

Avoid dynamic dispatching when unwrapping Hive data. #935

Closed

liancheng mentioned this pull request Jul 19, 2014

[SPARK-2523] [SQL] Hadoop table scan bug fixing #1439

Closed

liancheng mentioned this pull request Dec 30, 2014

SPARK-4963 [SQL] Add copy to SQL's Sample operator #3827

Closed

yanboliang mentioned this pull request Feb 14, 2015

[SPARK-5738] [SQL] Reuse mutable row for each record at jsonStringToRow #4527

Closed
