[SPARK-6638] [SQL] Improve performance of StringType in SQL #5350

davies · 2015-04-03T16:02:13Z

This PR changes the internal representation of StringType from java.lang.String to UTF8String, which is implemented using Array[Byte].

This PR should not break any public API; Row.getString() will still return java.lang.String.

This is the first step toward improving the performance of String in SQL.

cc @rxin

Conflicts:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala

Conflicts:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
- sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
- sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
- sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
- sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala

SparkQA · 2015-04-03T17:36:54Z

Test build #29676 has finished for PR 5350 at commit 8d17f21.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class MutableString extends MutableValue
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

marmbrus · 2015-04-03T19:42:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SpecificMutableRow.scala

+    case s: String =>
+      // for tests
+      throw new Exception("String should be converted into UTF8String")
+    case other => values(ordinal).update(value)


I'm not actually sure, but I wonder if it would be faster to do a plain null check followed by an asInstanceOf, rather than pattern matching. You will still get a class cast exception in tests if there is a mistake.
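The two shapes being compared can be sketched as follows. This is only an illustration: `Utf8Bytes` and `UpdateSketch` are hypothetical stand-ins, not the classes in this PR, and the sketch makes no claim about which variant is faster.

```scala
// Hypothetical stand-in for the internal string type; not the PR's UTF8String.
final class Utf8Bytes(val bytes: Array[Byte])

object UpdateSketch {
  // Variant A: pattern matching with an explicit guard for unconverted Strings.
  def updateWithMatch(slot: Array[Any], ordinal: Int, value: Any): Unit = value match {
    case null => slot(ordinal) = null
    case _: String =>
      throw new IllegalArgumentException("String should be converted into UTF8String")
    case other => slot(ordinal) = other
  }

  // Variant B: plain null check followed by a cast. A stray String still
  // fails, but with a ClassCastException instead of a custom message.
  def updateWithCast(slot: Array[Utf8Bytes], ordinal: Int, value: Any): Unit = {
    if (value == null) slot(ordinal) = null
    else slot(ordinal) = value.asInstanceOf[Utf8Bytes]
  }
}
```

In variant B the mistake is still caught in tests; the only thing lost is the descriptive error message.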

SparkQA · 2015-04-03T20:28:24Z

Test build #29689 has finished for PR 5350 at commit fd11364.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class MutableString extends MutableValue
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

marmbrus · 2015-04-03T21:02:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/types/UTF8String.scala

+
+final class UTF8String extends Ordered[UTF8String] with Serializable {
+
+  private[this] var bytes: Array[Byte] = _


I had assumed that we would want to use bytes: Array[Byte] + length: Int so that the same byte array could be reused multiple times for different values. It seems that allocating and zeroing out the byte arrays could actually be pretty expensive.

Ah, okay, I talked to @rxin, and we are going to try to do this later?

Right now, UTF8String takes the bytes from Binary.getBytes or String.getBytes without copying them, until copy() is called explicitly.

Which Binary are you referring to here? Also, can you explain what you mean by "no copy" here?

The Binary is parquet.io.api.Binary. When we create a UTF8String from Binary.getBytes, we do not need to make another copy of the bytes.

Before this patch, we would create a copy as a String.
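The "no copy until copy()" behavior described above can be sketched roughly as below. `Utf8Sketch` and its methods are hypothetical and much simpler than the UTF8String class added by this PR; the sketch assumes only that the wrapper holds a reference to the caller's byte array.

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical, simplified wrapper; the real UTF8String has more methods.
final class Utf8Sketch private (private val bytes: Array[Byte]) {
  // The only place a new array is allocated: an explicit copy.
  def copy(): Utf8Sketch = new Utf8Sketch(Arrays.copyOf(bytes, bytes.length))

  // True when this wrapper still points at the caller's array (for illustration).
  def sharesBufferWith(other: Array[Byte]): Boolean = bytes eq other

  override def toString: String = new String(bytes, StandardCharsets.UTF_8)
}

object Utf8Sketch {
  // Wraps the given array as-is, e.g. bytes obtained from a Parquet Binary.
  def fromBytes(bytes: Array[Byte]): Utf8Sketch = new Utf8Sketch(bytes)

  // Encodes once; the resulting array is wrapped without a second copy.
  def fromString(s: String): Utf8Sketch =
    new Utf8Sketch(s.getBytes(StandardCharsets.UTF_8))
}
```

The trade-off is that the wrapper aliases the caller's buffer, so a caller that later reuses that buffer needs to call copy() first.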

davies · 2015-04-10T20:09:20Z

@marmbrus Could you take a final look at this?

SparkQA · 2015-04-13T18:45:04Z

Test build #30178 has finished for PR 5350 at commit 6ce7c0b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

Conflicts:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala
- sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

SparkQA · 2015-04-13T18:54:55Z

Test build #30181 has finished for PR 5350 at commit 744788f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

SparkQA · 2015-04-13T21:09:11Z

Test build #30187 has finished for PR 5350 at commit 341ec2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

SparkQA · 2015-04-14T21:12:09Z

Test build #670 has started for PR 5350 at commit 341ec2c.

marmbrus · 2015-04-14T22:49:47Z

sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

+   *  java.lang.String -> UTF8String
+   *  java.lang.Decimal -> Decimal
+   */
+  def needConversion: Boolean = true


We should comment that the internal representation is not stable across releases and thus data sources outside of Spark SQL should leave this as true.

davies · 2015-04-15T00:37:15Z

@marmbrus done.

SparkQA · 2015-04-15T02:06:11Z

Test build #30286 has finished for PR 5350 at commit 59025c8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

yhuai · 2015-04-15T03:22:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala

    var idx = 0
-    while (idx < row.size) {
+    while (idx < converters.size && idx < row.size) {


Why is the number of converters different from the number of fields in a row?

For HiveQuerySuite "Add JAR command 2", the size of the row is 1, but the number of fields is 0. Maybe Hive 1.3 has changed the result?

Can we change size to length? Otherwise it allocates a new object for arrays.

BTW, where is "ADD JAR command 2"? I could not find it...

It's a new test case in master; I had to merge with master to debug it.
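A minimal sketch of the guarded loop, following the suggestion to use length rather than size on arrays. `ConvertSketch.convertRow` is a hypothetical helper, not the code in CatalystTypeConverters, and it assumes both the row and the converters are plain arrays.

```scala
object ConvertSketch {
  // Hypothetical helper: applies the i-th converter to the i-th value and
  // stops at the shorter of the two arrays, mirroring the guard in the diff.
  def convertRow(row: Array[Any], converters: Array[Any => Any]): Array[Any] = {
    // `length` reads the array length directly, without going through the
    // implicit collection wrapper that `size` uses on arrays.
    val n = math.min(row.length, converters.length)
    val out = new Array[Any](n)
    var idx = 0
    while (idx < n) {
      out(idx) = converters(idx)(row(idx))
      idx += 1
    }
    out
  }
}
```

With a one-element row and no converters, as in the failing test discussed above, the loop body never runs and no out-of-bounds access occurs.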

rxin · 2015-04-15T04:31:34Z

BTW - since this changes so many files, it'd be great to merge this as soon as possible. We can fix minor problems later in follow up PRs.

SparkQA · 2015-04-15T04:55:28Z

Test build #30297 has finished for PR 5350 at commit 2772f0d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

yhuai · 2015-04-15T05:18:22Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala

+    val schema = StructType(
+      StructField("result", IntegerType, false) :: Nil)
+    schema.toAttributes
+  }


I do not really know why the result of AddJar is a Row(0) (see a few lines below). But we can figure it out after we merge it.

OK, the reason is to match the behavior of Hive... This change looks good.

yhuai · 2015-04-15T05:25:57Z

LGTM

SparkQA · 2015-04-15T06:56:15Z

Test build #30303 has finished for PR 5350 at commit 3b7bfa8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait CaseConversionExpression
- final class UTF8String extends Ordered[UTF8String] with Serializable
This patch does not change any dependencies.

marmbrus · 2015-04-15T20:07:02Z

Thanks! Merged to master!


Davies Liu added 21 commits March 30, 2015 22:42

use UTF8String instead of String for StringType

685fd07

cleanup

21f67c6

use Array[Byte] in UTF8String

4699c3a

fix utf8 for python api

d32abd1

refactor

a85fb27

fix style

6b499ac

fix sql tests

5f9e120

fix python sql tests

38c303e

fix some catalyst tests

c7dd4d2

fix scala style

bb52e44

fix codegen with UTF8String

8b45864

refactor

23a766c

fix some hive tests

9dc32d1

fix hive tests

956b0a4

convert data type for data source

9f4c194

some comment about Date

537631c

refactor

28d6f32

remove clone in UTF8String

e5fa5b8

fix hive compatibility tests

8d17f21

optimize UTF8String

fd11364

marmbrus reviewed Apr 3, 2015

address comment

ac18ae6

marmbrus reviewed Apr 3, 2015

davies force-pushed the string branch from 6ce7c0b to 744788f Compare April 13, 2015 18:50

turn off scala style check in UTF8StringSuite

341ec2c

marmbrus reviewed Apr 14, 2015

address comments from @marmbrus

59025c8

Davies Liu added 2 commits April 14, 2015 19:51

Merge branch 'master' of github.com:apache/spark into string

6d776a9

fix new test failure

2772f0d

yhuai reviewed Apr 15, 2015

fix schema of AddJar

3b7bfa8

yhuai reviewed Apr 15, 2015

yhuai mentioned this pull request Apr 15, 2015

[SPARK-5794] [SQL] fix add jar #4586

Closed

asfgit closed this in 8584276 Apr 15, 2015

scwf mentioned this pull request Apr 19, 2015

[SPARK-6997][SQL] Convert StringType in LocalTableScan #5579

Closed

rxin added a commit to rxin/spark that referenced this pull request Apr 24, 2015

[SQL] Fixed expression data type matching.

336a36d

There was a bug introduced by apache#5350

rxin mentioned this pull request Apr 24, 2015

[SQL] Fixed expression data type matching. #5675

Closed

rayortigas mentioned this pull request Apr 27, 2015

[SPARK-7160][SQL] Support converting DataFrames to typed RDDs. #5713

Closed
