[SPARK-6638] [SQL] Improve performance of StringType in SQL #5350

Closed
wants to merge 36 commits into from

Contributor

davies commented Apr 3, 2015

This PR changes the internal representation of StringType from java.lang.String to UTF8String, which is implemented using Array[Byte].

This PR should not break any public API; Row.getString() will still return java.lang.String.

This is the first step toward improving the performance of strings in SQL.
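
For readers skimming the thread, here is a minimal sketch of what a byte-backed string type could look like. It is not the class added by this PR: `SketchUTF8String` and `fromString` are hypothetical names, and the byte-wise unsigned comparison is an assumption about how ordering might be defined.

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical sketch: a string stored as UTF-8 bytes instead of java.lang.String.
final class SketchUTF8String(private val bytes: Array[Byte])
    extends Ordered[SketchUTF8String] with Serializable {

  // Compare byte by byte, treating each byte as unsigned (assumed ordering).
  override def compare(that: SketchUTF8String): Int = {
    val n = math.min(bytes.length, that.bytes.length)
    var i = 0
    var result = 0
    while (i < n && result == 0) {
      result = (bytes(i) & 0xff) - (that.bytes(i) & 0xff)
      i += 1
    }
    // If one value is a prefix of the other, the shorter one sorts first.
    if (result != 0) result else bytes.length - that.bytes.length
  }

  override def equals(other: Any): Boolean = other match {
    case s: SketchUTF8String => Arrays.equals(bytes, s.bytes)
    case _ => false
  }

  override def hashCode: Int = Arrays.hashCode(bytes)

  // Decoding happens only at the public boundary, e.g. in a getString-style accessor.
  override def toString: String = new String(bytes, StandardCharsets.UTF_8)
}

object SketchUTF8String {
  def fromString(s: String): SketchUTF8String =
    new SketchUTF8String(s.getBytes(StandardCharsets.UTF_8))
}
```

In this sketch a public getter would call `toString` to hand back a java.lang.String, which is one way the public API could stay unchanged while the internal representation differs.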

cc @rxin

Davies Liu added 21 commits March 30, 2015 22:42
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
	sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
	sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala

SparkQA commented Apr 3, 2015

Test build #29676 has finished for PR 5350 at commit 8d17f21.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class MutableString extends MutableValue
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

case s: String =>
// for tests
throw new Exception("String should be converted into UTF8String")
case other => values(ordinal).update(value)
Contributor

I'm not actually sure, but I wonder if it would be faster to do a plain null check followed by an asInstanceOf, rather than pattern matching. You will still get a class cast exception in tests if there is a mistake.
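
To make the two options concrete, here is a small sketch of both variants. `SketchBytes` and `SketchMutableSlot` are hypothetical stand-ins, not the classes in the diff, and whether the cast is actually faster would need a benchmark.

```scala
// Hypothetical stand-in for the value type stored in the mutable slot.
final class SketchBytes(val bytes: Array[Byte])

// Hypothetical mutable holder, loosely modelled on the snippet under review.
final class SketchMutableSlot {
  var isNull: Boolean = true
  var value: SketchBytes = null

  // Variant 1: pattern matching, with an explicit guard against plain Strings.
  def updateWithMatch(v: Any): Unit = v match {
    case null =>
      value = null
      isNull = true
    case _: String =>
      throw new IllegalArgumentException("String should be converted first")
    case b: SketchBytes =>
      value = b
      isNull = false
  }

  // Variant 2: plain null check followed by a cast.
  // A wrong type still fails loudly, with a ClassCastException.
  def updateWithCast(v: Any): Unit = {
    if (v == null) {
      value = null
      isNull = true
    } else {
      value = v.asInstanceOf[SketchBytes]
      isNull = false
    }
  }
}
```

Both variants reject a plain String; they differ only in the exception type and in how much type testing happens on the hot path.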

SparkQA commented Apr 3, 2015

Test build #29689 has finished for PR 5350 at commit fd11364.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class MutableString extends MutableValue
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.


final class UTF8String extends Ordered[UTF8String] with Serializable {

private[this] var bytes: Array[Byte] = _
Contributor

I had assumed that we would want to use bytes: Array[Byte] + length: Int so that the same byte array could be reused multiple times for different values. It seems that allocating and zeroing out the byte arrays could actually be pretty expensive.
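
A rough sketch of the bytes-plus-length idea follows. `SketchReusableBytes` is a hypothetical name and the growth policy is an assumption; the point is only that the backing array is reused when the next value fits.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical sketch: only the first `length` bytes of the buffer are valid,
// so the same array can hold different values without reallocation.
final class SketchReusableBytes(initialCapacity: Int) {
  private var bytes: Array[Byte] = new Array[Byte](initialCapacity)
  private var length: Int = 0

  // Copy srcLen bytes from src into the buffer, growing it only when needed.
  def set(src: Array[Byte], srcLen: Int): Unit = {
    if (srcLen > bytes.length) {
      bytes = new Array[Byte](math.max(srcLen, bytes.length * 2))
    }
    System.arraycopy(src, 0, bytes, 0, srcLen)
    length = srcLen
  }

  def capacity: Int = bytes.length
  def numBytes: Int = length

  // Decode only the valid prefix of the buffer.
  override def toString: String = new String(bytes, 0, length, StandardCharsets.UTF_8)
}
```

The trade-off in this sketch is one copy per value in exchange for avoiding a fresh allocation per value.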

Contributor

Ah, okay. I talked to @rxin, and we are going to try to do this later?

Contributor Author

Right now, UTF8String takes the bytes from Binary.getBytes or String.getBytes without copying them; no copy is made until copy() is called explicitly.

Contributor

Which Binary are you referring to here? Also, can you explain what you mean by "no copy" here?

Contributor Author

The Binary is parquet.io.api.Binary. When we create a UTF8String from Binary.getBytes, we do not need to make another copy of the bytes.

Before this patch, we created a copy as a String.
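
The difference between wrapping and copying can be sketched as follows. `SketchWrappedBytes` and `sharesStorageWith` are hypothetical names used only for illustration; a plain byte array stands in for whatever Binary.getBytes returns.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical sketch: the wrapper keeps a reference to the caller's array.
final class SketchWrappedBytes(private val bytes: Array[Byte]) {
  // True if this value is backed by exactly the given array (no copy was made).
  def sharesStorageWith(arr: Array[Byte]): Boolean = bytes eq arr

  // An explicit copy detaches the value from the original array.
  def copy(): SketchWrappedBytes = new SketchWrappedBytes(bytes.clone())

  override def toString: String = new String(bytes, StandardCharsets.UTF_8)
}
```

One consequence of this design is that a wrapped value observes later writes to the source array, so a caller that reuses its buffer would need to call `copy()` first.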

Contributor Author

davies commented Apr 10, 2015

@marmbrus Could you take a final look at this?

SparkQA commented Apr 13, 2015

Test build #30178 has finished for PR 5350 at commit 6ce7c0b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala
	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

SparkQA commented Apr 13, 2015

Test build #30181 has finished for PR 5350 at commit 744788f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

SparkQA commented Apr 13, 2015

Test build #30187 has finished for PR 5350 at commit 341ec2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

SparkQA commented Apr 14, 2015

Test build #670 has started for PR 5350 at commit 341ec2c.

* java.lang.String -> UTF8String
* java.lang.Decimal -> Decimal
*/
def needConversion: Boolean = true
Contributor

We should comment that the internal representation is not stable across releases and thus data sources outside of Spark SQL should leave this as true.
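
As a loose illustration of how such a flag might be consumed, here is a self-contained sketch. `SketchRelation` and `SketchScan` are hypothetical and deliberately do not use Spark's API; a UTF-8 byte array stands in for the internal string type.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for a data source relation; not Spark's actual API.
trait SketchRelation {
  // External sources leave this as true so their rows get converted on the way in.
  def needConversion: Boolean = true
  def rows: Seq[Seq[Any]]
}

object SketchScan {
  // Hypothetical conversion of an external value to an internal representation.
  def toInternal(value: Any): Any = value match {
    case s: String => s.getBytes(StandardCharsets.UTF_8)
    case other => other
  }

  // Convert rows only when the relation says its values are in external form.
  def scan(relation: SketchRelation): Seq[Seq[Any]] =
    if (relation.needConversion) relation.rows.map(_.map(toInternal))
    else relation.rows
}
```

A source that overrides the flag to false is promising that its rows already use the internal types, which is why that is only safe for code that tracks the internal representation release by release.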

Contributor Author

davies commented Apr 15, 2015

@marmbrus done.

SparkQA commented Apr 15, 2015

Test build #30286 has finished for PR 5350 at commit 59025c8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

var idx = 0
while (idx < row.size) {
while (idx < converters.size && idx < row.size) {
Contributor

Why is the number of converters different from the number of fields in a row?

Contributor Author

In HiveQuerySuite's "Add JAR command 2", the size of the row is 1 but the number of fields is 0. Maybe Hive 1.3 has changed the result?

Contributor

Can we change size to length? Otherwise it allocates a new object for arrays.
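
A sketch of the loop with that suggestion applied, assuming both the row and the converters are plain arrays. `SketchRowConversion` and `convertRow` are hypothetical names. On an array, `length` is a direct field read, while `size` goes through an implicit wrapper, which is where the extra object can come from depending on the Scala version and the JIT.

```scala
object SketchRowConversion {
  // Apply the i-th converter to the i-th field, stopping at whichever array is shorter.
  def convertRow(row: Array[Any], converters: Array[Any => Any]): Array[Any] = {
    val out = row.clone()
    // Use length rather than size, and hoist the bound out of the loop.
    val n = math.min(converters.length, row.length)
    var idx = 0
    while (idx < n) {
      out(idx) = converters(idx)(row(idx))
      idx += 1
    }
    out
  }
}
```

Fields beyond the last converter are passed through unchanged, which matches the guard in the diff for the case where the row has more entries than the schema has fields.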

Contributor

btw, where is "ADD JAR command 2"? I could not find it...

Contributor Author

It's a new test case in master; I had to merge with master to debug it.

Contributor

rxin commented Apr 15, 2015

BTW - since this changes so many files, it'd be great to merge this as soon as possible. We can fix minor problems later in follow up PRs.

SparkQA commented Apr 15, 2015

Test build #30297 has finished for PR 5350 at commit 2772f0d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

val schema = StructType(
StructField("result", IntegerType, false) :: Nil)
schema.toAttributes
}
Contributor

I do not really know why the result of AddJar is a Row(0) (see a few lines below), but we can figure it out after we merge it.

Contributor

OK, the reason is to match the behavior of Hive... This change looks good.

Contributor

yhuai commented Apr 15, 2015

LGTM

SparkQA commented Apr 15, 2015

Test build #30303 has finished for PR 5350 at commit 3b7bfa8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait CaseConversionExpression
    • final class UTF8String extends Ordered[UTF8String] with Serializable
  • This patch does not change any dependencies.

@marmbrus
Contributor

Thanks! Merged to master!

5 participants