-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6638] [SQL] Improve performance of StringType in SQL #5350
Conversation
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala
Test build #29676 has finished for PR 5350 at commit
|
case s: String => | ||
// for tests | ||
throw new Exception("String should be converted into UTF8String") | ||
case other => values(ordinal).update(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure, but I wonder if it would be faster to do a plain null check followed by an asInstanceOf, rather than pattern matching. You will still get a class cast exception in tests if there is a mistake.
Test build #29689 has finished for PR 5350 at commit
|
|
||
final class UTF8String extends Ordered[UTF8String] with Serializable { | ||
|
||
private[this] var bytes: Array[Byte] = _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had assumed that we would want to use bytes: Array[Byte]
+ length: Int
so that the same byte array could be reused multiple times for different values. It seems that allocating and zeroing out the byte arrays could actually be pretty expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay talked to @rxin and we are going to try and do this later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, UTF8String
will take bytes from Binary.getBytes
or String.getBytes
, no copy, until we call copy() explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the Binary
that you are referring to at here? Also, can you explain what do you mean by no copy
at here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Binary
is parquet.io.api.Binary
, When we create a UTFString
from Binary.getBytes
, we does not need to do another copy for bytes.
Before this patch, we will create a copy as String
.
@marmbrus Could you take a final look on this? |
Test build #30178 has finished for PR 5350 at commit
|
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Test build #30181 has finished for PR 5350 at commit
|
Test build #30187 has finished for PR 5350 at commit
|
Test build #670 has started for PR 5350 at commit |
* java.lang.String -> UTF8String | ||
* java.lang.Decimal -> Decimal | ||
*/ | ||
def needConversion: Boolean = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should comment that the internal representation is not stable across releases and thus data sources outside of Spark SQL should leave this as true.
@marmbrus done. |
Test build #30286 has finished for PR 5350 at commit
|
var idx = 0 | ||
while (idx < row.size) { | ||
while (idx < converters.size && idx < row.size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the number of converters is different with the number of fields in a row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For HiveQuerySuite "Add JAR command 2", the size of row is 1, but the number of fields is 0, maybe Hive 1.3 has changed the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change size to length? otherwise it allocates a new object for arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, where is "ADD JAR command 2"? I could not find it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a new test case in master, I had to merged with master to debug it.
BTW - since this changes so many files, it'd be great to merge this as soon as possible. We can fix minor problems later in follow up PRs. |
Test build #30297 has finished for PR 5350 at commit
|
val schema = StructType( | ||
StructField("result", IntegerType, false) :: Nil) | ||
schema.toAttributes | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not really know the reason that the result of AddJar is a Row(0)
(see a few lines below.). But, we can figure it out after we merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the reason is to match the behavior of Hive... This change looks good.
LGTM |
Test build #30303 has finished for PR 5350 at commit
|
Thanks! Merged to master! |
There was a bug introduced by apache#5350
This PR change the internal representation for StringType from java.lang.String to UTF8String, which is implemented use ArrayByte.
This PR should not break any public API, Row.getString() will still return java.lang.String.
This is the first step of improve the performance of String in SQL.
cc @rxin