-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44396][Connect] Direct Arrow Deserialization #42011
Conversation
# Conflicts: # connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
# Conflicts: # connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala # connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderUtils.scala # connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
Scala 2.13 compilation is still broken, and I need to add a couple of tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor doc suggestions to help understand the code. They can be addressed later.
Can you also update the PR description with the purpose of the PR?
Is this mostly for a better structure of how to decode arrow to our native types e.g. Map, Array etc.? Do we expect performance gain too from the change? Dose our own serdes make it faster to obtain the user objects too? If so, can you add some doc in the code too?
arrowSchema = reader.schema | ||
stop |= stopOnArrowSchema | ||
} else if (arrowSchema != reader.schema) { | ||
// Uh oh... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe throw an IllegalStateException or assert something rather than quietly drop? Or we need to at least doc when this happens, which schema to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. The only issue is that this incredibly difficult to test.
} | ||
|
||
override def close(): Unit = SparkResult.this.close() | ||
override def close(): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if not yet read any data using hasNext
or next
, the data is not consumed? Will the data be dropped? Is the cleaner cleaning the resources in a background thread?
Merging this. Test failure is unrelated. |
### What changes were proposed in this pull request? This PR adds direct arrow to user object deserialization to the Spark Connect Scala Client. ### Why are the changes needed? We want to decouple the scala client from catalyst. We need a way to encode user object from and to arrrow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests to `ArrowEncoderSuite`. Closes #42011 from hvanhovell/SPARK-44396. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 5939b75) Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? This PR adds direct arrow to user object deserialization to the Spark Connect Scala Client. ### Why are the changes needed? We want to decouple the scala client from catalyst. We need a way to encode user object from and to arrrow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests to `ArrowEncoderSuite`. Closes apache#42011 from hvanhovell/SPARK-44396. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
What changes were proposed in this pull request?
This PR adds direct arrow to user object deserialization to the Spark Connect Scala Client.
Why are the changes needed?
We want to decouple the scala client from catalyst. We need a way to encode user object from and to arrrow.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added tests to
ArrowEncoderSuite
.