Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44396][Connect] Direct Arrow Deserialization #42011

Closed
wants to merge 20 commits into from

Conversation

hvanhovell
Copy link
Contributor

@hvanhovell hvanhovell commented Jul 14, 2023

What changes were proposed in this pull request?

This PR adds direct arrow to user object deserialization to the Spark Connect Scala Client.

Why are the changes needed?

We want to decouple the scala client from catalyst. We need a way to encode user object from and to arrrow.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests to ArrowEncoderSuite.

# Conflicts:
#	connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala
#	connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderUtils.scala
#	connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/arrow/ArrowEncoderSuite.scala
@hvanhovell
Copy link
Contributor Author

Scala 2.13 compilation is still broken, and I need to add a couple of tests.

Copy link
Contributor

@zhenlineo zhenlineo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor doc suggestions to help understand the code. They can be addressed later.

Can you also update the PR description with the purpose of the PR?
Is this mostly for a better structure of how to decode arrow to our native types e.g. Map, Array etc.? Do we expect performance gain too from the change? Dose our own serdes make it faster to obtain the user objects too? If so, can you add some doc in the code too?

arrowSchema = reader.schema
stop |= stopOnArrowSchema
} else if (arrowSchema != reader.schema) {
// Uh oh...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe throw an IllegalStateException or assert something rather than quietly drop? Or we need to at least doc when this happens, which schema to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The only issue is that this incredibly difficult to test.

}

override def close(): Unit = SparkResult.this.close()
override def close(): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if not yet read any data using hasNext or next, the data is not consumed? Will the data be dropped? Is the cleaner cleaning the resources in a background thread?

@github-actions github-actions bot added the BUILD label Jul 18, 2023
@hvanhovell
Copy link
Contributor Author

Merging this. Test failure is unrelated.

hvanhovell added a commit that referenced this pull request Jul 19, 2023
### What changes were proposed in this pull request?
This PR adds direct arrow to user object deserialization to the Spark Connect Scala Client.

### Why are the changes needed?
We want to decouple the scala client from catalyst. We need a way to encode user object from and to arrrow.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added tests to `ArrowEncoderSuite`.

Closes #42011 from hvanhovell/SPARK-44396.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 5939b75)
Signed-off-by: Herman van Hovell <herman@databricks.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
### What changes were proposed in this pull request?
This PR adds direct arrow to user object deserialization to the Spark Connect Scala Client.

### Why are the changes needed?
We want to decouple the scala client from catalyst. We need a way to encode user object from and to arrrow.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added tests to `ArrowEncoderSuite`.

Closes apache#42011 from hvanhovell/SPARK-44396.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants