-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44398][CONNECT] Scala foreachBatch API #41969
Conversation
cc: @bogao007 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, left some questions regarding to the hack.
* Handles setting up Scala remote session and other Spark Connect environment and then | ||
* runs the provided foreachBatch function `fn`. | ||
* | ||
* HACK ALERT: This version does not atually set up Spark connect. Directly passes the DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 major questions regarding this:
- Is the missing part about setting up a Spark Connect session and converting the legacy DataFrame to Spark Connect DataFrame and being executed inside the Spark Connect session? Do we have any Scala example on setting up Spark Connect session on server side and use it?
- When is
getDataFrameOrThrow()
being called? Is it only needed for Python or do we need to get the DataFrame by ID inside the Spark Connect session for Scala?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, it is about setting up spark remote session. I don't think there are examples of doing that in Scala.
- Not sure about the second one. Usually
df.sparkSession
gives the access to session.
...r/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
Outdated
Show resolved
Hide resolved
@zhenlineo could you review this? I am going to get the test working right. |
Merged to master, thanks! |
This implements Scala foreachBatch(). The implementation basic and needs some more enhancements. The server side will be shared by Python implementation as well. One notable hack in this PR is that it runs user's `foreachBatch()` with regular(legacy) DataFrame, rather than setting up remote Spark connect session and connect DataFrame. ### Why are the changes needed? Adds foreachBatch() support in Scala Spark Connect. ### Does this PR introduce _any_ user-facing change? Yes. Adds foreachBatch() API ### How was this patch tested? - A simple unit test. Closes apache#41969 from rangadi/feb-scala. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Xinrong Meng <xinrong@apache.org>
This implements Scala foreachBatch(). The implementation basic and needs some more enhancements. The server side will be shared by Python implementation as well.
One notable hack in this PR is that it runs user's
foreachBatch()
with regular(legacy) DataFrame, rather than setting up remote Spark connect session and connect DataFrame.Why are the changes needed?
Adds foreachBatch() support in Scala Spark Connect.
Does this PR introduce any user-facing change?
Yes. Adds foreachBatch() API
How was this patch tested?