
[SPARK-44398][CONNECT] Scala foreachBatch API #41969

Closed
wants to merge 7 commits

Conversation

rangadi
Contributor

@rangadi rangadi commented Jul 12, 2023

This implements Scala foreachBatch(). The implementation is basic and needs further enhancements. The server side will be shared with the Python implementation as well.

One notable hack in this PR is that it runs the user's foreachBatch() with a regular (legacy) DataFrame, rather than setting up a remote Spark Connect session and a Connect DataFrame.

Why are the changes needed?

Adds foreachBatch() support in Scala Spark Connect.

Does this PR introduce any user-facing change?

Yes. Adds foreachBatch() API

How was this patch tested?

  • A simple unit test.
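
For reference, the API being added mirrors the existing `DataStreamWriter.foreachBatch` signature from Structured Streaming, taking a `(DataFrame, Long) => Unit` function that runs once per micro-batch. A minimal usage sketch from the client side (the `rate` test source and the `sc://localhost` endpoint are illustrative assumptions, not part of this PR):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchExample {
  def main(args: Array[String]): Unit = {
    // Connect to a Spark Connect server (endpoint is an example).
    val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()

    // "rate" is a built-in test source that emits rows at a fixed rate.
    val stream = spark.readStream.format("rate").load()

    val query = stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // The user function receives each micro-batch as a DataFrame.
        println(s"Batch $batchId has ${batch.count()} rows")
      }
      .start()

    query.awaitTermination()
  }
}
```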

@rangadi
Contributor Author

rangadi commented Jul 12, 2023

cc: @bogao007

Contributor

@bogao007 bogao007 left a comment


LGTM overall; left some questions regarding the hack.

* Handles setting up Scala remote session and other Spark Connect environment and then
* runs the provided foreachBatch function `fn`.
*
* HACK ALERT: This version does not actually set up Spark connect. Directly passes the DataFrame,
Contributor


I have 2 major questions regarding this:

  • Is the missing part about setting up a Spark Connect session, converting the legacy DataFrame to a Spark Connect DataFrame, and executing the function inside that session? Do we have any Scala example of setting up a Spark Connect session on the server side and using it?
  • When is getDataFrameOrThrow() called? Is it only needed for Python, or do we also need to look up the DataFrame by ID inside the Spark Connect session for Scala?

Contributor Author


  • Yes, it is about setting up the Spark remote session. I don't think there are examples of doing that in Scala.
  • Not sure about the second one. Usually df.sparkSession gives access to the session.
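
To make the hack under discussion concrete, here is a rough sketch of what the server-side wrapper could look like (`ForeachBatchHelper` and its method name are illustrative, not necessarily the actual classes in this PR): the user's function is invoked directly with the server-side (legacy) DataFrame, with no remote session or Connect DataFrame in between.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of the server-side wrapper described above.
object ForeachBatchHelper {
  type FnType = (DataFrame, Long) => Unit

  def scalaForeachBatchWrapper(fn: FnType): FnType = {
    (df: DataFrame, batchId: Long) => {
      // HACK: pass the legacy (server-side) DataFrame straight to the
      // user's function instead of creating a remote Spark Connect
      // session and a Connect DataFrame backed by it.
      fn(df, batchId)
    }
  }
}
```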

@rangadi rangadi marked this pull request as ready for review July 12, 2023 23:46
@rangadi
Contributor Author

rangadi commented Jul 12, 2023

@zhenlineo could you review this? I am going to get the test working right.
We want to merge this ASAP since we are trying to get multiple streams of work done before the 3.5 branch cut.

@xinrong-meng
Member

Merged to master, thanks!

ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024

Closes apache#41969 from rangadi/feb-scala.

Authored-by: Raghu Angadi <raghu.angadi@databricks.com>
Signed-off-by: Xinrong Meng <xinrong@apache.org>