feat: add Spark streaming support #8868
I can't seem to be able to modify the description - can you add the window op GitHub issue? It's #8847

Going to add some more details here based on my investigation, because I can't directly modify the issue. I also figured it may be easier to track discussions over time.

High-level observations & summaries
I'm going to use "spark streaming" below to refer to "spark in streaming mode," as opposed to the Spark Streaming API, which is no longer maintained in favor of the Spark Structured Streaming API.

We're looking to support Spark Structured Streaming via Spark SQL rather than the DataFrame API, because our current pyspark backend is a string-generating backend. This allows us to leverage the existing work that we have done for spark batch. Unfortunately, Spark SQL for streaming is not well documented or widely used: I have not come across companies that use Spark SQL for streaming, and Stack Overflow posts on Spark SQL for streaming are also sparse.

Ops
There are some operations not supported with streaming dataframes/datasets:
Sources
Sinks
Task breakdown
A list of tasks in order to support spark streaming:
My understanding is that over aggregation, in the way that is supported by Flink, needs to be accomplished with arbitrary stateful operations in spark streaming (see the sketch after the example workflow below).

Example workflow
I have set up a streaming window aggregation example using pyspark. This example reads from an upstream Kafka source, computes a windowed aggregation, then writes the output into a downstream Kafka sink. I'm using a Kafka sink here because it's easy to set up and does not require spinning up additional infrastructure. This is very similar to the example that we were using for Flink and is (somewhat) representative of a real-world use case. Detailed steps follow (with notes):
Step 1: Create the Spark session, pulling in the Kafka connector package.

```python
from pyspark.sql import SparkSession

session = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
    .getOrCreate()
```
Step 2: Read from the Kafka source, parse the JSON payload into columns, and define a watermark on the event-time column.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    LongType,
    StructField,
    StructType,
    TimestampType,
)

schema = StructType(
    [
        StructField('createTime', TimestampType(), True),
        StructField('orderId', LongType(), True),
        StructField('payAmount', DoubleType(), True),
        StructField('payPlatform', IntegerType(), True),
        StructField('provinceId', IntegerType(), True),
    ]
)

streaming_df = session.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "payment_msg") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
    .select("parsed_value.*") \
    .withWatermark("createTime", "10 seconds")
```
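Not part of the original example, but a quick way to sanity-check the parsed stream before wiring up the aggregation is to print a few micro-batches to the console sink; `debug_query` is just an illustrative name.

```python
# Illustrative sanity check (not in the original workflow): stream parsed
# records to the console sink to confirm the schema and event-time column.
debug_query = (
    streaming_df.writeStream
    .format("console")
    .outputMode("append")
    .option("truncate", "false")
    .start()
)
debug_query.awaitTermination(30)  # let it run for ~30 seconds
debug_query.stop()
```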
Step 3: Register the streaming DataFrame as a temp view so that it can be queried with Spark SQL.

```python
streaming_df.createOrReplaceTempView("streaming_df")
```
Step 4: Compute a sliding window aggregation (1-minute windows sliding every 30 seconds) in Spark SQL.

```python
window_agg = session.sql("""
    SELECT
        window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
    FROM streaming_df
    GROUP BY provinceId, window(createTime, '1 minute', '30 seconds')
""")
```
This is what steps 4-5 would look like in a single query. I think when we compile it using Ibis, it will most likely be a two-step process (see the sketch below).

```python
window_agg = session.sql("""
    SELECT to_json(named_struct(
        'start', window.start,
        'end', window.end,
        'provinceId', provinceId,
        'totalPayAmount', sum(payAmount)
    )) as value
    FROM streaming_df
    GROUP BY provinceId, window(createTime, '1 minute', '30 seconds')
""")
```
Step 5: Write the results into the downstream Kafka sink.

```python
result_df = (window_agg
    .writeStream
    .outputMode("append")
    .format("kafka")
    .option("checkpointLocation", "checkpoint")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "sink")
    .start())
```
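On the earlier point that Flink-style over aggregation would need arbitrary stateful operations: below is a rough sketch of what that looks like with applyInPandasWithState (PySpark >= 3.4). This is my own illustration, not from the issue; `running_total` and the schemas are made-up names, and a real over aggregation would additionally need event-time ordering and window bounds on top of this.

```python
# Sketch of an arbitrary stateful operation: a running per-province total
# maintained across micro-batches with applyInPandasWithState.
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def running_total(
    key: Tuple[int],
    pdfs: Iterator[pd.DataFrame],
    state: GroupState,
) -> Iterator[pd.DataFrame]:
    # Restore the previous total for this provinceId, defaulting to zero.
    (total,) = state.get if state.exists else (0.0,)
    for pdf in pdfs:
        total += float(pdf["payAmount"].sum())
    state.update((total,))
    yield pd.DataFrame({"provinceId": [key[0]], "totalPayAmount": [total]})

stateful = streaming_df.groupBy("provinceId").applyInPandasWithState(
    running_total,
    outputStructType="provinceId INT, totalPayAmount DOUBLE",
    stateStructType="total DOUBLE",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)
```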
Helpful reads:

Weekly Update, 2024-04-03

Yes, it's the same. I grouped everything under op logic.

Weekly Update, 2024-04-25

Weekly update, 5/2/2024

Weekly update, 5/9/24

Weekly update, 5/15/24

Weekly update, 5/23/24

Weekly update, 5/30/24

@chloeh13q can this be closed?

Closing as completed for now.
Is your feature request related to a problem?
This meta-issue is for the Spark Streaming epic, by which we plan to add spark streaming support in Ibis. This issue is meant to contain general information and notes that we collect while breaking down the epic. More specifically, we will use this doc to record and share:

Once we have clarity on the high-level tasks, they will be assigned to individual owners, who will then create the corresponding GitHub issues.
High-level design decisions
We should build on the existing pyspark backend rather than adding a new backend.

Initial exploration (~~Ongoing~~ Done)
@chloeh13q has been experimenting with spark streaming. Initial findings: we will start with window aggregation and asof join, as those are the ones that have been decided as priority.

Update: See the comment below for the summary of the exploration outcome.
Breakdown into issues
As mentioned above, our initial understanding is that Spark SQL uses the same syntax for both batch and streaming, which is why anything at the intersection of the two should just work. The majority of the work in this epic will be to add support for streaming-specific features in pyspark/streaming.
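As a quick illustration of that point (my own sketch, reusing `session` and `streaming_df` from the example workflow earlier in the thread; the `/tmp/payments` path and `payments` view name are made up), the identical SQL text can be executed against a batch view or a streaming view:

```python
# The same SQL string works in batch and streaming mode; only the source
# registration and the execution mode differ. (Illustrative sketch.)
agg_sql = """
    SELECT provinceId, sum(payAmount) AS totalPayAmount
    FROM payments
    GROUP BY provinceId
"""

# Batch: a static DataFrame registered under `payments` executes eagerly.
batch_df = session.read.json("/tmp/payments")  # hypothetical path
batch_df.createOrReplaceTempView("payments")
session.sql(agg_sql).show()

# Streaming: re-register a streaming DataFrame under the same name and the
# same string compiles into a streaming plan (started via writeStream).
streaming_df.createOrReplaceTempView("payments")
streaming_agg = session.sql(agg_sql)
```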
Note: The to-do list below includes all the tasks that need to be finished to support spark streaming with sufficient feature coverage. We do not expect to complete all of these in a single quarter.
- [P0] Validate the existing pyspark-specific tests in streaming mode
- [P0] Connect to streaming sources and sinks
- Operation logic
  - [P0] Define watermark on a streaming table
  - [P0] As-of joins
  - [P2] UDFs/Data enrichment
  - [P2] Nested schema support
- [P1, might be dropped if deemed not valuable enough] Streaming-specific test suite
- [P1] window join support for pyspark/streaming
- [P2] semi/anti window join support for pyspark/streaming
- [P2] time travel support for pyspark/streaming

Note that the following streaming queries are not supported in Spark:
Note: Issues created for the items above will be linked here.

What version of ibis are you running?
8.0.0
What backend(s) are you using, if any?
Spark