[SPARK-48567][SS][FOLLOWUP] StreamingQuery.lastProgress should return the actual StreamingQueryProgress #47470

WweiL · 2024-07-24T06:47:02Z

This reverts commit d067fc6, which itself reverted 042804a, and so effectively brings 042804a back. 042804a failed the 3.5 client <> 4.0 server test, but it was decided in #47468 to turn that test off for cross-version testing.

What changes were proposed in this pull request?

This PR was created after the discussion in the closed PR #46886.
There I was trying to fix a bug (in Spark Connect, query.lastProgress lacks numInputRows, inputRowsPerSecond, and processedRowsPerSecond), and we concluded that what is proposed in this PR should be the ultimate fix.

In Python, for both classic Spark and Spark Connect, the return type of lastProgress is Dict (and that of recentProgress is List[Dict]), but in Scala it is the actual StreamingQueryProgress object:

spark/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

Lines 94 to 101 in 1a5d22a

    
    def recentProgress: Array[StreamingQueryProgress]

    /**
     * Returns the most recent [[StreamingQueryProgress]] update of this streaming query.
     *
     * @since 2.1.0
     */
    def lastProgress: StreamingQueryProgress

This API discrepancy causes confusion: in Scala, users can write query.lastProgress.batchId, while in Python they have to write query.lastProgress["batchId"].

This PR makes StreamingQuery.lastProgress return the actual StreamingQueryProgress (and StreamingQuery.recentProgress return List[StreamingQueryProgress]).

To avoid a breaking change, we make StreamingQueryProgress a subclass of dict, so existing code that uses dictionary access (e.g. query.lastProgress["id"]) still works.
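The compatibility idea above can be illustrated with a minimal sketch. The class name ProgressSketch and its two properties are hypothetical and chosen only for illustration; this is not the PySpark implementation, which may be structured differently. It only shows how a dict subclass can serve attribute-style and key-style access at the same time.

```python
import json
from typing import Any, Dict


class ProgressSketch(dict):
    """Hypothetical stand-in for StreamingQueryProgress (not the PySpark class)."""

    def __init__(self, json_dict: Dict[str, Any]) -> None:
        # Store the progress fields in the underlying dict so that
        # existing key-based access keeps working unchanged.
        super().__init__(json_dict)

    @property
    def id(self) -> str:
        # Attribute-style access, mirroring the Scala-style usage.
        return self["id"]

    @property
    def batchId(self) -> int:
        return self["batchId"]


progress = ProgressSketch({"id": "abc", "batchId": 7})

# New style: attribute access.
new_style = progress.batchId
# Old style: dictionary access still works because the object is a dict.
old_style = progress["batchId"]
# Code that serializes the progress as a plain dict is unaffected as well.
round_tripped = json.loads(json.dumps(progress))
```

Because the object is still a dict, callers that iterate over keys or pass it to json.dumps should not need to change; only code that checks for the exact type dict (rather than using isinstance) would notice a difference.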

Why are the changes needed?

API parity between Python and Scala.

Does this PR introduce any user-facing change?

Yes. StreamingQuery.lastProgress now returns the actual StreamingQueryProgress (and StreamingQuery.recentProgress returns List[StreamingQueryProgress]).

How was this patch tested?

Added unit test

Was this patch authored or co-authored using generative AI tooling?

No

WweiL · 2024-07-24T06:47:10Z

cc @HyukjinKwon

HyukjinKwon · 2024-07-24T09:57:27Z

Merged to master.

Closes apache#47470 from WweiL/bring-back-lastProgress.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

Reapply "[SPARK-48567][SS] StreamingQuery.lastProgress should return …

b97798d

…the actual StreamingQueryProgress" This reverts commit d067fc6.

github-actions bot added SQL STRUCTURED STREAMING PYTHON CONNECT labels Jul 24, 2024

HyukjinKwon approved these changes Jul 24, 2024

HyukjinKwon closed this in 22eb6c4 Jul 24, 2024
