[SPARK-48743][SQL][SS] MergingSessionIterator should better handle when getStruct returns null #47134

WweiL · 2024-06-27T22:28:45Z

What changes were proposed in this pull request?

The getStruct() call in MergingSessionsIterator.initialize can return null. When it does, the copy() invoked on the result throws a NullPointerException.

We see an exception thrown there:

java.lang.NullPointerException: <Redacted Exception Message>
	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.initialize(MergingSessionsIterator.scala:121)
	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.<init>(MergingSessionsIterator.scala:130)
	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1(MergingSessionsExec.scala:93)
	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1$adapted(MergingSessionsExec.scala:72)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:920)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:920)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:189)
	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:154)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:148)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:101)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:984)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:987)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:879)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

It is still not clear why that field can be null, but in general Spark should not throw NPEs. So this PR proposes to wrap it with SparkException.internalError with more details.
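A minimal sketch of the guard described above, for illustration only. It does not use Spark's classes: `Row`, `InternalErrorException`, `NullGuardSketch` and `copySessionStruct` are hypothetical stand-ins for the internal row type, the exception produced by SparkException.internalError, and the guarded call site. The error message text is also an assumption, not the message used in the patch.

```scala
// Hypothetical stand-in for an internal row; not Spark's InternalRow.
final class Row(private val values: Array[Any]) {
  // Like the getStruct(0, 2) call in the diff, this can return null.
  def getStruct(ordinal: Int, numFields: Int): Row =
    values(ordinal).asInstanceOf[Row]

  def copy(): Row = new Row(values.clone())

  override def toString: String = values.mkString("Row(", ", ", ")")
}

// Hypothetical stand-in for the exception built by SparkException.internalError.
final class InternalErrorException(msg: String) extends RuntimeException(msg)

object NullGuardSketch {
  // Check for null before calling copy(), and fail with a descriptive
  // error that includes the offending row instead of a bare NPE.
  def copySessionStruct(session: Row): Row = {
    val groupingSession = session.getStruct(0, 2)
    if (groupingSession == null) {
      throw new InternalErrorException(
        s"Session struct is null; session row: $session")
    }
    groupingSession.copy()
  }

  def main(args: Array[String]): Unit = {
    val ok = new Row(Array[Any](new Row(Array[Any](1L, 2L))))
    println(copySessionStruct(ok))

    val bad = new Row(Array[Any](null))
    try copySessionStruct(bad)
    catch { case e: InternalErrorException => println(e.getMessage) }
  }
}
```

The point of the sketch is only that the failure now names the row that produced the null struct, which should make a hard-to-reproduce case easier to diagnose.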

Why are the changes needed?

Improvement: fail with a descriptive internal error instead of a bare NPE.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This issue is hard to reproduce. The change should not cause any harm.

Was this patch authored or co-authored using generative AI tooling?

No

WweiL · 2024-06-27T22:29:21Z

cc @HeartSaVioR @sigmod PTAL! Thank you!

HeartSaVioR

Only minor comment.

HeartSaVioR · 2024-07-04T04:00:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/MergingSessionsIterator.scala

@@ -118,7 +118,9 @@ class MergingSessionsIterator(
      val inputRow = inputIterator.next()
      nextGroupingKey = groupingWithoutSessionProjection(inputRow).copy()
      val session = sessionProjection(inputRow)
-      nextGroupingSession = session.getStruct(0, 2).copy()
+      val groupingSession = session.getStruct(0, 2)


There is another place with the same potential NPE: processCurrentSortedGroup(). We may want to extract the logic so that the same handling applies to both places.

I guess it's OK to set errorOnIterator = true and throw internalError "here" instead of wrapping the call to initialize() in try-catch.
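The suggestion above could look roughly like the following sketch, assuming a simplified iterator over strings. `GuardedIterator` and `requireNonNull` are hypothetical names, and the way `errorOnIterator` gates `hasNext` here is an assumption for illustration, not a copy of MergingSessionsIterator.

```scala
// Hypothetical, simplified iterator; not the real MergingSessionsIterator.
final class GuardedIterator(input: Iterator[String]) extends Iterator[String] {
  // Once set, the iterator is considered invalid and yields nothing more.
  private var errorOnIterator = false

  // Single shared helper, so every call site handles null the same way:
  // mark the iterator as broken, then fail with a descriptive error.
  private def requireNonNull(value: String, where: String): String = {
    if (value == null) {
      errorOnIterator = true
      throw new IllegalStateException(s"Unexpected null value in $where")
    }
    value
  }

  override def hasNext: Boolean = !errorOnIterator && input.hasNext

  override def next(): String = requireNonNull(input.next(), "next()")
}

object GuardedIteratorDemo {
  def main(args: Array[String]): Unit = {
    val it = new GuardedIterator(Iterator("a", null, "b"))
    println(it.next())
    try it.next()
    catch { case e: IllegalStateException => println(e.getMessage) }
    // The flag invalidates the iterator even though input remains.
    println(it.hasNext)
  }
}
```

Setting the flag inside the helper keeps the two call sites identical and avoids a try-catch around the initialization call.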

HeartSaVioR

Another minor.

HeartSaVioR · 2024-07-09T03:06:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/MergingSessionsIterator.scala

@@ -118,7 +118,11 @@ class MergingSessionsIterator(
      val inputRow = inputIterator.next()
      nextGroupingKey = groupingWithoutSessionProjection(inputRow).copy()
      val session = sessionProjection(inputRow)
-      nextGroupingSession = session.getStruct(0, 2).copy()
+      val groupingSession = session.getStruct(0, 2)
+      if (groupingSession == null) {


Let's be conservative and set errorOnIterator = true to invalidate the iterator. Please apply the same to the one below as well.

ah thanks! addressed

HeartSaVioR

+1 pending CI

HyukjinKwon · 2024-07-09T10:51:41Z

Merged to master.

Closes apache#47134 from WweiL/SPARK-48743-mergingSessionIterator-null-init.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

this?

bbe0f27

github-actions bot added the SQL label Jun 27, 2024

WweiL changed the title ~~this?~~ [SPARK-48743][SS]MergingSessionIterator should better handle when getStruct returns null Jun 27, 2024

update

56438a9

HeartSaVioR reviewed Jul 4, 2024

HeartSaVioR changed the title ~~[SPARK-48743][SS]MergingSessionIterator should better handle when getStruct returns null~~ [SPARK-48743][SQL][SS] MergingSessionIterator should better handle when getStruct returns null Jul 4, 2024

address Jungtaek's suggestion

1cea41e

WweiL requested a review from HeartSaVioR July 8, 2024 21:35

HeartSaVioR reviewed Jul 9, 2024

update

42e1c78

WweiL requested a review from HeartSaVioR July 9, 2024 07:15

HeartSaVioR approved these changes Jul 9, 2024

HyukjinKwon approved these changes Jul 9, 2024

HyukjinKwon closed this in 18f2450 Jul 9, 2024
