
[SPARK-23014][SS] Fully remove V1 memory sink. #24403

Closed
wants to merge 9 commits into master from gaborgsomogyi/SPARK-23014

Conversation

gaborgsomogyi (Contributor) commented Apr 18, 2019

What changes were proposed in this pull request?

There is already a MemorySink V2, so V1 can be removed. In this PR I've removed it completely.
What this PR contains:

  • V1 memory sink removal
  • V2 memory sink renamed to become the only implementation
  • Since DSv2 reports exceptions in a chained format (linking them through the cause field), I've made the Python side compliant (see the sketch after this list)
  • Adapted all the tests
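
For context on the third bullet: the chained format means the root error (for instance a ZeroDivisionError raised inside a UDF) no longer appears in the top-level message and has to be recovered by following the cause links. A minimal sketch of that traversal on a Py4J-wrapped Java Throwable; the helper name is hypothetical, not the actual pyspark/sql/utils.py code:

# Hypothetical helper: walk a Py4J-wrapped Java Throwable and collect the
# message of every link in the cause chain, so the root error can be surfaced.
def collect_cause_messages(java_throwable):
    messages = []
    current = java_throwable
    while current is not None:
        messages.append(current.toString())  # class name + message of this link
        current = current.getCause()         # returns None at the end of the chain
    return messages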

How was this patch tested?

Existing unit tests.

gaborgsomogyi (Contributor, Author)

Just for the sake of understanding why the Python adaptation was needed.
With the V1 memory sink the following exception arrived (the Python traceback is embedded directly in the top-level message):

Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 428, in main
    process()
  File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 423, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 457, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 446, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 86, in <lambda>
    return lambda *a: f(*a)
  File "/Users/gaborsomogyi/spark/python/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/Users/gaborsomogyi/spark/python/pyspark/sql/tests/test_streaming.py", line 217, in <lambda>
    bad_udf = udf(lambda x: 1 / 0)
ZeroDivisionError: integer division or modulo by zero

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:25)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:748)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:852)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:852)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1349)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
=== Streaming Query ===
Identifier: this_query [id = fcda74c8-05e0-4c74-ac45-3444f3e50c63, runId = 7534a374-5d15-4986-bae1-8fa0664c647b]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming]: {"logOffset":0}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
Project [<lambda>(value#4) AS <lambda>(value)#17]
+- StreamingExecutionRelation FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming], [value#4]

With the V2 memory sink the following exception arrived (here additional exceptions, for instance the ZeroDivisionError, are linked to the top-level one through the cause field rather than embedded in its message):

Writing job aborted.
=== Streaming Query ===
Identifier: this_query [id = d37ca2ea-edb7-4670-b4cc-75b0d704496e, runId = 587d62da-8355-4f66-a9d8-708c154ef953]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming]: {"logOffset":0}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
WriteToMicroBatchDataSource org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@1a3691ef
+- Project [<lambda>(value#18) AS <lambda>(value)#32]
   +- StreamingExecutionRelation FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming], [value#18]
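
Since the ZeroDivisionError is now only reachable through the cause chain, a test that previously matched the error text against the top-level message has to search the whole chain instead. A sketch of such a check under assumed names (illustrative, not the actual test_streaming.py change):

# Hypothetical assertion helper: search every link of a chained Java
# exception for the expected text instead of only the top-level message.
def chain_contains(java_throwable, expected_text):
    current = java_throwable
    while current is not None:
        if expected_text in (current.getMessage() or ""):  # getMessage() may be null
            return True
        current = current.getCause()
    return False

# usage in a test, e.g.: self.assertTrue(chain_contains(e, "ZeroDivisionError"))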

SparkQA commented Apr 18, 2019

Test build #104701 has finished for PR 24403 at commit 7b7d651.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 18, 2019

Test build #104702 has finished for PR 24403 at commit 247caad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gaborgsomogyi changed the title from [SPARK-23014][SS] Fully remove V1 memory sink. to [WIP][SPARK-23014][SS] Fully remove V1 memory sink. (Apr 18, 2019)
SparkQA commented Apr 18, 2019

Test build #104705 has finished for PR 24403 at commit d65997d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 18, 2019

Test build #104713 has finished for PR 24403 at commit f693e2b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 18, 2019

Test build #104714 has finished for PR 24403 at commit 4bd49c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gaborgsomogyi changed the title from [WIP][SPARK-23014][SS] Fully remove V1 memory sink. to [SPARK-23014][SS] Fully remove V1 memory sink. (Apr 18, 2019)
gaborgsomogyi (Contributor, Author)

cc @jose-torres since you started this originally, and @HyukjinKwon because of the Python change.

gaborgsomogyi (Contributor, Author)

Not much movement here; cc @vanzin

vanzin (Contributor) left a comment


Just minor things.

Review threads (all resolved):

  • core/src/main/scala/org/apache/spark/TestUtils.scala (2 threads, outdated)
  • python/pyspark/sql/streaming.py
  • python/pyspark/sql/tests/test_streaming.py (2 threads, outdated)
  • python/pyspark/sql/utils.py (outdated)
SparkQA commented Apr 29, 2019

Test build #104992 has finished for PR 24403 at commit 728fa06.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #104999 has finished for PR 24403 at commit 41ffa5d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #105000 has finished for PR 24403 at commit 8f432f1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #104995 has finished for PR 24403 at commit 91d35a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #105001 has finished for PR 24403 at commit 2daaae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Apr 29, 2019

Merging to master.

vanzin closed this in fb6b19a (Apr 29, 2019)
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019

Closes apache#24403 from gaborgsomogyi/SPARK-23014.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>