
[SPARK-23014][SS] Fully remove V1 memory sink. #24403

Closed
wants to merge 9 commits into master from gaborgsomogyi/SPARK-23014

Conversation

gaborgsomogyi (Contributor) commented Apr 18, 2019

What changes were proposed in this pull request?

There is already a MemorySink V2, so V1 can be removed. In this PR I've removed it completely.
What this PR contains:

  • V1 memory sink removal
  • V2 memory sink renamed to become the only implementation
  • Since DSv2 reports exceptions in a chained format (linking them through the cause field), I've made the Python side compliant (see the sketch after this list)
  • Adapted all the tests
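
For context on the third bullet: the chained format means the root error (for instance a ZeroDivisionError raised inside a UDF) no longer appears in the top-level message and has to be recovered by following the cause links. A minimal sketch of that traversal on a Py4J-wrapped Java Throwable; the helper name is hypothetical, not the actual pyspark/sql/utils.py code:

# Hypothetical helper: walk a Py4J-wrapped Java Throwable and collect the
# message of every link in the cause chain, so the root error can be surfaced.
def collect_cause_messages(java_throwable):
    messages = []
    current = java_throwable
    while current is not None:
        messages.append(current.toString())  # class name + message of this link
        current = current.getCause()         # returns None at the end of the chain
    return messages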

How was this patch tested?

Existing unit tests.

gaborgsomogyi (Contributor, Author)

Just for the sake of understanding why the Python adaptation was needed.
With the V1 memory sink the following exception arrived (the Python traceback is embedded directly in the top-level message):

Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 428, in main
    process()
  File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 423, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 457, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 446, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 86, in <lambda>
    return lambda *a: f(*a)
  File "/Users/gaborsomogyi/spark/python/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/Users/gaborsomogyi/spark/python/pyspark/sql/tests/test_streaming.py", line 217, in <lambda>
    bad_udf = udf(lambda x: 1 / 0)
ZeroDivisionError: integer division or modulo by zero

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:25)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:748)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:852)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:852)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1349)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
=== Streaming Query ===
Identifier: this_query [id = fcda74c8-05e0-4c74-ac45-3444f3e50c63, runId = 7534a374-5d15-4986-bae1-8fa0664c647b]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming]: {"logOffset":0}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
Project [<lambda>(value#4) AS <lambda>(value)#17]
+- StreamingExecutionRelation FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming], [value#4]

With the V2 memory sink the following exception arrived (here additional exceptions, for instance the ZeroDivisionError, are linked to the top-level one through the cause field rather than embedded in its message):

Writing job aborted.
=== Streaming Query ===
Identifier: this_query [id = d37ca2ea-edb7-4670-b4cc-75b0d704496e, runId = 587d62da-8355-4f66-a9d8-708c154ef953]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming]: {"logOffset":0}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
WriteToMicroBatchDataSource org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@1a3691ef
+- Project [<lambda>(value#18) AS <lambda>(value)#32]
   +- StreamingExecutionRelation FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming], [value#18]
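
Since the ZeroDivisionError is now only reachable through the cause chain, a test that previously matched the error text against the top-level message has to search the whole chain instead. A sketch of such a check under assumed names (illustrative, not the actual test_streaming.py change):

# Hypothetical assertion helper: search every link of a chained Java
# exception for the expected text instead of only the top-level message.
def chain_contains(java_throwable, expected_text):
    current = java_throwable
    while current is not None:
        if expected_text in (current.getMessage() or ""):  # getMessage() may be null
            return True
        current = current.getCause()
    return False

# usage in a test, e.g.: self.assertTrue(chain_contains(e, "ZeroDivisionError"))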

SparkQA commented Apr 18, 2019

Test build #104701 has finished for PR 24403 at commit 7b7d651.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 18, 2019

Test build #104702 has finished for PR 24403 at commit 247caad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gaborgsomogyi changed the title from [SPARK-23014][SS] Fully remove V1 memory sink. to [WIP][SPARK-23014][SS] Fully remove V1 memory sink. (Apr 18, 2019)
SparkQA commented Apr 18, 2019

Test build #104705 has finished for PR 24403 at commit d65997d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 18, 2019

Test build #104713 has finished for PR 24403 at commit f693e2b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 18, 2019

Test build #104714 has finished for PR 24403 at commit 4bd49c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gaborgsomogyi changed the title from [WIP][SPARK-23014][SS] Fully remove V1 memory sink. to [SPARK-23014][SS] Fully remove V1 memory sink. (Apr 18, 2019)
gaborgsomogyi (Contributor, Author)

cc @jose-torres since you started this originally, and @HyukjinKwon because of the Python change.

gaborgsomogyi (Contributor, Author)

Not much movement here; cc @vanzin

vanzin (Contributor) left a comment


Just minor things.

Review threads (all resolved):

  • core/src/main/scala/org/apache/spark/TestUtils.scala (2 threads, outdated)
  • python/pyspark/sql/streaming.py
  • python/pyspark/sql/tests/test_streaming.py (2 threads, outdated)
  • python/pyspark/sql/utils.py (outdated)
SparkQA commented Apr 29, 2019

Test build #104992 has finished for PR 24403 at commit 728fa06.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #104999 has finished for PR 24403 at commit 41ffa5d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #105000 has finished for PR 24403 at commit 8f432f1.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #104995 has finished for PR 24403 at commit 91d35a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #105001 has finished for PR 24403 at commit 2daaae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Apr 29, 2019

Merging to master.

vanzin closed this in fb6b19a (Apr 29, 2019)
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019

Closes apache#24403 from gaborgsomogyi/SPARK-23014.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>