
[SPARK-27812][CORE] Explicit System.exit after job's main #25785

Closed

Conversation

igorcalabria (Contributor)

What changes were proposed in this pull request?

Explicitly call System.exit after the user's main code runs.

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

If there are non-daemon threads running, the JVM won't shut down after the driver's main exits, so the shutdown hooks never run. This means that any job running on Kubernetes that doesn't explicitly call SparkSession#stop will hang. I believe that expecting users to include this call in every job is unreasonable, since they also need to remember to add an UncaughtExceptionHandler; without one, any exception thrown on the driver's side will also hang the process.

Since I'm not that familiar with Spark's codebase, this could be a terrible idea, and I'm hoping some of you can propose a better solution if that's the case. My educated guess is that there's no expectation that the application will keep running after the declared main returns; the only difference is that we now call System.exit, so shutdown hooks run regardless of stray non-daemon threads.
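
As a minimal, self-contained sketch of the JVM behaviour described above (plain JVM code, not the actual Spark change): a lingering non-daemon thread keeps the process alive after main returns, while an explicit System.exit still runs the registered shutdown hooks before the process terminates.

```scala
object NonDaemonThreadDemo {
  def main(args: Array[String]): Unit = {
    // Shutdown hooks only run once the JVM actually begins shutting down.
    Runtime.getRuntime.addShutdownHook(new Thread(() => println("shutdown hook ran")))

    // A non-daemon thread that never finishes, similar in spirit to the
    // OkHttp websocket thread created by the Kubernetes client.
    val lingering = new Thread(() => while (true) Thread.sleep(1000))
    lingering.setDaemon(false)
    lingering.start()

    println("main finished")
    // Without the next line the JVM never exits and the hook never runs;
    // with it, the hook runs and the process exits despite `lingering`.
    System.exit(0)
  }
}
```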

Does this PR introduce any user-facing change?

I'm guessing no. It does not introduce anything the user should notice (besides the fix).

How was this patch tested?

Took the SparkPi example and removed the spark.stop() call. The expected behaviour is that the driver exits after the job finishes, but it doesn't. This patch fixes that.

fixes case where non-daemon threads prevent application shutdown
@dongjoon-hyun
Member

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27812][CORE][Kubernetes] Explicit System.exit after job's main [SPARK-27812][CORE] Explicit System.exit after job's main Sep 13, 2019
@dongjoon-hyun
Member

Hi, @igorcalabria. I know you are working on K8s, but the component tag is based on the code the PR touches. I adjusted the PR title.

@SparkQA commented Sep 13, 2019

Test build #110574 has finished for PR 25785 at commit cae825c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 14, 2019

Wait, shutdown hooks definitely run when the JVM terminates, unless it is forcibly killed or crashes. I am not sure this is the issue. You don't want to remove a call to stop(); it is necessary. If your app creates non-daemon threads, it has to ensure they terminate, or else indeed any Java application won't stop after main() exits.

@igorcalabria
Contributor Author

The main problem is that Spark itself (the k8s client) is creating the non-daemon thread.

@srowen (Member) commented Sep 14, 2019

I see, can that thread be a daemon? If System.exit is viable (i.e. immediately stopping daemon threads) then it should be. But if not, then yeah such a thread needs to be shut down cleanly somehow during the shutdown process. This could be a shutdown hook.
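
For illustration, closing such a client from a shutdown hook could look roughly like the sketch below; `client` is just a placeholder for whatever object owns the non-daemon thread (the fabric8 KubernetesClient is Closeable), not an actual Spark field.

```scala
import java.io.Closeable

// Register a JVM shutdown hook that closes the client, which in turn
// stops its background threads so the JVM can exit cleanly.
def closeOnShutdown(client: Closeable): Unit = {
  Runtime.getRuntime.addShutdownHook(new Thread(() => client.close()))
}
```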

@dongjoon-hyun
Member

Retest this please.

@SparkQA commented Sep 14, 2019

Test build #110600 has finished for PR 25785 at commit cae825c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment


Hi, @igorcalabria.
According to the test result, this PR seems to break the Spark Thrift Server module, at least at the UT level. The following is one example failure. To be considered mergeable, this PR should pass all UTs. Could you focus on fixing that module?

$ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *.SingleSessionSuite"
...
[info] Tests: succeeded 0, failed 3, canceled 0, ignored 0, pending 0
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 3, Failed 3, Errors 0, Passed 0
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.thriftserver.SingleSessionSuite
[error] (hive-thriftserver/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 258 s, completed Sep 14, 2019 4:16:09 PM

@igorcalabria
Contributor Author

> I see, can that thread be a daemon? If System.exit is viable (i.e. immediately stopping daemon threads) then it should be. But if not, then yeah such a thread needs to be shut down cleanly somehow during the shutdown process. This could be a shutdown hook.

I don't think there's an option to create a daemon thread in this case. This was already discussed in square/okhttp#3339.

I'm sorry, but I didn't understand what you meant about the viability of System.exit. There's already shutdown logic for the Kubernetes client (https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L115), but it is only called if the shutdown hook runs or if the user calls SparkSession#stop(). Calling System.exit is just a way to ensure that the hooks run after the user's main code. Everything is still cleaned up nicely, since System.exit does not bypass the registered hooks.
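
To make the shape of the change concrete, here is a rough, hypothetical sketch (not the actual patch) of running the user's main and then forcing the hooks to run:

```scala
// Hypothetical wrapper around the user's application; names are illustrative.
def runUserMain(userMain: Array[String] => Unit, args: Array[String]): Unit = {
  var exitCode = 0
  try {
    userMain(args)
  } catch {
    case t: Throwable =>
      t.printStackTrace()
      exitCode = 1
  }
  // System.exit does not bypass registered shutdown hooks; it triggers them,
  // so the existing k8s client cleanup still runs. The call only prevents
  // stray non-daemon threads from keeping the JVM alive afterwards.
  System.exit(exitCode)
}
```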

> Hi, @igorcalabria.
> According to the test result, this PR seems to break the Spark Thrift Server module, at least at the UT level. The following is one example failure. To be considered mergeable, this PR should pass all UTs. Could you focus on fixing that module?
>
> $ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *.SingleSessionSuite"
> ...
> [info] Tests: succeeded 0, failed 3, canceled 0, ignored 0, pending 0
> [info] *** 3 TESTS FAILED ***
> [error] Failed: Total 3, Failed 3, Errors 0, Passed 0
> [error] Failed tests:
> [error] 	org.apache.spark.sql.hive.thriftserver.SingleSessionSuite
> [error] (hive-thriftserver/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 258 s, completed Sep 14, 2019 4:16:09 PM

@dongjoon-hyun I'll take a look.

@srowen (Member) commented Sep 15, 2019

Applications must call stop(). More things fail to shut down if an app doesn't call it when it intends to finish. If things work after calling stop(), I think it's already working as intended.

@Ngone51 (Member) left a comment


System.exit seems like overkill to me, as it may also kill the user's non-daemon threads.

Maybe we could cache the generated kubernetesClients within SparkKubernetesClientFactory and have SparkKubernetesClientFactory close them in KubernetesClientApplication's run()?

Note that I'm not very familiar with the k8s code in Spark; I just hope this helps.
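
A rough sketch of that caching-and-closing idea (purely illustrative; the names and signatures are simplified stand-ins, not the actual Spark ones):

```scala
import java.io.Closeable
import scala.collection.mutable.ArrayBuffer

// Stand-in for SparkKubernetesClientFactory: remember every client it
// hands out so they can all be closed once the application finishes.
object ClientFactorySketch {
  private val created = ArrayBuffer.empty[Closeable]

  def createKubernetesClient(build: () => Closeable): Closeable = synchronized {
    val client = build()
    created += client
    client
  }

  // Called at the end of a KubernetesClientApplication-style run().
  def closeAll(): Unit = synchronized {
    created.foreach(c => try c.close() catch { case _: Exception => () })
    created.clear()
  }
}
```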

@igorcalabria
Contributor Author

I actually found the root issue that introduced the non-daemon thread in the kubernetes client library: fabric8io/kubernetes-client#1301. It hardcodes the value of pingInterval in the okhttp library, and when that value is higher than 0, okhttp creates a non-daemon thread.

Spark <= 2.4.0 worked fine because it set that value to 0 and used a version prior to the linked PR. We're still setting it to zero (https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala#L83), but the kubernetes client library no longer respects it.
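
For reference, the relevant knobs look roughly like this (a sketch assuming fabric8's ConfigBuilder and okhttp's OkHttpClient.Builder APIs, not Spark's actual code): Spark asks for a websocket ping interval of zero, while the affected kubernetes-client versions effectively hardcoded a positive okhttp ping interval, which is what spawns the non-daemon "OkHttp WebSocket" thread.

```scala
import java.util.concurrent.TimeUnit
import io.fabric8.kubernetes.client.ConfigBuilder
import okhttp3.OkHttpClient

// What Spark requests: no websocket ping, so okhttp has no reason to
// start a ping scheduler thread.
val k8sConfig = new ConfigBuilder()
  .withWebsocketPingInterval(0L)
  .build()

// What the affected kubernetes-client versions effectively did underneath:
// a hardcoded positive ping interval, which keeps a non-daemon okhttp
// thread alive after the driver's main returns.
val httpClient = new OkHttpClient.Builder()
  .pingInterval(30, TimeUnit.SECONDS)
  .build()
```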

I totally agree that System.exit is an overkill solution for this problem and could have unforeseen consequences. You can close this PR; I'll ping back with a new one when kubernetes-client accepts my changes.

@AmplabJenkins

Can one of the admins verify this patch?

@srowen (Member) commented Sep 17, 2019

OK, thanks for the investigation

@srowen srowen closed this Sep 17, 2019
@ifilonenko
Contributor

Hey @igorcalabria, do you have a link to your PR in kubernetes-client? This problem affects our spark-on-k8s jobs, and I would like to get a timeline on the patch or offer help where I can.

@igorcalabria
Contributor Author

@ifilonenko It was already released in version 4.6 of the client. I've just opened a new PR updating Spark's kubernetes client here: #26093

dongjoon-hyun pushed a commit that referenced this pull request Oct 17, 2019
### What changes were proposed in this pull request?

Updated the kubernetes client.

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix in fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

### Does this PR introduce any user-facing change?

Nope, it should be transparent to users

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there's a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.3 is restored and both processes terminate successfully.

Closes #26093 from igorcalabria/k8s-client-update.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Oct 18, 2019
### What changes were proposed in this pull request?

Backport of #26093 to `branch-2.4`

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix in fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there's a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.

Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@johnlinp commented Apr 22, 2020

@igorcalabria After upgrading Spark to 2.4.5 (with your PR #26152 included), there are no more non-daemon threads produced by Spark. In addition, SparkSession implements Closeable, so shutdown hooks should call stop() implicitly. Does that mean we no longer need to explicitly call stop() in applications?
