
[SPARK-27812][CORE] Explicit System.exit after job's main #25785

Closed

Conversation

igorcalabria (Contributor)

What changes were proposed in this pull request?

Explicitly call System.exit after the user's main code runs.

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

If there are non-daemon threads running, the JVM won't shut down after the driver's main exits, so the shutdown hooks never run. This means that any job running on Kubernetes that doesn't explicitly call SparkSession#stop will hang. I believe that expecting users to include this call in every job is unreasonable, since they also need to remember to add an UncaughtExceptionHandler; without one, any exception thrown on the driver's side will also hang the process.

Since I'm not that familiar with Spark's codebase, this could be a terrible idea, and I'm hoping some of you can propose a better solution if that's the case. My educated guess is that there's no expectation that the application will keep running after the declared main returns; the only difference is that we now call System.exit, so shutdown hooks run regardless of stray non-daemon threads.
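
As a minimal, self-contained sketch of the JVM behaviour described above (plain JVM code, not the actual Spark change): a lingering non-daemon thread keeps the process alive after main returns, while an explicit System.exit still runs the registered shutdown hooks before the process terminates.

```scala
object NonDaemonThreadDemo {
  def main(args: Array[String]): Unit = {
    // Shutdown hooks only run once the JVM actually begins shutting down.
    Runtime.getRuntime.addShutdownHook(new Thread(() => println("shutdown hook ran")))

    // A non-daemon thread that never finishes, similar in spirit to the
    // OkHttp websocket thread created by the Kubernetes client.
    val lingering = new Thread(() => while (true) Thread.sleep(1000))
    lingering.setDaemon(false)
    lingering.start()

    println("main finished")
    // Without the next line the JVM never exits and the hook never runs;
    // with it, the hook runs and the process exits despite `lingering`.
    System.exit(0)
  }
}
```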

Does this PR introduce any user-facing change?

I'm guessing no. It does not introduce anything the user should notice (besides the fix).

How was this patch tested?

Took the SparkPi example and removed the spark.stop() call. The expected behaviour is that the driver exits after the job finishes, but it doesn't. This patch fixes that.

fixes case where non-daemon threads prevent application shutdown
@dongjoon-hyun
Member

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27812][CORE][Kubernetes] Explicit System.exit after job's main [SPARK-27812][CORE] Explicit System.exit after job's main Sep 13, 2019
@dongjoon-hyun
Member

Hi, @igorcalabria. I know you are working on K8s, but the component tag is based on the code the PR touches. I adjusted the PR title.

@SparkQA commented Sep 13, 2019

Test build #110574 has finished for PR 25785 at commit cae825c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 14, 2019

Wait, shutdown hooks definitely run when the JVM terminates, unless it is forcibly killed or crashes. I am not sure this is the issue. You don't want to remove a call to stop(); it is necessary. If your app creates non-daemon threads, it has to ensure they terminate, or else indeed any Java application won't stop after main() exits.

@igorcalabria
Contributor Author

The main problem is that Spark itself (the k8s client) is creating the non-daemon thread.

@srowen (Member) commented Sep 14, 2019

I see, can that thread be a daemon? If System.exit is viable (i.e. immediately stopping daemon threads) then it should be. But if not, then yeah such a thread needs to be shut down cleanly somehow during the shutdown process. This could be a shutdown hook.
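
For illustration, closing such a client from a shutdown hook could look roughly like the sketch below; `client` is just a placeholder for whatever object owns the non-daemon thread (the fabric8 KubernetesClient is Closeable), not an actual Spark field.

```scala
import java.io.Closeable

// Register a JVM shutdown hook that closes the client, which in turn
// stops its background threads so the JVM can exit cleanly.
def closeOnShutdown(client: Closeable): Unit = {
  Runtime.getRuntime.addShutdownHook(new Thread(() => client.close()))
}
```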

@dongjoon-hyun
Member

Retest this please.

@SparkQA commented Sep 14, 2019

Test build #110600 has finished for PR 25785 at commit cae825c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment


Hi, @igorcalabria.
According to the test result, this PR seems to break the Spark Thrift Server module, at least at the UT level. The following is one example failure. To be considered mergeable, this PR should pass all UTs. Could you focus on fixing that module?

$ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *.SingleSessionSuite"
...
[info] Tests: succeeded 0, failed 3, canceled 0, ignored 0, pending 0
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 3, Failed 3, Errors 0, Passed 0
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.thriftserver.SingleSessionSuite
[error] (hive-thriftserver/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 258 s, completed Sep 14, 2019 4:16:09 PM

@igorcalabria
Contributor Author

> I see, can that thread be a daemon? If System.exit is viable (i.e. immediately stopping daemon threads) then it should be. But if not, then yeah such a thread needs to be shut down cleanly somehow during the shutdown process. This could be a shutdown hook.

I don't think there's an option to create a daemon thread in this case. This was already discussed in square/okhttp#3339.

I'm sorry, but I didn't understand what you meant about the viability of System.exit. There's already shutdown logic for the Kubernetes client (https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L115), but it is only called if the shutdown hook runs or if the user calls SparkSession#stop(). Calling System.exit is just a way to ensure that the hooks run after the user's main code. Everything is still cleaned up nicely, since System.exit does not bypass the registered hooks.
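
To make the shape of the change concrete, here is a rough, hypothetical sketch (not the actual patch) of running the user's main and then forcing the hooks to run:

```scala
// Hypothetical wrapper around the user's application; names are illustrative.
def runUserMain(userMain: Array[String] => Unit, args: Array[String]): Unit = {
  var exitCode = 0
  try {
    userMain(args)
  } catch {
    case t: Throwable =>
      t.printStackTrace()
      exitCode = 1
  }
  // System.exit does not bypass registered shutdown hooks; it triggers them,
  // so the existing k8s client cleanup still runs. The call only prevents
  // stray non-daemon threads from keeping the JVM alive afterwards.
  System.exit(exitCode)
}
```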

> Hi, @igorcalabria.
> According to the test result, this PR seems to break the Spark Thrift Server module, at least at the UT level. The following is one example failure. To be considered mergeable, this PR should pass all UTs. Could you focus on fixing that module?
>
> $ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *.SingleSessionSuite"
> ...
> [info] Tests: succeeded 0, failed 3, canceled 0, ignored 0, pending 0
> [info] *** 3 TESTS FAILED ***
> [error] Failed: Total 3, Failed 3, Errors 0, Passed 0
> [error] Failed tests:
> [error] 	org.apache.spark.sql.hive.thriftserver.SingleSessionSuite
> [error] (hive-thriftserver/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 258 s, completed Sep 14, 2019 4:16:09 PM

@dongjoon-hyun I'll take a look.

@srowen (Member) commented Sep 15, 2019

Applications must call stop(). More things fail to shut down if an app doesn't call it when it intends to finish. If things work after calling stop(), I think it's already working as intended.

@Ngone51 (Member) left a comment


System.exit seems like overkill to me, as it may also kill the user's non-daemon threads.

Maybe we could cache the generated kubernetesClients within SparkKubernetesClientFactory and have SparkKubernetesClientFactory close them in KubernetesClientApplication's run()?

Note that I'm not very familiar with the k8s code in Spark; I just hope this helps.
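
A rough sketch of that caching-and-closing idea (purely illustrative; the names and signatures are simplified stand-ins, not the actual Spark ones):

```scala
import java.io.Closeable
import scala.collection.mutable.ArrayBuffer

// Stand-in for SparkKubernetesClientFactory: remember every client it
// hands out so they can all be closed once the application finishes.
object ClientFactorySketch {
  private val created = ArrayBuffer.empty[Closeable]

  def createKubernetesClient(build: () => Closeable): Closeable = synchronized {
    val client = build()
    created += client
    client
  }

  // Called at the end of a KubernetesClientApplication-style run().
  def closeAll(): Unit = synchronized {
    created.foreach(c => try c.close() catch { case _: Exception => () })
    created.clear()
  }
}
```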

@igorcalabria
Contributor Author

I actually found the root issue that introduced the non-daemon thread in the kubernetes client library: fabric8io/kubernetes-client#1301. It hardcodes the value of pingInterval in the okhttp library, and when that value is higher than 0, okhttp creates a non-daemon thread.

Spark <= 2.4.0 worked fine because it set that value to 0 and used a version prior to the linked PR. We're still setting it to zero (https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala#L83), but the kubernetes client library no longer respects it.
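
For reference, the relevant knobs look roughly like this (a sketch assuming fabric8's ConfigBuilder and okhttp's OkHttpClient.Builder APIs, not Spark's actual code): Spark asks for a websocket ping interval of zero, while the affected kubernetes-client versions effectively hardcoded a positive okhttp ping interval, which is what spawns the non-daemon "OkHttp WebSocket" thread.

```scala
import java.util.concurrent.TimeUnit
import io.fabric8.kubernetes.client.ConfigBuilder
import okhttp3.OkHttpClient

// What Spark requests: no websocket ping, so okhttp has no reason to
// start a ping scheduler thread.
val k8sConfig = new ConfigBuilder()
  .withWebsocketPingInterval(0L)
  .build()

// What the affected kubernetes-client versions effectively did underneath:
// a hardcoded positive ping interval, which keeps a non-daemon okhttp
// thread alive after the driver's main returns.
val httpClient = new OkHttpClient.Builder()
  .pingInterval(30, TimeUnit.SECONDS)
  .build()
```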

I totally agree that System.exit is an overkill solution for this problem and could have unforeseen consequences. You can close this PR; I'll ping back with a new one when kubernetes-client accepts my changes.

@AmplabJenkins

Can one of the admins verify this patch?

@srowen (Member) commented Sep 17, 2019

OK, thanks for the investigation

@srowen srowen closed this Sep 17, 2019
@ifilonenko
Contributor

Hey @igorcalabria, do you have a link to your PR in kubernetes-client? This problem affects our spark-on-k8s jobs, and I would like to get a timeline on the patch or offer help where I can.

@igorcalabria
Contributor Author

@ifilonenko It was already released in version 4.6 of the client. I've just opened a new PR updating Spark's kubernetes client here: #26093

dongjoon-hyun pushed a commit that referenced this pull request Oct 17, 2019
### What changes were proposed in this pull request?

Updated the kubernetes client.

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix in fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

### Does this PR introduce any user-facing change?

Nope, it should be transparent to users

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there's a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.3 is restored and both processes terminate successfully.

Closes #26093 from igorcalabria/k8s-client-update.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Oct 18, 2019
### What changes were proposed in this pull request?

Backport of #26093 to `branch-2.4`

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix in fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there's a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.

Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@johnlinp commented Apr 22, 2020

@igorcalabria After upgrading Spark to 2.4.5 (with your PR #26152 included), there are no more non-daemon threads produced by Spark. In addition, SparkSession implements Closeable, so shutdown hooks should call stop() implicitly. Does that mean we no longer need to explicitly call stop() in applications?
