[SPARK-27812][CORE] Explicit System.exit after job's main #25785
Conversation
Fixes a case where non-daemon threads prevent application shutdown.
ok to test
Hi, @igorcalabria. I know you are working on K8s, but the component tag is based on the code the PR touches. I adjusted the PR title.
Test build #110574 has finished for PR 25785 at commit
Wait, shutdown hooks definitely run when the JVM terminates, unless it is forcibly killed or crashes; I am not sure this is the issue. You don't want to remove the call to stop(); it is necessary. If your app creates non-daemon threads, it has to ensure they terminate, or else indeed any Java application won't stop after main() exits.
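To illustrate the JVM rule being discussed, here is a minimal, self-contained Java sketch (unrelated to Spark's actual code): a thread marked as a daemon does not keep the JVM alive after main() returns, while a non-daemon thread would.

```java
public class DaemonDemo {
    public static void main(String[] args) {
        // A daemon thread does not keep the JVM alive: once main returns
        // (and no non-daemon threads remain), the JVM exits immediately.
        Thread daemon = new Thread(() -> {
            try {
                Thread.sleep(60_000);
                System.out.println("daemon woke up"); // never reached
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        daemon.setDaemon(true); // must be set before start()
        daemon.start();

        // If this thread were non-daemon (the default), the JVM would hang
        // here for 60 seconds after main returns -- exactly the symptom the
        // OkHttp WebSocket threads cause for the Spark driver.
        System.out.println("main done, daemon=" + daemon.isDaemon());
    }
}
```

Running this prints the final line and exits at once; flipping `setDaemon(true)` off reproduces the hang described in the JIRA.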
The main problem is that Spark itself (the k8s client) is creating the non-daemon thread.
I see; can that thread be a daemon? If System.exit is viable (i.e. immediately stopping daemon threads is acceptable), then it should be. But if not, then yeah, such a thread needs to be shut down cleanly somehow during the shutdown process. This could be a shutdown hook.
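As a sketch of the "shut the thread down cleanly" option: `FakeClient` below is a hypothetical stand-in (not the fabric8 API) whose non-daemon worker thread is stopped explicitly in close(), so the JVM can exit normally without any call to System.exit.

```java
public class CleanCloseDemo {
    // Hypothetical client owning a non-daemon worker thread, like the
    // kubernetes client's OkHttp WebSocket threads.
    static class FakeClient implements AutoCloseable {
        private final Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // asked to stop: fall through and let the thread die
            }
        });

        FakeClient() {
            worker.start(); // non-daemon by default
        }

        @Override
        public void close() throws InterruptedException {
            worker.interrupt(); // wake the worker so it can exit
            worker.join();      // wait until it is really gone
            System.out.println("worker stopped");
        }
    }

    public static void main(String[] args) throws Exception {
        try (FakeClient client = new FakeClient()) {
            System.out.println("doing work");
        } // close() stops the non-daemon worker; the JVM then exits normally
    }
}
```

The same close() call could equally be registered as a shutdown hook, as suggested above, provided something else actually triggers JVM shutdown.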
Retest this please. |
Test build #110600 has finished for PR 25785 at commit
Hi, @igorcalabria .
According to the test results, this PR seems to break the Spark Thrift Server module, at least at the UT level. The following is one example failure. To be considered mergeable, this PR should pass all UTs. Could you focus on fixing that module?
```
$ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *.SingleSessionSuite"
...
[info] Tests: succeeded 0, failed 3, canceled 0, ignored 0, pending 0
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 3, Failed 3, Errors 0, Passed 0
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.thriftserver.SingleSessionSuite
[error] (hive-thriftserver/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 258 s, completed Sep 14, 2019 4:16:09 PM
```
I don't think there's an option to create a daemon thread in this case. This was already discussed in square/okhttp#3339. I'm sorry, but I didn't understand what you meant about the viability of System.exit.
@dongjoon-hyun I'll take a look.
Applications must call stop().
`System.exit` seems overkill to me, as it may also kill the user's non-daemon threads. Maybe we could cache those generated `kubernetesClient`s within `SparkKubernetesClientFactory` and have `SparkKubernetesClientFactory` close them in `KubernetesClientApplication`'s `run()`?
Note: I'm not very familiar with the k8s code in Spark; I just hope this helps you.
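A minimal sketch of that suggestion, with hypothetical `ClientFactory`/`Client` names standing in for Spark's actual Scala classes: the factory remembers every client it hands out and exposes a closeAll() that the application's run() could invoke at the end.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ClientFactorySketch {
    // Stand-in for the kubernetes client type; closing it would stop
    // its non-daemon threads.
    interface Client extends AutoCloseable {
        @Override
        void close();
    }

    static class ClientFactory {
        private static final List<Client> CACHE = new CopyOnWriteArrayList<>();

        static Client create() {
            Client c = () -> System.out.println("client closed");
            CACHE.add(c); // remember it so we can close it later
            return c;
        }

        static void closeAll() {
            CACHE.forEach(Client::close); // analogous to the end of run()
            CACHE.clear();
        }
    }

    public static void main(String[] args) {
        ClientFactory.create();
        ClientFactory.create();
        ClientFactory.closeAll(); // prints "client closed" twice
    }
}
```

The design point is that ownership of the clients stays with the factory, so shutdown does not depend on every caller remembering to close its own client.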
I actually found the root issue that introduced the non-daemon thread in the kubernetes lib: fabric8io/kubernetes-client#1301. It is hardcoding the value; Spark <= 2.4.0 worked fine because it sets that value to 0 and used a version prior to the linked PR. We're still setting that value to zero (https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala#L83) but it's no longer being respected by the kubernetes lib. I totally agree that System.exit is an overkill solution for this problem and could have unforeseen consequences. You guys can close this PR; I'll ping back with a new PR when kubernetes-client accepts my changes.
Can one of the admins verify this patch?
OK, thanks for the investigation |
hey @igorcalabria, do you have a link to your PR in `kubernetes-client`?
@ifilonenko It is already released in version 4.6 of the client. I've just opened a new PR updating Spark's kubernetes client here: #26093
### What changes were proposed in this pull request?
Updated kubernetes client.

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need this fix fabric8io/kubernetes-client#1768 that was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

### Does this PR introduce any user-facing change?
No, it should be transparent to users.

### How was this patch tested?
This patch was tested manually using a simple pyspark job:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both Python's and the JVM's processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there are non-daemon threads running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```

This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to. When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.3 is restored and both processes terminate successfully.

Closes #26093 from igorcalabria/k8s-client-update.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Backport of #26093 to `branch-2.4`.

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need this fix fabric8io/kubernetes-client#1768 that was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
This patch was tested manually using a simple pyspark job:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both Python's and the JVM's processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there are non-daemon threads running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```

This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to. When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.

Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@igorcalabria After upgrading Spark to 2.4.5 (with your PR #26152 included), there are no more non-daemon threads produced by Spark. In addition to that,
### What changes were proposed in this pull request?
Explicitly call `System.exit` after the user's main code runs.

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927
If there are non-daemon threads running, the JVM won't call the `ShutdownHook` after the driver's main exits. This means that any job running on Kubernetes that doesn't explicitly call `SparkSession#stop` will hang. I believe that expecting users to include this in every job is unreasonable, since they also need to remember to add an `UncaughtExceptionHandler`; if there's no exception handler, any exception thrown on the driver's side will also hang the process.

Since I'm not that familiar with Spark's codebase, this could be a terrible idea, and I'm hoping that some of you could propose a better solution if that's the case. My educated guess is that there's no expectation that the application will continue to run after the declared main; the only difference is that we're now calling `System.exit`, so shutdown hooks run independently of random non-daemon threads.

### Does this PR introduce any user-facing change?
I'm guessing no. It does not introduce anything the user should notice (besides the fix).
### How was this patch tested?
Took the `SparkPi` example and removed the `spark.stop()` call. The expected behaviour is that the driver exits after the job, but it doesn't. This patch fixes that.
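The JVM behaviour this patch relies on can be sketched outside Spark: `System.exit` begins the shutdown sequence immediately, so registered shutdown hooks still run even while a non-daemon thread is alive, and all remaining threads are then terminated.

```java
public class ExitDemo {
    public static void main(String[] args) {
        // A lingering non-daemon thread, standing in for the OkHttp
        // WebSocket threads left behind by the kubernetes client.
        new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();

        // Shutdown hooks run on normal termination, including System.exit.
        Runtime.getRuntime().addShutdownHook(
            new Thread(() -> System.out.println("shutdown hook ran")));

        System.out.println("main done");
        // Without this call, the JVM would wait up to 60 seconds for the
        // thread above before the hooks could fire; System.exit starts
        // shutdown immediately instead.
        System.exit(0);
    }
}
```

This also illustrates the reviewer's concern above: `System.exit` terminates the user's own non-daemon threads just as abruptly as the stray client threads.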