[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484

vanzin · 2014-11-26T18:49:21Z

The security manager adds a lot of overhead to the runtime of the
app, and causes a severe performance regression. Even stubbing out
all unneeded methods (all except checkExit()) does not help.

So, instead, penalize users who do an explicit System.exit() by leaving
them in "undefined behavior" territory: if they do that, the Yarn
backend won't be able to report the final app status to the RM.
The result is that the final status of the application might not match
the user's expectations.

One side-effect of the change is that users who do an explicit
System.exit() will lose the AM retry functionality. Since there is
no way to know if the exit was because of success or failure, the
AM right now errs on the side of it being a successful exit.

The security manager adds a lot of overhead to the runtime of the app, and causes a severe performance regression. Even stubbing out all unneeded methods (all except checkExit()) does not help. So, instead, penalize users who do an explicit System.exit() by leaving them in "undefined behavior" territory: if they do that, the Yarn backend won't be able to report the final app status to the RM. The result is that the final status of the application might not match the user's expectations. One side-effect of the change is that users who do an explicit System.exit() will lose the AM retry functionality. Since there is no way to know if the exit was because of success or failure, the AM right now errs on the side of it being a successful exit.

vanzin · 2014-11-26T18:52:53Z

@tgravescs @andrewor14 @nishkamravi2 please take a look and let me know if you have questions.

I tested successful and failed apps in yarn client and cluster modes, but didn't try explicit System.exit() (as we want to tell users to not do that).

SparkQA · 2014-11-26T18:55:17Z

Test build #23901 has started for PR 3484 at commit 4198b3b.

This patch merges cleanly.

andrewor14 · 2014-11-26T19:17:28Z

Hey @vanzin quick question which conditions listed in @tgravescs' comment here does this no longer satisfy?

vanzin · 2014-11-26T19:20:55Z

cluster UserClass:
system.exit(1) - fail, 3 tries
system.exit(-1) - fail, 3 tries

These will be actually "success" now. All others should remain the same.

sryza · 2014-11-26T19:36:51Z

If this is behavior that we encourage users to avoid, I would think that defaulting to failure when System.exit is called would be preferable. Also, I'm guessing that user apps are more likely to call System.exit if they have an exit code?

vanzin · 2014-11-26T19:39:18Z

The main reason why I chose this approach is because if we do what you suggest, a successful app will actually be retried and will eventually fail. That sounds worse than not retrying a failed app.

andrewor14 · 2014-11-26T19:51:23Z

I see. The only thing this changes is the System.exit(x) case, meaning uncaught exceptions will still be properly treated as failures. This LGTM as a hot fix, though I would like confirmation from @tgravescs if possible. If we do merge this into 1.2.0 we should add this to one of the known issues in the release notes.

@vanzin @sryza If finalStatus defaults to UNKNOWN as before, then an application that calls System.exit(0) will be retried and then eventually fail. This is what you mean by "successful app", correct? If so then I think it's better to change the default value to SUCCESS as this PR has done, so that System.exit(x) will always be treated as failures (even if x == 0).

vanzin · 2014-11-26T20:07:17Z

Actually the current code will fail the app (not succeed) regardless of what System.exit() says, because of the check on L108... so this actually does what Sandy suggests. If we want to not retry successful apps that exit with System.exit(), we'll need to change more code.

So currently this does not work for this case in Tom's list:

cluster UserClass:
system.exit(0) - success

The app will be re-run and eventually fail after the retries.

andrewor14 · 2014-11-26T20:12:03Z

Wait, actually that's pretty bad right? If I run the same user code from 1.1 that calls System.exit(0) now it will be re-run many times.

vanzin · 2014-11-26T20:14:21Z

Yeah, I'm not a fan of that behavior. Let me look at changing it, but also waiting to hear back from Tom.

andrewor14 · 2014-11-26T20:16:22Z

Ok, until we figure out a way to avoid that I retract my earlier LGTM because this is currently a regression in a different way. Thanks for digging into this.

SparkQA · 2014-11-26T20:17:51Z

Test build #23901 has finished for PR 3484 at commit 4198b3b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T20:17:55Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23901/
Test PASSed.

vanzin · 2014-11-26T21:16:50Z

Ok, updated to match what was discussed. Tested with System.exit(0) and System.exit(1) (both result in SUCCEEDED app).

SparkQA · 2014-11-26T21:20:14Z

Test build #23906 has started for PR 3484 at commit 21f2502.

This patch merges cleanly.

SparkQA · 2014-11-26T22:40:02Z

Test build #23906 has finished for PR 3484 at commit 21f2502.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-26T22:40:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23906/
Test PASSed.

tgravescs · 2014-11-27T13:41:13Z

so I'm trying to understand exactly when this affects performance. I saw in the jira you saw its when you use java word count. So is this only an issue when things do a collect back to the ApplicationMaster or does it affect all driver performance? 2x seems huge for just having the securityManager there.

SIGH.. thats unfortunate. We should basically revert back to the 1.1 behavior which is to mark things as succeeded even though they may have failed.

pwendell · 2014-11-27T17:06:56Z

Hey @tgravescs do you mind if I pull this in for a release candidate? I'm planning to cut one today. If there is some open discussion in this issue though I can defer it.

pwendell · 2014-11-27T18:26:11Z

Actually reading more in this thread I'll hold off until you or @andrewor14 signs off.

tgravescs · 2014-11-28T16:46:32Z

yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

-            // this shouldn't ever happen, but if it does assume weird failure
-            finish(FinalApplicationStatus.FAILED,
+            // This happens when the user application calls System.exit(). We have the choice
+            // of either failing of succeeding at this point. We report success to avoid


nit - of should be or (failing or succeeding)

tgravescs · 2014-11-28T17:10:07Z

mostly looks good. I think we should fix the exit code before pulling it in.

pwendell · 2014-11-28T20:10:47Z

Okay I'd like to pull this in for rc1, so I'm going to merge this with the edits that Tom suggested. Feel free to open up additional patches.

The security manager adds a lot of overhead to the runtime of the app, and causes a severe performance regression. Even stubbing out all unneeded methods (all except checkExit()) does not help. So, instead, penalize users who do an explicit System.exit() by leaving them in "undefined behavior" territory: if they do that, the Yarn backend won't be able to report the final app status to the RM. The result is that the final status of the application might not match the user's expectations. One side-effect of the change is that users who do an explicit System.exit() will lose the AM retry functionality. Since there is no way to know if the exit was because of success or failure, the AM right now errs on the side of it being a successful exit. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3484 from vanzin/SPARK-4584 and squashes the following commits: 21f2502 [Marcelo Vanzin] Do not retry apps that use System.exit(). 4198b3b [Marcelo Vanzin] [SPARK-4584] [yarn] Remove security manager from Yarn AM. (cherry picked from commit 915f8ee) Signed-off-by: Patrick Wendell <pwendell@gmail.com>

tgravescs · 2014-12-01T14:27:33Z

@vanzin I would still be curious if you have more details on the exact performance impact?

vanzin · 2014-12-01T18:05:31Z

@tgravescs Nishkam (who filed the bug) has most of the background info on the performance impact. But, from a high level, installing a security manager seems to introduce a non-trivial performance penatly into any application. Even when the methods of the security manager are no-ops, it seems just having to check them all the time adds a lot of overhead, since it's done even for operations that should be generally very fast like accessing fields via reflection.

If you're really really curious we can add instrumentation to measure the performance later when we're not busy trying to get a release out.

Do not retry apps that use System.exit().

21f2502

tgravescs reviewed Nov 28, 2014
View reviewed changes

asfgit closed this in 915f8ee Nov 28, 2014

tgravescs mentioned this pull request Dec 1, 2014

[YARN][SPARK-3293]Fix yarn's web show "SUCCEEDED" when the driver throw a exception in yarn-client #3508

Closed

vanzin deleted the SPARK-4584 branch December 24, 2014 21:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484

[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484

vanzin commented Nov 26, 2014

vanzin commented Nov 26, 2014

SparkQA commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

vanzin commented Nov 26, 2014

sryza commented Nov 26, 2014

vanzin commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

vanzin commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

vanzin commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

SparkQA commented Nov 26, 2014

AmplabJenkins commented Nov 26, 2014

vanzin commented Nov 26, 2014

SparkQA commented Nov 26, 2014

SparkQA commented Nov 26, 2014

AmplabJenkins commented Nov 26, 2014

tgravescs commented Nov 27, 2014

pwendell commented Nov 27, 2014

pwendell commented Nov 27, 2014

tgravescs Nov 28, 2014

tgravescs commented Nov 28, 2014

pwendell commented Nov 28, 2014

tgravescs commented Dec 1, 2014

vanzin commented Dec 1, 2014

[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484

[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484

Conversation

vanzin commented Nov 26, 2014

vanzin commented Nov 26, 2014

SparkQA commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

vanzin commented Nov 26, 2014

sryza commented Nov 26, 2014

vanzin commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

vanzin commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

vanzin commented Nov 26, 2014

andrewor14 commented Nov 26, 2014

SparkQA commented Nov 26, 2014

AmplabJenkins commented Nov 26, 2014

vanzin commented Nov 26, 2014

SparkQA commented Nov 26, 2014

SparkQA commented Nov 26, 2014

AmplabJenkins commented Nov 26, 2014

tgravescs commented Nov 27, 2014

pwendell commented Nov 27, 2014

pwendell commented Nov 27, 2014

tgravescs Nov 28, 2014

Choose a reason for hiding this comment

tgravescs commented Nov 28, 2014

pwendell commented Nov 28, 2014

tgravescs commented Dec 1, 2014

vanzin commented Dec 1, 2014