Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484

Closed
wants to merge 2 commits into from

Conversation

vanzin
Copy link
Contributor

@vanzin vanzin commented Nov 26, 2014

The security manager adds a lot of overhead to the runtime of the
app, and causes a severe performance regression. Even stubbing out
all unneeded methods (all except checkExit()) does not help.

So, instead, penalize users who do an explicit System.exit() by leaving
them in "undefined behavior" territory: if they do that, the Yarn
backend won't be able to report the final app status to the RM.
The result is that the final status of the application might not match
the user's expectations.

One side-effect of the change is that users who do an explicit
System.exit() will lose the AM retry functionality. Since there is
no way to know if the exit was because of success or failure, the
AM right now errs on the side of it being a successful exit.

The security manager adds a lot of overhead to the runtime of the
app, and causes a severe performance regression. Even stubbing out
all unneeded methods (all except checkExit()) does not help.

So, instead, penalize users who do an explicit System.exit() by leaving
them in "undefined behavior" territory: if they do that, the Yarn
backend won't be able to report the final app status to the RM.
The result is that the final status of the application might not match
the user's expectations.

One side-effect of the change is that users who do an explicit
System.exit() will lose the AM retry functionality. Since there is
no way to know if the exit was because of success or failure, the
AM right now errs on the side of it being a successful exit.
@vanzin
Copy link
Contributor Author

vanzin commented Nov 26, 2014

@tgravescs @andrewor14 @nishkamravi2 please take a look and let me know if you have questions.

I tested successful and failed apps in yarn client and cluster modes, but didn't try explicit System.exit() (as we want to tell users to not do that).

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23901 has started for PR 3484 at commit 4198b3b.

  • This patch merges cleanly.

@andrewor14
Copy link
Contributor

Hey @vanzin quick question which conditions listed in @tgravescs' comment here does this no longer satisfy?

@vanzin
Copy link
Contributor Author

vanzin commented Nov 26, 2014

cluster UserClass:
system.exit(1) - fail, 3 tries
system.exit(-1) - fail, 3 tries

These will be actually "success" now. All others should remain the same.

@sryza
Copy link
Contributor

sryza commented Nov 26, 2014

If this is behavior that we encourage users to avoid, I would think that defaulting to failure when System.exit is called would be preferable. Also, I'm guessing that user apps are more likely to call System.exit if they have an exit code?

@vanzin
Copy link
Contributor Author

vanzin commented Nov 26, 2014

The main reason why I chose this approach is because if we do what you suggest, a successful app will actually be retried and will eventually fail. That sounds worse than not retrying a failed app.

@andrewor14
Copy link
Contributor

I see. The only thing this changes is the System.exit(x) case, meaning uncaught exceptions will still be properly treated as failures. This LGTM as a hot fix, though I would like confirmation from @tgravescs if possible. If we do merge this into 1.2.0 we should add this to one of the known issues in the release notes.

@vanzin @sryza If finalStatus defaults to UNKNOWN as before, then an application that calls System.exit(0) will be retried and then eventually fail. This is what you mean by "successful app", correct? If so then I think it's better to change the default value to SUCCESS as this PR has done, so that System.exit(x) will always be treated as failures (even if x == 0).

@vanzin
Copy link
Contributor Author

vanzin commented Nov 26, 2014

Actually the current code will fail the app (not succeed) regardless of what System.exit() says, because of the check on L108... so this actually does what Sandy suggests. If we want to not retry successful apps that exit with System.exit(), we'll need to change more code.

So currently this does not work for this case in Tom's list:

cluster UserClass:
system.exit(0) - success

The app will be re-run and eventually fail after the retries.

@andrewor14
Copy link
Contributor

Wait, actually that's pretty bad right? If I run the same user code from 1.1 that calls System.exit(0) now it will be re-run many times.

@vanzin
Copy link
Contributor Author

vanzin commented Nov 26, 2014

Yeah, I'm not a fan of that behavior. Let me look at changing it, but also waiting to hear back from Tom.

@andrewor14
Copy link
Contributor

Ok, until we figure out a way to avoid that I retract my earlier LGTM because this is currently a regression in a different way. Thanks for digging into this.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23901 has finished for PR 3484 at commit 4198b3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23901/
Test PASSed.

@vanzin
Copy link
Contributor Author

vanzin commented Nov 26, 2014

Ok, updated to match what was discussed. Tested with System.exit(0) and System.exit(1) (both result in SUCCEEDED app).

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23906 has started for PR 3484 at commit 21f2502.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23906 has finished for PR 3484 at commit 21f2502.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23906/
Test PASSed.

@tgravescs
Copy link
Contributor

so I'm trying to understand exactly when this affects performance. I saw in the jira you saw its when you use java word count. So is this only an issue when things do a collect back to the ApplicationMaster or does it affect all driver performance? 2x seems huge for just having the securityManager there.

SIGH.. thats unfortunate. We should basically revert back to the 1.1 behavior which is to mark things as succeeded even though they may have failed.

@pwendell
Copy link
Contributor

Hey @tgravescs do you mind if I pull this in for a release candidate? I'm planning to cut one today. If there is some open discussion in this issue though I can defer it.

@pwendell
Copy link
Contributor

Actually reading more in this thread I'll hold off until you or @andrewor14 signs off.

// this shouldn't ever happen, but if it does assume weird failure
finish(FinalApplicationStatus.FAILED,
// This happens when the user application calls System.exit(). We have the choice
// of either failing of succeeding at this point. We report success to avoid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - of should be or (failing or succeeding)

@tgravescs
Copy link
Contributor

mostly looks good. I think we should fix the exit code before pulling it in.

@pwendell
Copy link
Contributor

Okay I'd like to pull this in for rc1, so I'm going to merge this with the edits that Tom suggested. Feel free to open up additional patches.

asfgit pushed a commit that referenced this pull request Nov 28, 2014
The security manager adds a lot of overhead to the runtime of the
app, and causes a severe performance regression. Even stubbing out
all unneeded methods (all except checkExit()) does not help.

So, instead, penalize users who do an explicit System.exit() by leaving
them in "undefined behavior" territory: if they do that, the Yarn
backend won't be able to report the final app status to the RM.
The result is that the final status of the application might not match
the user's expectations.

One side-effect of the change is that users who do an explicit
System.exit() will lose the AM retry functionality. Since there is
no way to know if the exit was because of success or failure, the
AM right now errs on the side of it being a successful exit.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3484 from vanzin/SPARK-4584 and squashes the following commits:

21f2502 [Marcelo Vanzin] Do not retry apps that use System.exit().
4198b3b [Marcelo Vanzin] [SPARK-4584] [yarn] Remove security manager from Yarn AM.

(cherry picked from commit 915f8ee)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
@asfgit asfgit closed this in 915f8ee Nov 28, 2014
@tgravescs
Copy link
Contributor

@vanzin I would still be curious if you have more details on the exact performance impact?

@vanzin
Copy link
Contributor Author

vanzin commented Dec 1, 2014

@tgravescs Nishkam (who filed the bug) has most of the background info on the performance impact. But, from a high level, installing a security manager seems to introduce a non-trivial performance penatly into any application. Even when the methods of the security manager are no-ops, it seems just having to check them all the time adds a lot of overhead, since it's done even for operations that should be generally very fast like accessing fields via reflection.

If you're really really curious we can add instrumentation to measure the performance later when we're not busy trying to get a release out.

@vanzin vanzin deleted the SPARK-4584 branch December 24, 2014 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants