[SPARK-2678][Core] Added "--" to prevent spark-submit from shadowing application options #1715
Conversation
QA tests have started for PR 1715. This patch merges cleanly.
QA results for PR 1715:
@@ -17,29 +17,14 @@
 # limitations under the License.
 #

 # Figure out where Spark is installed
These changes seem unrelated. Is there a bug you can mention? Otherwise, could you call them out explicitly in the PR description?
My bad, you're right. I should update the PR description.
While working on #1620, the `beeline` script had once been implemented with `spark-submit` to avoid duplicated `java` checks and classpath computation, but was then reverted because of the issue this PR is trying to fix (`beeline --help` shows the `spark-submit` usage message). And while working on this PR, I realized that `beeline` is only a JDBC client, unrelated to Spark, so I can just start it with `spark-class`. That's the reason why this change appears here.
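To illustrate the delegation, here is a minimal, runnable sketch of that pattern. The `spark_class` function below stands in for the real `bin/spark-class` script so the example is self-contained; the BeeLine class name is the standard Hive one, used here as an assumption about what the script would launch.

```shell
# Stand-in for the real bin/spark-class script, so this sketch is runnable.
spark_class() { echo "spark-class invoked: $*"; }

# Thin launcher sketch: forward every user argument verbatim with "$@",
# so e.g. `beeline --help` reaches BeeLine instead of spark-submit.
beeline() {
  local class="org.apache.hive.beeline.BeeLine"   # assumed class name
  spark_class "$class" "$@"
}

beeline --help
# prints: spark-class invoked: org.apache.hive.beeline.BeeLine --help
```

Because the launcher never interprets its arguments, `--help` and friends pass straight through to the JDBC client.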
A few nits. Also, just wanted to point out that one aspect of backwards compatibility, pointed out by Patrick in #1699, is that this will break things for people who are passing "--" to their apps today. I think that's very unlikely and pretty unavoidable, since spark-submit's current argument parsing is a little questionable to start with.
I don't think it's unavoidable if we make specifying a main jar mutually exclusive with using
Patrick, I'm not sure what you mean by that. How would someone not specify a main jar?
I mean that they specify it not as a required argument, but as part of the
Discussed with @pwendell offline; recording a summary here.
So basically, we'll have this:
Yeah - that was my suggestion. Btw, I don't think this is particularly elegant, but it does allow us to retain compatibility. I think in our docs we could suggest that users use the new mechanism, so the main reason to preserve both options is for 1.0 deployments.
Instead of omitting it, I'd suggest adding a new option
To be more precise, the new mode consists of both
And, adding
QA tests have started for PR 1715. This patch merges cleanly.
QA results for PR 1715:
The build failures seem unrelated to this PR. I guess the Jenkins server ran out of file descriptors?
test this please |
  --help, -h        Show this help message and exit.
  --verbose, -v     Print additional debug output.

  --primary         The primary jar file or Python file of the application.
You may need to explicitly mention that this is only needed if we need to pass `--*` arguments to the application. Out of context, it's not clear why the user needs to do `--primary` when they can just pass it directly.
Thanks, will reword the option description.
Updated the PR description to explain more clearly why `--primary` is needed.
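For illustration, the `--primary` / `--` interface discussed above could be modeled as follows. This is a hypothetical sketch, not the actual `SparkSubmitArguments` code: arguments before `--` belong to spark-submit, arguments after it belong to the application, and the application file is named by `--primary`.

```shell
# Hypothetical model of the proposed --primary / -- interface.
parse_submit_args() {
  local primary="" spark_opts="" app_args=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --)        shift; app_args="$*"; break ;;   # rest goes to the app
      --primary) primary="$2"; shift 2 ;;         # application file
      *)         spark_opts="${spark_opts:+$spark_opts }$1"; shift ;;
    esac
  done
  printf 'primary=%s\nspark_opts=%s\napp_args=%s\n' \
    "$primary" "$spark_opts" "$app_args"
}

parse_submit_args --master local --primary app.jar -- --master app-master --help
# prints:
# primary=app.jar
# spark_opts=--master local
# app_args=--master app-master --help
```

Note how the second `--master` and the `--help` land in the application's arguments instead of being swallowed by spark-submit.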
QA tests have started for PR 1715. This patch merges cleanly.
@@ -322,6 +343,14 @@ private[spark] class SparkSubmitArguments(args: Seq[String]) {
    val errMessage = s"Unrecognized option '$value'."
    SparkSubmit.printErrorAndExit(errMessage)
If the user attempts to pass a `--*` argument to her application, she will reach this case. We should add something here that directs the user to use `--primary` and `--` to possibly specify `$value` as an application option. Otherwise, the only way for them to find out about these two options is to go through the full help message.
Are you referring to cases like `spark-submit --class Foo app.jar --arg`? Actually `--arg` can be passed to `app.jar` correctly, since `inSparkOpts` is set to `false` when `--arg` is being processed.
Confirmed this locally with current master code. `spark-submit` passes `--*`-like arguments correctly to the application as long as the user puts them after the primary resource.
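The behavior described here can be modeled with a short sketch. This is a simplified, hypothetical rendering of the parsing rule (the real logic lives in `SparkSubmitArguments.scala`); for brevity it assumes every spark-submit option takes exactly one value.

```shell
# Simplified model of the existing parsing: spark-submit consumes
# recognized --options only until the first bare token (the primary
# resource); after that, everything goes to the application as-is.
split_old_style() {
  local in_spark_opts=true primary="" app_args=""
  while [ $# -gt 0 ]; do
    if $in_spark_opts && [ "${1#--}" != "$1" ]; then
      shift 2   # simplification: assume each option takes one value
    elif $in_spark_opts; then
      primary="$1"; in_spark_opts=false; shift
    else
      app_args="${app_args:+$app_args }$1"; shift
    fi
  done
  echo "primary=$primary app_args=$app_args"
}

# --arg comes after app.jar, so it reaches the application:
split_old_style --class Foo app.jar --arg
# prints: primary=app.jar app_args=--arg
```

The catch, as the thread goes on to note, is that this only helps with option names spark-submit does not itself recognize; the positional split is what the `--` separator is meant to make explicit.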
Ah I see. The whole point of this PR is not to make arguments like `--hiveconf` work, but to make arguments like an application-specific version of `--master` work. Say the user did `spark-submit --class Foo app.jar --master`; then this will fail.
...and they only get here if they do `spark-submit --class Foo --hiveconf /some/path app.jar`. Then it's fine to leave this as is.
@liancheng @pwendell Is the argument for adding a
It's impossible to know whether the application arguments are
Have you considered other alternatives for the name "primary"? It makes sense to us because we refer to the user application jar / Python file as the primary resource, but it may not make as much sense to the user, especially when they're just specifying one resource. At the same time, I'm not sure if there is a better alternative. (I'm thinking about
@pwendell As for the difference between application jars (or secondary jars) and the primary jar, what is important is the way
IMHO, these requirements are too implicit...
@liancheng Why not just have the jars listed on the classpath in the order they are given to us? This is also how classpaths work in general: when I run a java command, I don't give a special flag for the first element in the classpath, I just put it first. What you are proposing is tantamount to this, isn't it?
@andrewor14 I still don't understand how this is different. Basically, the JVM works such that you put a set of jars in order (indicating precedence) and then you can refer to a class defined in any of the jars. Why should we differ from the normal JVM semantics and have a special name for the first jar in the ordering?
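The JVM rule being referenced here can be simulated with plain files: scan the classpath entries in order and take the first match. The first entry needs no special flag; it wins simply by position. Below is a small sketch of that first-match rule, using directories as stand-ins for jars (this simulation is for illustration only, not how the JVM is implemented).

```shell
# Simulate JVM-style classpath resolution: walk entries in order and
# return the path of the first entry that contains the requested file.
resolve_class() {
  local classpath="$1" cls="$2" entry
  local IFS=':'
  for entry in $classpath; do
    if [ -e "$entry/$cls" ]; then
      echo "$entry/$cls"
      return 0
    fi
  done
  return 1
}

# Two "jars" (directories here) both define Main; the earlier entry
# wins, with no special flag marking it as primary.
tmp="$(mktemp -d)"
mkdir -p "$tmp/a" "$tmp/b"
touch "$tmp/a/Main.class" "$tmp/b/Main.class"
resolve_class "$tmp/a:$tmp/b" Main.class   # resolves to the copy under a/
```

This is the precedence-by-ordering behavior Patrick argues spark-submit should simply inherit rather than introducing a dedicated "primary" name.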
@pwendell OK. I'll try to deliver a version as you described soon.
@pwendell How about for Python files? What if I have `one.py` and `two.py` that reference each other, and I want spark-submit to run the main method of `one.py` but not `two.py`? Since we don't specify a class, we can't distinguish between the two main methods unless we impose the requirement that the primary Python file must be the first one.
@andrewor14 I believe Patrick means:
Hm, I see. Even then we still need some kind of separator, right? I thought the whole point of handling primary resources differently here (either under
Let's say we're using the
Here we treat
Here we treat both
From the user's perspective, the ways we specify the primary resource in these two commands are nearly equivalent. However, the arguments are actually parsed very differently. On the other hand, if we simply add a new Spark-specific config (
Hey Cheng, a couple of comments on this:
BTW if we do this then there's no need to special-case
Thanks Matei, Patrick's last few comments have already convinced me to remove the "primary" notion from a user's perspective. And yes,
Sure, that sounds good. The only thing is, if you use the first entry in
JIRA issues:
- Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
- Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)

Related PR: #1715

This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires some API level changes that need further discussion, and we decided not to include it in the Spark 1.1 release. As currently SPARK-2678 only affects Spark SQL scripts, this workaround is enough for Spark 1.1. Command line option handling logic in the bash scripts looks somewhat dirty and duplicated, but it helps to provide a cleaner user interface as well as retain full downward compatibility for now.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1801 from liancheng/spark-2874 and squashes the following commits:

8045d7a [Cheng Lian] Make sure test suites pass
8493a9e [Cheng Lian] Using eval to retain quoted arguments
aed523f [Cheng Lian] Fixed typo in bin/spark-sql
f12a0b1 [Cheng Lian] Worked around SPARK-2678
daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts

(cherry picked from commit a6cd311)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
As sryza reported, spark-shell doesn't accept any flags. The root cause is wrong usage of spark-submit in spark-shell, and it came to the surface via #1801.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, closes #1864, and closes #1861.

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable use of options including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
JIRA issue: SPARK-2678

This PR aims to fix SPARK-2678 in a downward compatible way, and replaces PR #1699.

A new user application option passing style is introduced: `spark-submit` can now be used in two ways.

Before this change, `spark-submit` shadows application options that share a name with those recognized by `SparkSubmitArguments` (e.g. `--master`, `--help`, `-h`, `--conf` and `-c`, etc.), and the empty string is not allowed as an argument.

In the new style, all arguments that follow a `--` are passed to the user application.

An unrelated change is that `bin/beeline` is made to delegate to `spark-class`, to avoid duplicated `java` checking and classpath computation. At first it was supposed to use `spark-submit`, but then I realized that Beeline is only a JDBC client and is actually not related to `spark-submit`, so using `spark-class` is enough.
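The `--` convention described above can be sketched as a stand-alone split (a hypothetical parser for illustration, not the actual `SparkSubmitArguments` implementation): everything after the first `--` is passed to the user application verbatim, so names like `--master` or `--help` are no longer shadowed.

```shell
# Sketch of the new style: split the command line at the first "--";
# everything after it belongs to the user application, even options
# whose names collide with spark-submit's own (--master, --help, ...).
split_at_separator() {
  local submit_args="" app_args="" seen=false arg
  for arg in "$@"; do
    if ! $seen && [ "$arg" = "--" ]; then
      seen=true
    elif $seen; then
      app_args="${app_args:+$app_args }$arg"
    else
      submit_args="${submit_args:+$submit_args }$arg"
    fi
  done
  printf 'submit=%s\napp=%s\n' "$submit_args" "$app_args"
}

split_at_separator --master local app.jar -- --master app-value --help
# prints:
# submit=--master local app.jar
# app=--master app-value --help
```

Since only the first `--` is special, an application that itself needs a literal `--` among its arguments can still receive one after the separator.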