
[SPARK-2678][Core] Added "--" to prevent spark-submit from shadowing application options #1715

Closed
wants to merge 4 commits

Conversation

liancheng
Contributor

JIRA issue: SPARK-2678

This PR aims to fix SPARK-2678 in a backward compatible way, and replaces PR #1699.

A new style of passing user application options is introduced; spark-submit can now be used in two ways:

# The old, primary resource as positional argument style
spark-submit [options] <app jar | python file> [app options]

# The new, --primary and -- style
spark-submit [options] --primary <app jar | python file> -- [app options]

Before this change, spark-submit shadows application options that share the same name as options recognized by SparkSubmitArguments (e.g. --master, --help, -h, --conf, -c, etc.), and empty strings are not allowed:

# The "--help" option is captured by spark-submit
bin/spark-submit --class Foo userapp.jar --help

# The empty string is not passed to user program
bin/spark-submit --class Foo userapp.jar --some-arg ""

In the new style, all arguments that follow a -- are passed to the user application.

# The "--help" option is now passed to userapp.jar
bin/spark-submit --class Foo --primary userapp.jar -- --help

# Empty string is also allowed
bin/spark-submit --class Foo --primary userapp.jar -- --arg ""

An unrelated change is that bin/beeline is made to delegate to spark-class to avoid duplicating the Java check and classpath computation. At first it was supposed to use spark-submit, but then I realized that Beeline is only a JDBC client and isn't really related to spark-submit, so spark-class is enough.

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17671/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1715:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17671/consoleFull

@@ -17,29 +17,14 @@
# limitations under the License.
#

# Figure out where Spark is installed
Contributor

These changes seem unrelated. Is there a bug you can mention? Otherwise, could you call them out explicitly in the PR description?

Contributor Author

My bad, you're right. I should update the PR description.

While working on #1620, the beeline script was at one point implemented with spark-submit to avoid the duplicated Java check and classpath computation, but was then reverted because of the issue this PR is trying to fix (beeline --help shows the spark-submit usage message).

And while working on this PR, I realized that beeline is only a JDBC client and unrelated to Spark, so I can just start it with spark-class. That's why this change appears here.

@vanzin
Contributor

vanzin commented Aug 1, 2014

A few nits. Also, just wanted to point out that one aspect of backwards compatibility, pointed out by Patrick in #1699, is that this will break things for people who are passing "--" to their apps today. I think that's very unlikely and pretty unavoidable, since spark-submit's current argument parsing is a little questionable to start with.

@pwendell
Contributor

pwendell commented Aug 1, 2014

I don't think it's unavoidable if we make specifying a main jar mutually exclusive with using --

@vanzin
Contributor

vanzin commented Aug 1, 2014

Patrick, I'm not sure what you mean by that. How would someone not specify a main jar?

@pwendell
Contributor

pwendell commented Aug 2, 2014

I mean that they specify it not as a required argument, but as part of the --jars option we already expose:

./spark-submit --master local --jars myJar.jar -- --userOpt a --userOpt b

@liancheng
Contributor Author

Discussed with @pwendell offline; recording a summary here:

  1. We leave the current option passing mode as is to keep backward compatibility

  2. Introduce -- as a new application option passing mode, which should be mutually exclusive with the positional primary resource (either a jar file or a Python file).

    Users need to use --jars and/or --py-files to specify the primary resource if they'd like to use -- as the application option separator.

So basically, we'll have this:

if primary resource exists:
  resort to the old option passing mode
else if -- exists:
  pass everything after -- to the application
  if --py-files exists:
    isPython = true
  else:
    isPython = false
else:
  report error
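
In Scala, a minimal sketch of this dispatch could look like the following (illustrative only, assuming the positional primary resource, the arguments after --, and --py-files have already been parsed; this is not the actual SparkSubmit code and the names are hypothetical):

sealed trait SubmitMode
case class PositionalMode(primaryResource: String) extends SubmitMode
case class SeparatorMode(appArgs: Seq[String], isPython: Boolean) extends SubmitMode

// Hypothetical helper mirroring the pseudocode above.
def chooseMode(
    primaryResource: Option[String],        // positional <app jar | python file>, if any
    argsAfterDashDash: Option[Seq[String]], // everything after "--", if "--" was given
    pyFiles: Seq[String]): SubmitMode =
  primaryResource match {
    case Some(res) => PositionalMode(res)   // old option passing mode
    case None =>
      argsAfterDashDash match {
        case Some(appArgs) => SeparatorMode(appArgs, isPython = pyFiles.nonEmpty)
        case None          => sys.error("Neither a primary resource nor '--' was given")
      }
  }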

@pwendell
Contributor

pwendell commented Aug 2, 2014

Yeah - that was my suggestion. Btw, I don't think this is particularly elegant, but it does allow us to retain compatibility. I think in our docs we could suggest that users use the new mechanism, so the main reason to preserve both options is for 1.0 deployments.

@liancheng
Contributor Author

SparkSubmit uses primaryResource in various places, so omitting the primary resource makes the change hard to get right in every corner case. In particular, for Python applications the primary resource is used to locate the primary Python script. If we simply put it in --py-files, we'll have to add further constraints like "the primary Python resource in --py-files must appear as the first file", which can be rather unintuitive for users.

Instead of omitting it, I'd suggest adding a new option --primary to specify the primary resource when -- is used. Thus, the new option passing style looks something like:

bin/spark-submit --class Foo --primary user.jar --master local -- --arg1 --arg2

To be more precise, the new mode consists of both --primary and --, and is mutually exclusive with the positional primary resource mode. In this way, we needn't change SparkSubmit and can also keep full backward compatibility.
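
A minimal sketch of the mutual-exclusivity check this would imply (hypothetical names, not the actual SparkSubmitArguments code):

// Reject mixing the positional style with the --primary / -- style.
def checkOptionStyles(
    positionalPrimary: Option[String],  // primary resource given as a positional argument
    primaryOption: Option[String],      // value of the proposed --primary option
    sawDashDash: Boolean): Unit = {
  val usesNewStyle = primaryOption.isDefined || sawDashDash
  if (positionalPrimary.isDefined && usesNewStyle) {
    sys.error("Specify the primary resource either positionally or via --primary/--, not both.")
  }
}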

@liancheng
Contributor Author

And, adding --primary also helps to retain support for specifying spark-internal, spark-shell and pyspark-shell as primary resource.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17768/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17769/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1715:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17768/consoleFull

@liancheng
Contributor Author

The build failures seem unrelated to this PR. I guess the Jenkins server ran out of file descriptors?

@andrewor14
Contributor

test this please

| --help, -h Show this help message and exit.
| --verbose, -v Print additional debug output.
|
| --primary The primary jar file or Python file of the application.
Contributor

You may need to explicitly mention that this is only needed if we need to pass --* arguments to the application. Out of context, it's not clear why the user needs to do --primary when they can just pass it directly.

Contributor Author

Thanks, will reword the option description.

Updated the PR description to explain why --primary is needed more clearly.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17771/consoleFull

@@ -322,6 +343,14 @@ private[spark] class SparkSubmitArguments(args: Seq[String]) {
val errMessage = s"Unrecognized option '$value'."
SparkSubmit.printErrorAndExit(errMessage)
Contributor

If the user attempts to pass a --* argument to her application she will reach this case. We should add here something that directs the user to use --primary and -- to possibly specify $value as an application option. Otherwise the only way for them to find out about these two options is if they went through the full help message.

Contributor Author

Are you referring to cases like spark-submit --class Foo app.jar --arg? Actually, --arg can be passed to app.jar correctly, since inSparkOpts is set to false by the time --arg is processed.

Contributor Author

Confirmed this locally with the current master code: spark-submit passes --*-style arguments to the application correctly as long as the user puts them after the primary resource.
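
As a toy model only (not the actual SparkSubmitArguments logic, which tracks an inSparkOpts flag while scanning), the behavior described here amounts to:

// Once the primary resource has been located at index primaryIdx, every remaining
// token is forwarded to the application verbatim, even if it looks like a
// spark-submit option.
def appArgsAfterPrimary(args: Seq[String], primaryIdx: Int): Seq[String] =
  args.drop(primaryIdx + 1)

// appArgsAfterPrimary(Seq("--class", "Foo", "app.jar", "--arg"), primaryIdx = 2)
// => Seq("--arg")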

Contributor

Ah I see. The whole point of this PR is not to make arguments like --hiveconf work, but to make arguments like an application-specific version of --master work. Say if the user did spark-submit --class Foo app.jar --master then this will fail.

Contributor

and they only get here if they do spark-submit --class Foo --hiveconf /some/path app.jar. Then it's fine to leave this as is.

@andrewor14
Contributor

@liancheng @pwendell Is the argument for adding a --primary that the application might accept -- as an argument? For example, if the user does the following:

bin/spark-submit app.jar apparg1 -- apparg2

It's impossible to know whether the application arguments are apparg1 -- apparg2 or just apparg1 apparg2. Then it seems that -- should always be used with --primary. Maybe we should make this point more explicit in the docs and error messages (i.e. don't only mention one or the other).

Have you considered other alternatives for the name "primary"? It makes sense to us because we refer to the user application jar / Python file as the primary resource, but it may not make as much sense to the user, especially when they're just specifying one resource. At the same time, I'm not sure there is a better alternative. (I'm thinking about --application-jar, but that means we'll need something like --application-file for Python.) Maybe the existing name is fine and this is not worth the extra hassle.

@liancheng
Contributor Author

@pwendell As for the difference between application jars (or secondary jars) and the primary jar, what matters is the way SparkSubmit handles them: the primary jar is guaranteed to be the first jar on the classpath, so that the main class specified by --class is never shadowed by a class of the same name in some secondary jar. So removing the primary resource from the user interface (command line options) means:

  1. Users must add the primary resource (either jar file or Python file) in front of all other jars/python files
  2. Internally, the first jar or first Python file should be considered as primary

IMHO, these requirements are too implicit...
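
For illustration only (hypothetical jar names), the ordering guarantee described above amounts to always prepending the primary jar when the classpath is assembled:

import java.io.File

val primaryJar    = "userapp.jar"
val secondaryJars = Seq("lib-a.jar", "lib-b.jar")
// The main class named by --class always resolves from userapp.jar, even if a
// secondary jar contains a class with the same fully qualified name.
val classpath = (primaryJar +: secondaryJars).mkString(File.pathSeparator)
// On Unix-like systems: "userapp.jar:lib-a.jar:lib-b.jar"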

@pwendell
Contributor

pwendell commented Aug 4, 2014

@liancheng why not just have the jars listed on the classpath in the order they are given to us? This is also how classpaths work in general: when I run a java command, I don't give a special flag for the first element in the classpath; I just put it first.

What you are proposing is tantamount to this:

java -firstCpElement "foo.jar" -cp "bar.jar,baz.jar"

isn't it?

@pwendell
Contributor

pwendell commented Aug 4, 2014

@andrewor14 I still don't understand how this is different. Basically, the JVM works such that you put a set of jars in order (indicating precedence) and then you can refer to a class defined in any of the jars. Why should we differ from the normal JVM semantics and have a special name for the first jar in the ordering?

@liancheng
Contributor Author

@pwendell OK, the java -firstCpElement example really convinced me :) I used to think asking users to care about the order of the jars was a little too much, but every sane programmer on the JVM should care about this anyway.

I'll try to deliver a version as you described soon.

@andrewor14
Contributor

@pwendell How about for Python files? What if I have one.py and two.py that reference each other, and I want spark-submit to run the main method of one.py but not two.py? Since we don't specify a class, we can't distinguish between the two main methods unless we impose the requirement that the primary Python file must be the first one.

@liancheng
Contributor Author

@andrewor14 I believe Patrick means:

  1. For Scala/Java applications, the primary jar should appear as the 1st entry of --jars
  2. For Python applications, the primary Python file should appear as the 1st entry of --py-files

@andrewor14
Contributor

Hm, I see. Even then we still need some kind of separator, right? I thought the whole point of handling primary resources differently here (either under --primary or --jars or --py-files) is to provide backwards compatibility in case the user application uses --. If we pick a sufficiently Spark-specific separator, doesn't this issue go away?

@andrewor14
Contributor

Let's say we're using the --jars approach, and I run the following two commands. (Correct me if I'm understanding your proposal incorrectly; I am basing this off @liancheng's pseudocode in an earlier comment.) Here I am assuming that --arg1 and --arg2 are also spark-submit arguments:

bin/spark-submit --jars app.jar --arg1 -- --arg2

Here we treat --arg1 as a spark-submit argument, but pass --arg2 to the application. The user may also specify the primary jar explicitly as follows:

bin/spark-submit app.jar --arg1 -- --arg2

Here we treat both --arg1 and --arg2 as spark-submit arguments, but we may pass -- to the application depending on what --arg1 is. For instance, if --arg1 is --name and takes in a value, then -- will become the app name. Otherwise, if --arg1 is --supervise, which does not take in a value, then -- will be passed to the application.

From the user's perspective, the ways we specify the primary resource in these two commands are nearly equivalent. However, the arguments are actually parsed very differently. On the other hand, if we simply add a new Spark-specific config (--spark-application-args or something) and keep the way we specify the primary resource the same, we get backwards compatibility for free while providing this new functionality. This latter approach just seems simpler to me.
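
A minimal sketch of that alternative (the token --spark-application-args is taken from the suggestion above and is hypothetical, not an existing spark-submit option):

// Split the raw argument vector at a Spark-specific separator; everything after it
// is handed to the application untouched, and everything before it is parsed as
// spark-submit arguments in the usual way.
val separator = "--spark-application-args"

def splitArgs(args: Seq[String]): (Seq[String], Seq[String]) = {
  val idx = args.indexOf(separator)
  if (idx < 0) (args, Seq.empty) else (args.take(idx), args.drop(idx + 1))
}

// splitArgs(Seq("--master", "local", "app.jar", separator, "--help", ""))
// => (Seq("--master", "local", "app.jar"), Seq("--help", ""))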

@mateiz
Contributor

mateiz commented Aug 5, 2014

Hey Cheng, a couple of comments on this:

  • In Java, we should be able to run without a --primary if only --jars and --main-class are given. I'd rather support that than add the primary argument here.
  • If we do this, then the --primary argument becomes relevant only for Python, and we can give it a specific name there, e.g. --main-file. Another option is to add --module and execute the __main__ function in a specific module with python -m module-name. This is the more Pythonic way to do it AFAIK but it will also be more work, so I'd just stick to --main-file in 1.1 and add --module later.

@mateiz
Contributor

mateiz commented Aug 5, 2014

BTW if we do this then there's no need to special-case spark-internal and such, right? We can just pass a class name. Or let me know if that's not true. In that case I'd say we should have a "for Spark use only" parameter here but not expose --primary.

@liancheng
Contributor Author

Thanks Matei, Patrick's last few comments have already convinced me to remove the "primary" notion from the user's perspective. And yes, spark-internal can be removed this way; spark-shell and pyspark-shell can also be removed by checking --class in the Spark scripts. Internally, to keep code that relies on primaryResource intact (one example is the cluster deploy mode in a standalone cluster), we can still pick the first entry of --jars for Java/Scala apps and --main-file for Python apps as the primary.
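
A tiny sketch of that internal selection (hypothetical helper; assumes --jars and the proposed --main-file have already been parsed into jars and mainFile):

// Internally, treat the first --jars entry (JVM apps) or the --main-file value
// (Python apps) as the primary resource, so code relying on primaryResource keeps working.
def internalPrimary(jars: Seq[String], mainFile: Option[String]): Option[String] =
  mainFile.orElse(jars.headOption)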

@mateiz
Contributor

mateiz commented Aug 5, 2014

Sure, that sounds good. The only thing is, if you use the first entry in --jars, I wouldn't automatically look for a main class in it (there's the part that checks the JAR manifest). Instead, make them use --main-class in that case. Otherwise it's a bit confusing that you give only JARs and it starts running some program.

asfgit pushed a commit that referenced this pull request Aug 6, 2014
JIRA issues:

- Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
- Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)

Related PR:

- #1715

This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires some API level changes that need further discussion, and we decided not to include it in Spark 1.1 release. As currently SPARK-2678 only affects Spark SQL scripts, this workaround is enough for Spark 1.1. Command line option handling logic in bash scripts looks somewhat dirty and duplicated, but it helps to provide a cleaner user interface as well as retain full downward compatibility for now.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1801 from liancheng/spark-2874 and squashes the following commits:

8045d7a [Cheng Lian] Make sure test suites pass
8493a9e [Cheng Lian] Using eval to retain quoted arguments
aed523f [Cheng Lian] Fixed typo in bin/spark-sql
f12a0b1 [Cheng Lian] Worked arount SPARK-2678
daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts
(cherry picked from commit a6cd311)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
chutium added a commit to chutium/spark that referenced this pull request Aug 8, 2014
asfgit pushed a commit that referenced this pull request Aug 10, 2014
As sryza reported, spark-shell doesn't accept any flags.
The root cause is wrong usage of spark-submit in spark-shell, and it came to the surface with #1801

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, Closes #1864, and Closes #1861

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
@asfgit asfgit closed this in 4f4a988 Aug 10, 2014
@liancheng liancheng deleted the spark-2678 branch September 24, 2014 00:09