
[SPARK-2678][Core] Added "--" to prevent spark-submit from shadowing application options #1715

Closed
wants to merge 4 commits

Conversation

liancheng
Contributor

JIRA issue: SPARK-2678

This PR aims to fix SPARK-2678 in a backward compatible way, and replaces PR #1699.

A new style of passing user application options is introduced; spark-submit can now be used in two ways:

# The old, primary resource as positional argument style
spark-submit [options] <app jar | python file> [app options]

# The new, --primary and -- style
spark-submit [options] --primary <app jar | python file> -- [app options]

Before this change, spark-submit shadows application options that share the same name as options recognized by SparkSubmitArguments (e.g. --master, --help, -h, --conf, -c, etc.), and empty strings are not allowed:

# The "--help" option is captured by spark-submit
bin/spark-submit --class Foo userapp.jar --help

# The empty string is not passed to user program
bin/spark-submit --class Foo userapp.jar --some-arg ""

In the new style, all arguments that follow a -- are passed to the user application.

# The "--help" option is now passed to userapp.jar
bin/spark-submit --class Foo --primary userapp.jar -- --help

# Empty string is also allowed
bin/spark-submit --class Foo --primary userapp.jar -- --arg ""

An unrelated change is that bin/beeline is made to delegate to spark-class to avoid duplicating the Java check and classpath computation. At first it was supposed to use spark-submit, but then I realized that Beeline is only a JDBC client and isn't really related to spark-submit, so spark-class is enough.

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17671/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1715:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17671/consoleFull

@@ -17,29 +17,14 @@
# limitations under the License.
#

# Figure out where Spark is installed
Contributor

These changes seem unrelated. Is there a bug you can mention? Otherwise, could you call them out explicitly in the PR description?

Contributor Author

My bad, you're right. I should update the PR description.

While working on #1620, the beeline script was at one point implemented with spark-submit to avoid the duplicated Java check and classpath computation, but was then reverted because of the issue this PR is trying to fix (beeline --help shows the spark-submit usage message).

And while working on this PR, I realized that beeline is only a JDBC client and unrelated to Spark, so I can just start it with spark-class. That's why this change appears here.

@vanzin
Contributor

vanzin commented Aug 1, 2014

A few nits. Also, just wanted to point out that one aspect of backwards compatibility, pointed out by Patrick in #1699, is that this will break things for people who are passing "--" to their apps today. I think that's very unlikely and pretty unavoidable, since spark-submit's current argument parsing is a little questionable to start with.

@pwendell
Contributor

pwendell commented Aug 1, 2014

I don't think it's unavoidable if we make specifying a main jar mutually exclusive with using --

@vanzin
Contributor

vanzin commented Aug 1, 2014

Patrick, I'm not sure what you mean by that. How would someone not specify a main jar?

@pwendell
Contributor

pwendell commented Aug 2, 2014

I mean that they specify it not as a required argument, but as part of the --jars option we already expose:

./spark-submit --master local --jars myJar.jar -- --userOpt a --userOpt b

@liancheng
Contributor Author

Discussed with @pwendell offline; recording a summary here:

  1. We leave the current option passing mode as is to keep backward compatibility

  2. Introduce -- as a new application option passing mode, which should be mutually exclusive with the positional primary resource (either a jar file or a Python file).

    Users need to use --jars and/or --py-files to specify the primary resource if they'd like to use -- as the application option separator.

So basically, we'll have this:

if primary resource exists:
  resort to the old option passing mode
else if -- exists:
  pass everything after -- to the application
  if --py-files exists:
    isPython = true
  else:
    isPython = false
else:
  report error
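
In Scala, a minimal sketch of this dispatch could look like the following (illustrative only, assuming the positional primary resource, the arguments after --, and --py-files have already been parsed; this is not the actual SparkSubmit code and the names are hypothetical):

sealed trait SubmitMode
case class PositionalMode(primaryResource: String) extends SubmitMode
case class SeparatorMode(appArgs: Seq[String], isPython: Boolean) extends SubmitMode

// Hypothetical helper mirroring the pseudocode above.
def chooseMode(
    primaryResource: Option[String],        // positional <app jar | python file>, if any
    argsAfterDashDash: Option[Seq[String]], // everything after "--", if "--" was given
    pyFiles: Seq[String]): SubmitMode =
  primaryResource match {
    case Some(res) => PositionalMode(res)   // old option passing mode
    case None =>
      argsAfterDashDash match {
        case Some(appArgs) => SeparatorMode(appArgs, isPython = pyFiles.nonEmpty)
        case None          => sys.error("Neither a primary resource nor '--' was given")
      }
  }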

@pwendell
Contributor

pwendell commented Aug 2, 2014

Yeah - that was my suggestion. Btw, I don't think this is particularly elegant, but it does allow us to retain compatibility. I think in our docs we could suggest that users use the new mechanism, so the main reason to preserve both options is for 1.0 deployments.

@liancheng
Contributor Author

SparkSubmit uses primaryResource in various places, so omitting the primary resource makes the change hard to get right in every corner case. In particular, for Python applications the primary resource is used to locate the primary Python script. If we simply put it in --py-files, we'll have to add further constraints like "the primary Python resource in --py-files must appear as the first file", which can be rather unintuitive for users.

Instead of omitting it, I'd suggest adding a new option --primary to specify the primary resource when -- is used. Thus, the new option passing style looks something like:

bin/spark-submit --class Foo --primary user.jar --master local -- --arg1 --arg2

To be more precise, the new mode consists of both --primary and --, and is mutually exclusive with the positional primary resource mode. In this way, we needn't change SparkSubmit and can also keep full backward compatibility.
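
A minimal sketch of the mutual-exclusivity check this would imply (hypothetical names, not the actual SparkSubmitArguments code):

// Reject mixing the positional style with the --primary / -- style.
def checkOptionStyles(
    positionalPrimary: Option[String],  // primary resource given as a positional argument
    primaryOption: Option[String],      // value of the proposed --primary option
    sawDashDash: Boolean): Unit = {
  val usesNewStyle = primaryOption.isDefined || sawDashDash
  if (positionalPrimary.isDefined && usesNewStyle) {
    sys.error("Specify the primary resource either positionally or via --primary/--, not both.")
  }
}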

@liancheng
Contributor Author

And, adding --primary also helps to retain support for specifying spark-internal, spark-shell and pyspark-shell as primary resource.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17768/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17769/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1715:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17768/consoleFull

@liancheng
Contributor Author

The build failures seem unrelated to this PR. I guess the Jenkins server ran out of file descriptors?

@andrewor14
Contributor

test this please

| --help, -h Show this help message and exit.
| --verbose, -v Print additional debug output.
|
| --primary The primary jar file or Python file of the application.
Contributor

You may need to explicitly mention that this is only needed if we need to pass --* arguments to the application. Out of context, it's not clear why the user needs to do --primary when they can just pass it directly.

Contributor Author

Thanks, will reword the option description.

Updated the PR description to explain why --primary is needed more clearly.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1715. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17771/consoleFull

@@ -322,6 +343,14 @@ private[spark] class SparkSubmitArguments(args: Seq[String]) {
val errMessage = s"Unrecognized option '$value'."
SparkSubmit.printErrorAndExit(errMessage)
Contributor

If the user attempts to pass a --* argument to her application she will reach this case. We should add here something that directs the user to use --primary and -- to possibly specify $value as an application option. Otherwise the only way for them to find out about these two options is if they went through the full help message.

Contributor Author

Are you referring to cases like spark-submit --class Foo app.jar --arg? Actually, --arg can be passed to app.jar correctly, since inSparkOpts is set to false by the time --arg is processed.

Contributor Author

Confirmed this locally with the current master code: spark-submit passes --*-style arguments to the application correctly as long as the user puts them after the primary resource.
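
As a toy model only (not the actual SparkSubmitArguments logic, which tracks an inSparkOpts flag while scanning), the behavior described here amounts to:

// Once the primary resource has been located at index primaryIdx, every remaining
// token is forwarded to the application verbatim, even if it looks like a
// spark-submit option.
def appArgsAfterPrimary(args: Seq[String], primaryIdx: Int): Seq[String] =
  args.drop(primaryIdx + 1)

// appArgsAfterPrimary(Seq("--class", "Foo", "app.jar", "--arg"), primaryIdx = 2)
// => Seq("--arg")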

Contributor

Ah I see. The whole point of this PR is not to make arguments like --hiveconf work, but to make arguments like an application-specific version of --master work. Say if the user did spark-submit --class Foo app.jar --master then this will fail.

Contributor

and they only get here if they do spark-submit --class Foo --hiveconf /some/path app.jar. Then it's fine to leave this as is.

@andrewor14
Contributor

@liancheng @pwendell Is the argument for adding a --primary that the application might accept -- as an argument? For example, if the user does the following:

bin/spark-submit app.jar apparg1 -- apparg2

It's impossible to know whether the application arguments are apparg1 -- apparg2 or just apparg1 apparg2. Then it seems that -- should always be used with --primary. Maybe we should make this point more explicit in the docs and error messages (i.e. don't only mention one or the other).

Have you considered other alternatives for the name "primary"? It makes sense to us because we refer to the user application jar / Python file as the primary resource, but it may not make as much sense to the user, especially when they're just specifying one resource. At the same time, I'm not sure there is a better alternative. (I'm thinking about --application-jar, but that means we'll need something like --application-file for Python.) Maybe the existing name is fine and this is not worth the extra hassle.

@liancheng
Contributor Author

@pwendell As for the difference between application jars (or secondary jars) and the primary jar, what matters is the way SparkSubmit handles them: the primary jar is guaranteed to be the first jar on the classpath, so that the main class specified by --class is never shadowed by a class of the same name in some secondary jar. So removing the primary resource from the user interface (command line options) means:

  1. Users must add the primary resource (either jar file or Python file) in front of all other jars/python files
  2. Internally, the first jar or first Python file should be considered as primary

IMHO, these requirements are too implicit...
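
For illustration only (hypothetical jar names), the ordering guarantee described above amounts to always prepending the primary jar when the classpath is assembled:

import java.io.File

val primaryJar    = "userapp.jar"
val secondaryJars = Seq("lib-a.jar", "lib-b.jar")
// The main class named by --class always resolves from userapp.jar, even if a
// secondary jar contains a class with the same fully qualified name.
val classpath = (primaryJar +: secondaryJars).mkString(File.pathSeparator)
// On Unix-like systems: "userapp.jar:lib-a.jar:lib-b.jar"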

@pwendell
Contributor

pwendell commented Aug 4, 2014

@liancheng why not just have the jars listed on the classpath in the order they are given to us? This is also how classpaths work in general: when I run a java command, I don't give a special flag for the first element in the classpath; I just put it first.

What you are proposing is tantamount to this:

java -firstCpElement "foo.jar" -cp "bar.jar,baz.jar"

isn't it?

@pwendell
Contributor

pwendell commented Aug 4, 2014

@andrewor14 I still don't understand how this is different. Basically, the JVM works such that you put a set of jars in order (indicating precedence) and then you can refer to a class defined in any of the jars. Why should we differ from the normal JVM semantics and have a special name for the first jar in the ordering?

@liancheng
Contributor Author

@pwendell OK, the java -firstCpElement example really convinced me :) I used to think asking users to care about the order of the jars was a little too much, but every sane programmer on the JVM should care about this anyway.

I'll try to deliver a version as you described soon.

@andrewor14
Contributor

@pwendell How about for Python files? What if I have one.py and two.py that reference each other, and I want spark-submit to run the main method of one.py but not two.py? Since we don't specify a class, we can't distinguish between the two main methods unless we impose the requirement that the primary Python file must be the first one.

@liancheng
Contributor Author

@andrewor14 I believe Patrick means:

  1. For Scala/Java applications, the primary jar should appear as the 1st entry of --jars
  2. For Python applications, the primary Python file should appear as the 1st entry of --py-files

@andrewor14
Contributor

Hm, I see. Even then we still need some kind of separator, right? I thought the whole point of handling primary resources differently here (either under --primary or --jars or --py-files) is to provide backwards compatibility in case the user application uses --. If we pick a sufficiently Spark-specific separator, doesn't this issue go away?

@andrewor14
Contributor

Let's say we're using the --jars approach, and I run the following two commands. (Correct me if I'm understanding your proposal incorrectly; I am basing this off @liancheng's pseudocode in an earlier comment.) Here I am assuming that --arg1 and --arg2 are also spark-submit arguments:

bin/spark-submit --jars app.jar --arg1 -- --arg2

Here we treat --arg1 as a spark-submit argument, but pass --arg2 to the application. The user may also specify the primary jar explicitly as follows:

bin/spark-submit app.jar --arg1 -- --arg2

Here we treat both --arg1 and --arg2 as spark-submit arguments, but we may pass -- to the application depending on what --arg1 is. For instance, if --arg1 is --name and takes in a value, then -- will become the app name. Otherwise, if --arg1 is --supervise, which does not take in a value, then -- will be passed to the application.

From the user's perspective, the ways we specify the primary resource in these two commands are nearly equivalent. However, the arguments are actually parsed very differently. On the other hand, if we simply add a new Spark-specific config (--spark-application-args or something) and keep the way we specify the primary resource the same, we get backwards compatibility for free while providing this new functionality. This latter approach just seems simpler to me.
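
A minimal sketch of that alternative (the token --spark-application-args is taken from the suggestion above and is hypothetical, not an existing spark-submit option):

// Split the raw argument vector at a Spark-specific separator; everything after it
// is handed to the application untouched, and everything before it is parsed as
// spark-submit arguments in the usual way.
val separator = "--spark-application-args"

def splitArgs(args: Seq[String]): (Seq[String], Seq[String]) = {
  val idx = args.indexOf(separator)
  if (idx < 0) (args, Seq.empty) else (args.take(idx), args.drop(idx + 1))
}

// splitArgs(Seq("--master", "local", "app.jar", separator, "--help", ""))
// => (Seq("--master", "local", "app.jar"), Seq("--help", ""))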

@mateiz
Contributor

mateiz commented Aug 5, 2014

Hey Cheng, a couple of comments on this:

  • In Java, we should be able to run without a --primary if only --jars and --main-class are given. I'd rather support that than add the primary argument here.
  • If we do this, then the --primary argument becomes relevant only for Python, and we can give it a specific name there, e.g. --main-file. Another option is to add --module and execute the __main__ function in a specific module with python -m module-name. This is the more Pythonic way to do it AFAIK but it will also be more work, so I'd just stick to --main-file in 1.1 and add --module later.

@mateiz
Contributor

mateiz commented Aug 5, 2014

BTW if we do this then there's no need to special-case spark-internal and such, right? We can just pass a class name. Or let me know if that's not true. In that case I'd say we should have a "for Spark use only" parameter here but not expose --primary.

@liancheng
Contributor Author

Thanks Matei, Patrick's last few comments have already convinced me to remove the "primary" notion from the user's perspective. And yes, spark-internal can be removed this way; spark-shell and pyspark-shell can also be removed by checking --class in the Spark scripts. Internally, to keep code that relies on primaryResource intact (one example is the cluster deploy mode in a standalone cluster), we can still pick the first entry of --jars for Java/Scala apps and --main-file for Python apps as the primary.
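
A tiny sketch of that internal selection (hypothetical helper; assumes --jars and the proposed --main-file have already been parsed into jars and mainFile):

// Internally, treat the first --jars entry (JVM apps) or the --main-file value
// (Python apps) as the primary resource, so code relying on primaryResource keeps working.
def internalPrimary(jars: Seq[String], mainFile: Option[String]): Option[String] =
  mainFile.orElse(jars.headOption)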

@mateiz
Contributor

mateiz commented Aug 5, 2014

Sure, that sounds good. The only thing is, if you use the first entry in --jars, I wouldn't automatically look for a main class in it (there's the part that checks the JAR manifest). Instead, make them use --main-class in that case. Otherwise it's a bit confusing that you give only JARs and it starts running some program.

asfgit pushed a commit that referenced this pull request Aug 6, 2014
JIRA issues:

- Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
- Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)

Related PR:

- #1715

This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires some API level changes that need further discussion, and we decided not to include it in Spark 1.1 release. As currently SPARK-2678 only affects Spark SQL scripts, this workaround is enough for Spark 1.1. Command line option handling logic in bash scripts looks somewhat dirty and duplicated, but it helps to provide a cleaner user interface as well as retain full downward compatibility for now.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1801 from liancheng/spark-2874 and squashes the following commits:

8045d7a [Cheng Lian] Make sure test suites pass
8493a9e [Cheng Lian] Using eval to retain quoted arguments
aed523f [Cheng Lian] Fixed typo in bin/spark-sql
f12a0b1 [Cheng Lian] Worked arount SPARK-2678
daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts
(cherry picked from commit a6cd311)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
chutium added a commit to chutium/spark that referenced this pull request Aug 8, 2014
asfgit pushed a commit that referenced this pull request Aug 10, 2014
As sryza reported, spark-shell doesn't accept any flags.
The root cause is wrong usage of spark-submit in spark-shell, and it came to the surface with #1801

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, Closes #1864, and Closes #1861

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
@asfgit asfgit closed this in 4f4a988 Aug 10, 2014
@liancheng liancheng deleted the spark-2678 branch September 24, 2014 00:09