[SPARK-6186] [EC2] Make Tachyon version configurable in EC2 deployment script #4901
Conversation
Can one of the admins verify this patch?
Please make a JIRA.
Ok to test.
@uronce-cc you need to update the title of this PR. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
@@ -871,10 +898,11 @@ def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules):

     if "." in opts.spark_version:
         # Pre-built Spark deploy
-        spark_v = get_validate_spark_version(opts.spark_version, opts.spark_git_repo)
+        (spark_v, tachyon_v) = get_spark_tachyon_version(opts)
Why are we removing the validation of the Spark version here? And why are we ignoring the user-specified GitHub repo?
This change doesn't look correct to me. Did I misunderstand something?
I don't think we should duplicate the list of valid Spark versions. Instead, it probably makes more sense to have a separate dictionary just for Tachyon version mappings and put that towards the beginning of the script as a global so it's easy to find and update, since that will be done manually. Also, let's please reuse the existing code to validate the Spark version and avoid method signatures that just take an opts object. For example:

SPARK_TACHYON_MAP = {
    "1.0.0": "0.4.1",
    "1.0.1": "0.4.1",
    "1.0.2": "0.4.1",
    "1.1.0": "0.5.0",
    "1.1.1": "0.5.0",
    "1.2.0": "0.5.0",
    "1.2.1": "0.5.0",
}

def get_tachyon_version(spark_version):
    if spark_version not in SPARK_TACHYON_MAP:
        return ""
    else:
        return SPARK_TACHYON_MAP[spark_version]

if "." in opts.spark_version:
    spark_v = get_validate_spark_version(opts.spark_version, opts.spark_git_repo)
    tachyon_v = get_tachyon_version(spark_v)
else:
    ...
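The suggested helper can be exercised as a self-contained sketch (the version pairs are copied from the suggestion above; the dict-based lookup is equivalent to the if/else form):

```python
# Module-level map from Spark release to the matching Tachyon release,
# as suggested in the review comment above.
SPARK_TACHYON_MAP = {
    "1.0.0": "0.4.1",
    "1.0.1": "0.4.1",
    "1.0.2": "0.4.1",
    "1.1.0": "0.5.0",
    "1.1.1": "0.5.0",
    "1.2.0": "0.5.0",
    "1.2.1": "0.5.0",
}

def get_tachyon_version(spark_version):
    # Unknown versions (e.g. git hashes) map to an empty string,
    # which downstream code can treat as "no Tachyon".
    return SPARK_TACHYON_MAP.get(spark_version, "")
```

Keeping the map at module level means new Spark releases only require adding one line, rather than editing logic inside `deploy_files`.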
jenkins, test this please
Test build #28300 has started for PR 4901 at commit
Test build #28300 has finished for PR 4901 at commit
Test FAILed.
@nchammas comments resolved. Thanks.
@uronce-cc this apparently still does not pass the Python style checks. Please have a look at the Jenkins output and/or run the checks locally.
FYI @haoyuan @uronce-cc: TACHYON-11 is not publicly visible.
@@ -148,6 +160,10 @@ def parse_args():
     "-v", "--spark-version", default=DEFAULT_SPARK_VERSION,
     help="Version of Spark to use: 'X.Y.Z' or a specific git hash (default: %default)")
+parser.add_option(
+    "--tachyon-version",
+    default=DEFAULT_TACHYON_VERSION,
+    help="When --spark-version is a git hash, this will be used as the version of Tachyon")
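In isolation, the flag in the hunk above behaves like this minimal optparse sketch (the 0.5.0 default is an illustrative placeholder, not a value from the PR):

```python
from optparse import OptionParser

DEFAULT_TACHYON_VERSION = "0.5.0"  # placeholder default for the sketch

parser = OptionParser()
parser.add_option(
    "--tachyon-version",
    default=DEFAULT_TACHYON_VERSION,
    help="When --spark-version is a git hash, this will be used as the version of Tachyon")

# optparse derives the dest "tachyon_version" from the long option name.
# With no flag given the default applies; an explicit flag overrides it.
opts_default, _ = parser.parse_args([])
opts_explicit, _ = parser.parse_args(["--tachyon-version", "0.6.0"])
```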
The wording here is a little confusing, I think. So this option only takes effect if --spark-version is a git hash, and is otherwise ignored, right?
And what's the purpose of the default value here? It doesn't seem to ever come into play.
Jenkins, retest this please.
Test build #28366 has started for PR 4901 at commit
Test build #28366 has finished for PR 4901 at commit
Test FAILed.
improve pep8 error when lacking argument
pep8 again...
@srowen please test again... still a pep8 problem.
@uronce-cc I thought you had run the checks locally per the above and resolved them. Please do that. Rerunning Jenkins would not solve it.
@srowen Yes, I've run the pep8 check locally through the pep8 util:

    21:29 cc@mbp ec2 pep8 spark_ec2.py

The current warnings/errors are not caused by my modification. Should I resolve all of the remaining errors or let them go?
@uronce-cc oh, no, you should run the lint-python script.
@srowen OK, the current PR has passed pep8 via lint-python.
I don't see that you pushed anything more to this branch?
@srowen I've squashed all subsequent commits to make the commit history clean.
It looks like @uronce-cc has pushed a new commit https://github.com/uronce-cc/spark/commit/6f8887e96d5a5aae488048038e05a0f573326b79 that hasn't been tested yet. (The older one which was tested was df054cd.) I'll kick off another Jenkins run.
Jenkins, retest this please
Test build #28368 has started for PR 4901 at commit
Oh weird, GitHub still says the commit is 2 days old. Jenkins didn't retest automatically either. It's in progress now, yes. You don't need to worry about squashing as the merge script will do that.
Could it be merged? Another task is blocked by this. Thanks.
Test build #28368 has finished for PR 4901 at commit
Test PASSed.
else:
    # Spark-only custom deploy
    spark_v = "%s|%s" % (opts.spark_git_repo, opts.spark_version)
    if opts.tachyon_version is None:
I'll let @shivaram chime in here, but this is a new restriction. Now a user cannot deploy Spark at a specific git hash anymore without also specifying a Tachyon hash.
I'm not sure that makes sense from the outset, but it also breaks users who have been using spark-ec2 just to deploy Spark.
Yeah, this seems like an unnecessary restriction for users who just want to change the Spark git hash.
Given that users who use the git hash are going to be advanced users, why don't we just skip version checks for it? If users want to use a specific git hash, I think it's fine to expect them to figure out which Tachyon version is required, etc.
@shivaram You mean for advanced users, let them manually specify tachyon_v for themselves? Or export TACHYON_VERSION=xxx manually in bash before running spark_ec2? But then we would have to disable the export TACHYON_VERSION=xx in spark_ec2.py.
I think this option is appropriate: first, for users who do not deploy via a git hash, nothing changes in the CLI; second, for those who use a git hash, this option is handy, and we just need to update the documentation.
What I meant is that we could fall back to some default Tachyon version (say 0.6) if the Spark git hash is set but the Tachyon version is not. We don't need to force the user to pass this option then, but if the user needs some specific Tachyon version or git hash, they can set it.
This can be done by just setting a default value for that flag.
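The fallback being proposed can be sketched as follows (the 0.6.0 default echoes the "say 0.6" above and the helper name is hypothetical, not code from the PR):

```python
# Hypothetical fallback: a default Tachyon version is used for git-hash
# deploys unless the user explicitly picks one.  0.6.0 is illustrative.
DEFAULT_TACHYON_VERSION = "0.6.0"

def resolve_tachyon_version(tachyon_version=None):
    # An explicit user choice wins; otherwise fall back to the default.
    if tachyon_version is not None:
        return tachyon_version
    return DEFAULT_TACHYON_VERSION
```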
In a former commit, I did set a default value. But I think if someone uses a git hash and doesn't specify a Tachyon version, and the default Tachyon version doesn't match Spark's, then it's hard for them to notice the error unless they have carefully read the documentation, the command-line help, or the source, so I removed it.
@shivaram Could you share some thoughts?
I see your point, but as @nchammas mentioned, only some fraction of Spark usage scenarios involve Tachyon. If one just wants to, say, run the master branch to do some spark-perf benchmarks, expecting the user to set a Tachyon version seems like an unnecessary burden.
If we don't want to use a default Tachyon version, I'd vote for just not starting Tachyon in cases where a Spark git hash is used (i.e., remove tachyon from the list of modules and print an info message).
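The alternative sketched here, dropping Tachyon entirely for git-hash deploys, could look roughly like this (the `select_modules` helper and the module names are illustrative, not code from spark_ec2.py):

```python
def select_modules(modules, spark_version):
    # A dotted version string means a pre-built release; anything else
    # (e.g. a git hash) is treated as a custom deploy without Tachyon.
    if "." not in spark_version and "tachyon" in modules:
        print("Deploying Spark via git hash; Tachyon won't be set up")
        return [m for m in modules if m != "tachyon"]
    return modules
```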
Agree.
@shivaram I've removed Tachyon when using a git hash.
Jenkins, retest this please
Test build #28385 has started for PR 4901 at commit
Test build #28385 has finished for PR 4901 at commit
Test PASSed.
else:
    # Spark-only custom deploy
    spark_v = "%s|%s" % (opts.spark_git_repo, opts.spark_version)
    tachyon_v = ""
    print "Deploy spark via git hash, Tachyon won't be set up"
Deploy spark -> Deploying Spark
Thanks @uronce-cc - the change looks good to me but for the minor comment inline.
So it looks like now we won't allow the Tachyon git hash to be specified if the Spark hash is specified. Fine by me. LGTM pending @shivaram's inline comment.
@shivaram Could you create branch-1.4 in mesos/spark-ec2 for my following PR to finally make Tachyon configurable? Thanks.
This PR comes from Tachyon community to solve the issue:
https://tachyon.atlassian.net/browse/TACHYON-11
An accompanying PR is in mesos/spark-ec2:
mesos/spark-ec2#101