Including sparkDeployMode parameter #3946
Conversation
This Spark parameter is experimental, is only used in cluster deploy mode, and was causing issues.
Codecov Report
@@             Coverage Diff              @@
##             master     #3946      +/-  ##
=============================================
- Coverage     78.483%    78.06%    -0.422%
+ Complexity     16560     16490        -70
=============================================
  Files           1058      1058
  Lines          59682     59682
  Branches        9712      9712
=============================================
- Hits           46840     46588       -252
- Misses          9084      9346       +262
+ Partials        3758      3748        -10
@@ -20,6 +20,9 @@
     @Argument(fullName = "sparkMaster", doc="URL of the Spark Master to submit jobs to when using the Spark pipeline runner.", optional = true)
     private String sparkMaster = SparkContextFactory.DEFAULT_SPARK_MASTER;

+    @Argument(fullName = "sparkDeployMode", doc="Mode of the spark driver. Default value: client. Possible values: {client, cluster}.", optional = true)
+    private String sparkDeployMode = "client";
I guess that this should be set in the SparkContext for the SparkCommandLineProgram using the SparkContextFactory. But as a suggestion for downstream projects: would it be possible to make SparkCommandLineArgumentCollection abstract (or an interface), and to pass the specified sparkMaster into the map returned by getSparkProperties()? (I guess that using propertyMap.put("spark.master", sparkMaster) should work as expected; see the sketch below.)
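A minimal sketch of this suggestion, assuming a simplified version of the class (the getUserSparkProperties() helper is hypothetical, and the real GATK signatures may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: an abstract argument collection whose getSparkProperties()
// also injects the chosen master (and deploy mode) into the property map,
// so downstream projects can supply their own argument handling.
public abstract class SparkCommandLineArgumentCollection {

    protected String sparkMaster = "local[*]";   // illustrative default
    protected String sparkDeployMode = "client"; // illustrative default

    /** Properties supplied by the user, e.g. via repeated --conf arguments (hypothetical hook). */
    protected abstract Map<String, String> getUserSparkProperties();

    /** All Spark properties, including the master and deploy mode. */
    public Map<String, String> getSparkProperties() {
        final Map<String, String> propertyMap = new LinkedHashMap<>(getUserSparkProperties());
        propertyMap.put("spark.master", sparkMaster);
        propertyMap.put("spark.submit.deployMode", sparkDeployMode);
        return propertyMap;
    }
}
```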
As the deploy-mode is related to the Spark driver (i.e. the SparkContext), I put it next to sparkMaster rather than with the more job-related Spark parameters (such as the programName). However, I could, and should, add an accessor (getSparkDeployMode) and use it.
At first I checked whether SparkContext could provide the default deploy-mode (which is 'client' most of the time), so that I could write something like private String sparkDeployMode = SparkContextFactory.DEFAULT_SPARK_DEPLOYMODE; but I failed. I could invest more time and add it myself, but I didn't have enough.
I don't think I really understand your second part. Do you mean getSparkProperties() should include the sparkMaster property too? Isn't that already the case? (I didn't check that part :/)
Note: I'm not a GATK developer; I'm just someone who needs the ability to set the deploy-mode. Since I had to edit the code to remove a "bug" (the userClassPathFirst property set to true instead of the default false), I took the opportunity to implement the deploy-mode parameter so I can avoid passing --conf spark.submit.deployMode=cluster by hand. I'm not sure I've done it the GATK way...
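For reference, a member-level sketch of what that accessor could look like in the argument collection shown in the diff above. The @Argument declaration mirrors this PR's addition, while the getSparkDeployMode() getter is only the suggested follow-up and not part of the committed change:

```java
// Mirrors the field added in this PR's diff.
@Argument(fullName = "sparkDeployMode",
          doc = "Mode of the spark driver. Default value: client. Possible values: {client, cluster}.",
          optional = true)
private String sparkDeployMode = "client";

// Suggested accessor (assumption, not yet in the change), so that callers such
// as the Spark context/launcher setup can read the configured deploy mode.
public String getSparkDeployMode() {
    return sparkDeployMode;
}
```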
@AxVE Thanks for this PR. We really appreciate your interest and work on resolving this issue! It might take a little while for me to get to reviewing it properly; we're currently preparing for our release and we're a bit swamped with various issues. I'm worried about changing the userClassPathFirst setting without more testing, though.
@cwhelan Would you be able to test your pipeline with "spark.driver.userClassPathFirst" : "false" and see if you run into any issues?
I'm also a bit confused about why the change to the arguments is necessary. Clearly in your environment it is, but it goes against my understanding of how we set the arguments to spark-submit, so I want to properly understand why the existing --deploy-mode arguments aren't working for you before adding an additional hardcoded argument to the launch script. (As I'm sure you've seen, the launch script is a pretty crufty and brittle piece of code that was really meant to be replaced with a more robust solution by now, so any additional complexity in it would be great to avoid...)
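For context, spark.driver.userClassPathFirst is a standard (though experimental) Spark configuration key whose default is false. A minimal, standalone sketch of pinning it to that default on a plain SparkConf (the class and app name are placeholders, not GATK code; in GATK the key would normally be passed via the launch script or a --conf argument):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Standalone illustration only: forces spark.driver.userClassPathFirst to
// Spark's default value (false) when building the driver's configuration.
public class UserClassPathFirstExample {
    public static void main(String[] args) {
        final SparkConf conf = new SparkConf()
                .setAppName("UserClassPathFirstExample")
                .setMaster("local[*]")
                .set("spark.driver.userClassPathFirst", "false");
        try (final JavaSparkContext sc = new JavaSparkContext(conf)) {
            // ... run the pipeline under test here ...
        }
    }
}
```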
@lbergelson I understand that the change to userClassPathFirst can cause problems; I added the parameter in my fork to test it. I still think it's important to have it among GATK's parameters because it's simpler for users, but it's not an emergency. For me, the userClassPathFirst change is the important one, or at least a parameter to specify it. Without it, I can't get my jobs to work in cluster mode.
@lbergelson I recheck with
Update to gatk 4.0
@cwhelan Is there any chance you could run an SV pipeline with this change and see if it works? We added the classpath setting a long time ago for mysterious reasons, and have been afraid to remove it because we don't have good automated tests that run on the actual Dataproc environment. I ran our very simple tests with this change, but I want to check that it doesn't have negative consequences for your tools. I would really like to merge this if we can, because it's recommended that you don't use this option unless you absolutely have to.
I just ran our pipeline with
We haven't been able to find any justification for keeping this set to true. It seems likely we've outgrown whatever issue this was a workaround for. We'll have to monitor and see if we hit any new issues after changing this, but tests, automated and manual, haven't found anything.
@AxVE Thanks for this! Sorry for the long wait. We'll have to monitor to see if changing it introduces some obscure issues for us that we haven't been able to figure out yet, but it seems like it's probably safe.
@lbergelson No problem, and thank you. If I find any other issues on my side (I'm going to test some Spark tools), I'll let you know.
A solution for issue #3933.