-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Picard integration. #3620
Picard integration. #3620
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3620 +/- ##
===============================================
- Coverage 79.846% 79.368% -0.477%
+ Complexity 18309 17286 -1023
===============================================
Files 1226 1140 -86
Lines 67177 62438 -4739
Branches 10507 9482 -1025
===============================================
- Hits 53638 49556 -4082
+ Misses 9288 9108 -180
+ Partials 4251 3774 -477
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and minor quibbles. Looks reasonable to me.
|
||
// Turn off the Picard legacy parser and opt in to Barclay syntax for Picard tools. This should be replaced | ||
// with a config setting once PR https://github.com/broadinstitute/gatk/pull/3447 is merged. | ||
System.setProperty("picard.useLegacyParser", "false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it should be hardcoded and not set in the config. I don't think we want to allow people to toggle it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it hurts to allow this to be configured by users who know what they're doing.
@@ -282,7 +294,15 @@ private static CommandLineProgram extractCommandLineProgram(final String[] args, | |||
if (simpleNameToClass.containsKey(args[0])) { | |||
final Class<?> clazz = simpleNameToClass.get(args[0]); | |||
try { | |||
return (CommandLineProgram) clazz.newInstance(); | |||
if (clazz.getName().startsWith("picard.")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we care, we can probably remove this warning by instantiating first and then doing an instance of check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that an instanceof
check would be slightly better here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/** | ||
* This is the entry point for Picard tools that are called from GATK. | ||
*/ | ||
public Object instanceMain(final String[] argv) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing @Override
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -229,7 +230,7 @@ protected void saveResults(final MetricsFile<?, Integer> metrics, final SAMFileH | |||
plotSubtitle = StringUtil.asEmptyIfNull(readGroups.get(0).getLibrary()); | |||
} | |||
final RScriptExecutor executor = new RScriptExecutor(); | |||
executor.addScript(new Resource(CollectBaseDistributionByCycle.R_SCRIPT, CollectBaseDistributionByCycle.class)); | |||
executor.addScript(new Resource(R_SCRIPT, CollectBaseDistributionByCycleSpark.class)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test that the rscript still resolves? I don't think the tests invoke the Rscripts in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it manually and it resolves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add unit tests for all of these R scripts that just assert that they are resolvable.
@@ -1,4 +1,4 @@ | |||
package org.broadinstitute.hellbender.tools.picard.analysis.artifacts; | |||
package org.broadinstitute.hellbender.utils.artifacts; | |||
|
|||
import htsjdk.samtools.util.SequenceUtil; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably open a ticket to expose this in picard and remove it in gatk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. #3632
sort.SORT_ORDER = SAMFileHeader.SortOrder.coordinate; | ||
sort.VALIDATION_STRINGENCY = stringency; | ||
sort.runTool(); | ||
// this is running a Picard tools do don't use ArgumentBuilder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use argbuilder with picard tools?
Also, typo do don't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the comment is misleading and unnecessary.
|
||
@Override | ||
public Object doWork() { | ||
// This method should never be called directly. Call instanceMain instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be - good catch.
"--LINE_LENGTH", "5", | ||
}; | ||
|
||
Assert.assertNotEquals(runCommandLine(args), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we get a bad return value here instead of an exception bubbling up? If the args are no good shouldn't barclay be throwing an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picard swallows Barclay exceptions and propagates them as a return code since it also has to work with the legacy parser, which uses return codes instead of exceptions.
@LeeTL1220 Can you review, or appoint someone to review, the Mutect2 WDL and the tools/exome and tools/picard/analysis/artifacts packages part of this. Basically, I replaced most of the classes in the artifacts package with their Picard analogs, which also affected some embedded class names in metrics files (SequencingArtifactMetrics IIRC). The are multiple commits in here that delete various things, but the substantive changes are the last commit ("Integrate actual Picard tools via Picard"). Also, once this is in, we'll probably want to do this separately as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be grateful if you consider the following comments. The changes in the Main
class have a huge impact in downstream projects, making impossible to tweak properly the command line. The Picard hack adds a lot of complexity in addition to make Main
difficult to extend for downstream projects. All these can be solved by a simple patch in Barclay with a common interface to be extended by dowsntream projects for runnable tools.
/** Get the set of classes that are our command line programs **/ | ||
final ClassFinder classFinder = new ClassFinder(); | ||
for (final String pkg : packageList) { | ||
classFinder.find(pkg, picard.cmdline.CommandLineProgram.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting really complicated for combining projects/toolkits and re-use this main. I guess that my idea of adding a really simple CommandLineProgram
interface in Barclay is worth now and something simple to implement (just with a single method instanceMain(String[] args)
to being able to combine toolkits without this big changes. I will submit a PR to add that simple layout to simplify things.
In addtion, a simple Main
class could be implemented in Barclay to keep some of this logic and simplify things a bit, but that's a bigger change. I would submit a different PR for this to Barclay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, without the interface in Barclay, downstream projects are unable to specify single Picard classes in getClassList()
, and it would require yet another method to return picard classess...which is kind of cumbersome...
// usually because want to favor the like-named GATK tool. Any CommandLineProgram that shares a simple class | ||
// name with a Picard tool must be on this list. Failure to mask it out will result in name collisions of | ||
// simple class names in Main. | ||
private static Set<String> getMaskedToolClasses() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to getPackageList()
and getClassList()
, this should be a protected non-static method, overridable but downstream projects, to provide a set of tools to remove from the command line. Downstream projects require more control on this.
On the other hand, I suggest to change the signature to return List<Class<? extends CommandLineProgram>>
, but that should return a common interface for tools (both Picard and GATK).
for (final Class<?> clazz : toCheck) { | ||
if (maskedPicardToolClasses.contains(clazz.getName())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using clazz.getName()
, here a direct comparison of classes will be better if there is a common interface....
@SuppressWarnings("unchecked") | ||
Class<? extends picard.cmdline.CommandLineProgram> picardClazz = | ||
(Class<? extends picard.cmdline.CommandLineProgram>) clazz; | ||
return new PicardCommandLineProgramExecutor(picardClazz.newInstance()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The common interface in Barclay will reduce the complexity here and keep the previous code untouchable...
* it's own CommandLineProgramProperties annotation, and isn't intended to be run from the | ||
* command line. | ||
*/ | ||
public class PicardCommandLineProgramExecutor extends CommandLineProgram { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like this hack at all...for downstream projects is going to be very difficult to include/exclude picard tools...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class is fine -- it's just a simple adapter/shim, which is a pretty standard design pattern when integrating multiple codebases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but it is impossible to include single Picard tools at the moment in Main
due to this adapter class...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Assuming the tests all pass, I'm fine. We already talked in person about additional concerns that you have addressed (e.g. VcfToIntervalList)
Fix the one typo I found and feel free to merge pending other reviewers.
# Convert to GATK format | ||
sed -r "s/picard\.analysis\.artifacts\.SequencingArtifactMetrics\\\$PreAdapterDetailMetrics/org\.broadinstitute\.hellbender\.tools\.picard\.analysis\.artifacts\.SequencingArtifactMetrics\$PreAdapterDetailMetrics/g" \ | ||
"metrics.pre_adapter_detail_metrics" > "gatk.pre_adapter_detail_metrics" | ||
java -Xmx4G -jar ${picard_jar} CollectSequencingArtifactMetrics I=${tumor_bam} O="gatk" R=${ref_fasta} VALIDATION_STRINGENCY=LENIENT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great assuming it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I confirmed that FilterByOrientationBias is being run.
sort.SORT_ORDER = SAMFileHeader.SortOrder.coordinate; | ||
sort.VALIDATION_STRINGENCY = stringency; | ||
sort.runTool(); | ||
// this is running a Picard tools do don't use ArgumentBuilder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "tools" --> "tool"
maskedPicardToolClasses.add("org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor"); | ||
|
||
// Mask out picard GatherVcfs in favor of GATK GatherVcfs | ||
maskedPicardToolClasses.add("picard.vcf.GatherVcfs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would eliminate the masking of Picard's GatherVcfs
entirely, and just rename the GATK version. Suggested name for the GATK version: GatherVcfsCloud
"--TRUNCATE_SEQUENCE_NAMES_AT_WHITESPACE", "TRUE", | ||
"--LINE_LENGTH", "5", | ||
}; | ||
|
||
runCommandLine(args); | ||
|
||
Assert.assertNotEquals(runCommandLine(args), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment highlighting that you're expecting a non-zero return code here.
|
||
// use GATK-style lower case names, which are rejected by Picard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need slightly clearer explanation here of why this fails (eg., mention explicitly that the Picard arg is named --OUTPUT
)
@@ -2,8 +2,9 @@ | |||
|
|||
import htsjdk.samtools.metrics.MetricsFile; | |||
import org.apache.commons.math3.linear.RealMatrix; | |||
import org.broadinstitute.hellbender.tools.picard.analysis.artifacts.SequencingArtifactMetrics; | |||
import org.broadinstitute.hellbender.tools.picard.analysis.artifacts.Transition; | |||
//import org.broadinstitute.hellbender.utils.artifacts.SequencingArtifactMetrics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out import statement
@@ -14,7 +14,7 @@ | |||
import java.util.List; | |||
|
|||
public final class CollectBaseDistributionByCycleSparkIntegrationTest extends CommandLineProgramTest { | |||
private static final File TEST_DATA_DIR = new File(getTestDataDir(), "picard/analysis/CollectBaseDistributionByCycle"); | |||
private static final File TEST_DATA_DIR = new File(getTestDataDir(), "../metrics/analysis/CollectBaseDistributionByCycle"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd avoid use of ..
in test data paths (applies below as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a few comments. I mainly want to see the GATK GatherVcfs
renamed (and the masking for that tool removed), and basic unit tests added to prove that we can still resolve all needed R scripts. It also looks like there may be some R scripts and test files that are unused but still checked in, but you can deal with these in a separate branch.
Merge after addressing comments (don't wait for another review pass)
@magicDGS I think the suggestion of using a common interface is a reasonable proposal, but we can't hold up this PR for that. We'll have to do that in a separate PR once the interface is propagated through the various projects. |
99db69e
to
777db7b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that you cannot wait, @cmnbroad - I've just feel that this requires at least some possibility to tweak which tools are included from Picard. See below my suggestions...
@@ -88,6 +93,7 @@ protected static void printDecoratedExceptionMessage(final PrintStream ps, final | |||
protected List<String> getPackageList() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add to the documentation the packages that are included at the moment, for downstream projects being aware that Picard is also there...
/** Get the set of classes that are our command line programs **/ | ||
final ClassFinder classFinder = new ClassFinder(); | ||
for (final String pkg : packageList) { | ||
classFinder.find(pkg, picard.cmdline.CommandLineProgram.class); | ||
classFinder.find(pkg, CommandLineProgram.class); | ||
} | ||
String missingAnnotationClasses = ""; | ||
final Set<Class<?>> toCheck = classFinder.getClasses(); | ||
toCheck.addAll(classList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For downstream projects, it may be useful until the CLP interface is included in Barclay (and used in Picard/GATK) to have the possibility to add single Picard classes instead of package-specific...similar to getClassList
. Maybe the signature of getClassList
can change in the meanwhile to just return Class<?>
and check here if it is one of the accepted classes?
8238389
to
3337824
Compare
Notes:
Additional changes we'll want to make separately to minimize the complexity of this PR: