Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Picard integration. #3620

Merged
merged 1 commit into from
Oct 2, 2017
Merged

Picard integration. #3620

merged 1 commit into from
Oct 2, 2017

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Sep 27, 2017

Notes:

  • Classes in hellbender/tools/picard/analysis/artifacts are removed and replaced with Picard versions (except Transition, which is not public in Picard).
  • GATK version of GatherVcfs is retained, and the Picard version is masked out - is this what we want ?
  • The non-Spark GATK metrics tools have been removed and replaced with the Picard versions. The test data is retained (but moved) since its used by the Spark metrics tool tests.

Additional changes we'll want to make separately to minimize the complexity of this PR:

  • Eliminate the download of picard.jar from the GATK WDL tests and update the WDL to run Picard tools through GATK.
  • Unify and merge the Picard and GATK program groups. These are similar, but not identical, and the combined result has artificial/duplicate groups.
  • Normalize the confusing mix of Alpha/Beta/Experimental tags and comments.
  • Add unified doc and tab-completion tasks that include Picard.
  • Remove and replace SamComparison and Transition classes with the Picard versions.
  • Fix GATK CompareBaseQualities (its a PicardCommandLineProgram).

@droazen droazen assigned droazen and lbergelson and unassigned droazen Sep 27, 2017
@codecov-io
Copy link

codecov-io commented Sep 27, 2017

Codecov Report

Merging #3620 into master will decrease coverage by 0.477%.
The diff coverage is 80.556%.

@@               Coverage Diff               @@
##              master     #3620       +/-   ##
===============================================
- Coverage     79.846%   79.368%   -0.477%     
+ Complexity     18309     17286     -1023     
===============================================
  Files           1226      1140       -86     
  Lines          67177     62438     -4739     
  Branches       10507      9482     -1025     
===============================================
- Hits           53638     49556     -4082     
+ Misses          9288      9108      -180     
+ Partials        4251      3774      -477
Impacted Files Coverage Δ Complexity Δ
...llbender/utils/read/markduplicates/PairedEnds.java 84.783% <ø> (ø) 22 <0> (ø) ⬇️
...spark/pipelines/metrics/MetricsCollectorSpark.java 100% <ø> (ø) 3 <0> (ø) ⬇️
...ntationbiasvariantfilter/OrientationBiasUtils.java 84.956% <ø> (ø) 56 <0> (ø) ⬇️
...ellbender/tools/exome/FilterByOrientationBias.java 83.019% <ø> (ø) 14 <0> (ø) ⬇️
...iantfilter/OrientationSampleTransitionSummary.java 52.941% <ø> (ø) 9 <0> (ø) ⬇️
...tionbiasvariantfilter/OrientationBiasFilterer.java 93.956% <ø> (ø) 58 <0> (ø) ⬇️
.../markduplicates/MarkDuplicatesScoringStrategy.java 82.353% <ø> (ø) 7 <0> (ø) ⬇️
...institute/hellbender/utils/read/SamComparison.java 0% <ø> (-80.364%) 0 <0> (-86)
...hellbender/utils/read/markduplicates/ReadsKey.java 94.444% <ø> (ø) 9 <0> (ø) ⬇️
...biasvariantfilter/PreAdapterOrientationScorer.java 96.429% <ø> (ø) 5 <0> (ø) ⬇️
... and 50 more

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and minor quibbles. Looks reasonable to me.


// Turn off the Picard legacy parser and opt in to Barclay syntax for Picard tools. This should be replaced
// with a config setting once PR https://github.com/broadinstitute/gatk/pull/3447 is merged.
System.setProperty("picard.useLegacyParser", "false");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it should be hardcoded and not set in the config. I don't think we want to allow people to toggle it at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it hurts to allow this to be configured by users who know what they're doing.

@@ -282,7 +294,15 @@ private static CommandLineProgram extractCommandLineProgram(final String[] args,
if (simpleNameToClass.containsKey(args[0])) {
final Class<?> clazz = simpleNameToClass.get(args[0]);
try {
return (CommandLineProgram) clazz.newInstance();
if (clazz.getName().startsWith("picard.")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we care, we can probably remove this warning by instantiating first and then doing an instance of check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that an instanceof check would be slightly better here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/**
* This is the entry point for Picard tools that are called from GATK.
*/
public Object instanceMain(final String[] argv) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing @Override?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -229,7 +230,7 @@ protected void saveResults(final MetricsFile<?, Integer> metrics, final SAMFileH
plotSubtitle = StringUtil.asEmptyIfNull(readGroups.get(0).getLibrary());
}
final RScriptExecutor executor = new RScriptExecutor();
executor.addScript(new Resource(CollectBaseDistributionByCycle.R_SCRIPT, CollectBaseDistributionByCycle.class));
executor.addScript(new Resource(R_SCRIPT, CollectBaseDistributionByCycleSpark.class));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test that the rscript still resolves? I don't think the tests invoke the Rscripts in most cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it manually and it resolves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add unit tests for all of these R scripts that just assert that they are resolvable.

@@ -1,4 +1,4 @@
package org.broadinstitute.hellbender.tools.picard.analysis.artifacts;
package org.broadinstitute.hellbender.utils.artifacts;

import htsjdk.samtools.util.SequenceUtil;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably open a ticket to expose this in picard and remove it in gatk.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. #3632

sort.SORT_ORDER = SAMFileHeader.SortOrder.coordinate;
sort.VALIDATION_STRINGENCY = stringency;
sort.runTool();
// this is running a Picard tools do don't use ArgumentBuilder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use argbuilder with picard tools?
Also, typo do don't

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the comment is misleading and unnecessary.


@Override
public Object doWork() {
// This method should never be called directly. Call instanceMain instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be public?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be - good catch.

"--LINE_LENGTH", "5",
};

Assert.assertNotEquals(runCommandLine(args), 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we get a bad return value here instead of an exception bubbling up? If the args are no good shouldn't barclay be throwing an exception?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picard swallows Barclay exceptions and propagates them as a return code since it also has to work with the legacy parser, which uses return codes instead of exceptions.

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Sep 27, 2017

@LeeTL1220 Can you review, or appoint someone to review, the Mutect2 WDL and the tools/exome and tools/picard/analysis/artifacts packages part of this. Basically, I replaced most of the classes in the artifacts package with their Picard analogs, which also affected some embedded class names in metrics files (SequencingArtifactMetrics IIRC). The are multiple commits in here that delete various things, but the substantive changes are the last commit ("Integrate actual Picard tools via Picard"). Also, once this is in, we'll probably want to do this separately as well.

Copy link
Contributor

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be grateful if you consider the following comments. The changes in the Main class have a huge impact in downstream projects, making impossible to tweak properly the command line. The Picard hack adds a lot of complexity in addition to make Main difficult to extend for downstream projects. All these can be solved by a simple patch in Barclay with a common interface to be extended by dowsntream projects for runnable tools.

/** Get the set of classes that are our command line programs **/
final ClassFinder classFinder = new ClassFinder();
for (final String pkg : packageList) {
classFinder.find(pkg, picard.cmdline.CommandLineProgram.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting really complicated for combining projects/toolkits and re-use this main. I guess that my idea of adding a really simple CommandLineProgram interface in Barclay is worth now and something simple to implement (just with a single method instanceMain(String[] args) to being able to combine toolkits without this big changes. I will submit a PR to add that simple layout to simplify things.

In addtion, a simple Main class could be implemented in Barclay to keep some of this logic and simplify things a bit, but that's a bigger change. I would submit a different PR for this to Barclay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, without the interface in Barclay, downstream projects are unable to specify single Picard classes in getClassList(), and it would require yet another method to return picard classess...which is kind of cumbersome...

// usually because want to favor the like-named GATK tool. Any CommandLineProgram that shares a simple class
// name with a Picard tool must be on this list. Failure to mask it out will result in name collisions of
// simple class names in Main.
private static Set<String> getMaskedToolClasses() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to getPackageList() and getClassList(), this should be a protected non-static method, overridable but downstream projects, to provide a set of tools to remove from the command line. Downstream projects require more control on this.

On the other hand, I suggest to change the signature to return List<Class<? extends CommandLineProgram>>, but that should return a common interface for tools (both Picard and GATK).

for (final Class<?> clazz : toCheck) {
if (maskedPicardToolClasses.contains(clazz.getName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using clazz.getName(), here a direct comparison of classes will be better if there is a common interface....

@SuppressWarnings("unchecked")
Class<? extends picard.cmdline.CommandLineProgram> picardClazz =
(Class<? extends picard.cmdline.CommandLineProgram>) clazz;
return new PicardCommandLineProgramExecutor(picardClazz.newInstance());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The common interface in Barclay will reduce the complexity here and keep the previous code untouchable...

* it's own CommandLineProgramProperties annotation, and isn't intended to be run from the
* command line.
*/
public class PicardCommandLineProgramExecutor extends CommandLineProgram {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like this hack at all...for downstream projects is going to be very difficult to include/exclude picard tools...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class is fine -- it's just a simple adapter/shim, which is a pretty standard design pattern when integrating multiple codebases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but it is impossible to include single Picard tools at the moment in Main due to this adapter class...

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Assuming the tests all pass, I'm fine. We already talked in person about additional concerns that you have addressed (e.g. VcfToIntervalList)

Fix the one typo I found and feel free to merge pending other reviewers.

# Convert to GATK format
sed -r "s/picard\.analysis\.artifacts\.SequencingArtifactMetrics\\\$PreAdapterDetailMetrics/org\.broadinstitute\.hellbender\.tools\.picard\.analysis\.artifacts\.SequencingArtifactMetrics\$PreAdapterDetailMetrics/g" \
"metrics.pre_adapter_detail_metrics" > "gatk.pre_adapter_detail_metrics"
java -Xmx4G -jar ${picard_jar} CollectSequencingArtifactMetrics I=${tumor_bam} O="gatk" R=${ref_fasta} VALIDATION_STRINGENCY=LENIENT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great assuming it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I confirmed that FilterByOrientationBias is being run.

sort.SORT_ORDER = SAMFileHeader.SortOrder.coordinate;
sort.VALIDATION_STRINGENCY = stringency;
sort.runTool();
// this is running a Picard tools do don't use ArgumentBuilder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "tools" --> "tool"

maskedPicardToolClasses.add("org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor");

// Mask out picard GatherVcfs in favor of GATK GatherVcfs
maskedPicardToolClasses.add("picard.vcf.GatherVcfs");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would eliminate the masking of Picard's GatherVcfs entirely, and just rename the GATK version. Suggested name for the GATK version: GatherVcfsCloud

"--TRUNCATE_SEQUENCE_NAMES_AT_WHITESPACE", "TRUE",
"--LINE_LENGTH", "5",
};

runCommandLine(args);

Assert.assertNotEquals(runCommandLine(args), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment highlighting that you're expecting a non-zero return code here.


// use GATK-style lower case names, which are rejected by Picard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need slightly clearer explanation here of why this fails (eg., mention explicitly that the Picard arg is named --OUTPUT)

@@ -2,8 +2,9 @@

import htsjdk.samtools.metrics.MetricsFile;
import org.apache.commons.math3.linear.RealMatrix;
import org.broadinstitute.hellbender.tools.picard.analysis.artifacts.SequencingArtifactMetrics;
import org.broadinstitute.hellbender.tools.picard.analysis.artifacts.Transition;
//import org.broadinstitute.hellbender.utils.artifacts.SequencingArtifactMetrics;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented-out import statement

@@ -14,7 +14,7 @@
import java.util.List;

public final class CollectBaseDistributionByCycleSparkIntegrationTest extends CommandLineProgramTest {
private static final File TEST_DATA_DIR = new File(getTestDataDir(), "picard/analysis/CollectBaseDistributionByCycle");
private static final File TEST_DATA_DIR = new File(getTestDataDir(), "../metrics/analysis/CollectBaseDistributionByCycle");
Copy link
Contributor

@droazen droazen Sep 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid use of .. in test data paths (applies below as well)

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few comments. I mainly want to see the GATK GatherVcfs renamed (and the masking for that tool removed), and basic unit tests added to prove that we can still resolve all needed R scripts. It also looks like there may be some R scripts and test files that are unused but still checked in, but you can deal with these in a separate branch.

Merge after addressing comments (don't wait for another review pass)

@droazen droazen assigned cmnbroad and unassigned droazen and lbergelson Sep 28, 2017
@cmnbroad
Copy link
Collaborator Author

@magicDGS I think the suggestion of using a common interface is a reasonable proposal, but we can't hold up this PR for that. We'll have to do that in a separate PR once the interface is propagated through the various projects.

@cmnbroad cmnbroad force-pushed the cn_picard branch 6 times, most recently from 99db69e to 777db7b Compare September 29, 2017 13:28
Copy link
Contributor

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you cannot wait, @cmnbroad - I've just feel that this requires at least some possibility to tweak which tools are included from Picard. See below my suggestions...

@@ -88,6 +93,7 @@ protected static void printDecoratedExceptionMessage(final PrintStream ps, final
protected List<String> getPackageList() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add to the documentation the packages that are included at the moment, for downstream projects being aware that Picard is also there...

/** Get the set of classes that are our command line programs **/
final ClassFinder classFinder = new ClassFinder();
for (final String pkg : packageList) {
classFinder.find(pkg, picard.cmdline.CommandLineProgram.class);
classFinder.find(pkg, CommandLineProgram.class);
}
String missingAnnotationClasses = "";
final Set<Class<?>> toCheck = classFinder.getClasses();
toCheck.addAll(classList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For downstream projects, it may be useful until the CLP interface is included in Barclay (and used in Picard/GATK) to have the possibility to add single Picard classes instead of package-specific...similar to getClassList. Maybe the signature of getClassList can change in the meanwhile to just return Class<?> and check here if it is one of the accepted classes?

@cmnbroad cmnbroad force-pushed the cn_picard branch 2 times, most recently from 8238389 to 3337824 Compare September 30, 2017 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants