MergeSVEvidence #7695

Merged: tedsharpe merged 1 commit into master from tws_MergeSVEvidence on Apr 7, 2022

Conversation

@tedsharpe (Contributor)

No description provided.

@mwalker174 self-assigned this Feb 25, 2022
@mwalker174 (Contributor):

@droazen May want someone from engine team to look at this since it touches some core classes.

@mwalker174 (Contributor) left a comment:

Thanks @tedsharpe this looks good so far. You've done the hard part of finagling the evidence feature classes into a nice framework for this. I've taken a first pass, excluding tests, and I just had two main comments:

  1. The tool currently just merges records in dictionary order and subsets samples when necessary. I probably wasn't clear on everything the tool should do, but I have a comment below outlining the cases where Features actually need to be merged as well (e.g. RD records at the same interval with different samples). It would also be nice to have some kind of check for collisions - for example, the same sample and interval defined in two of the input files is probably unexpected and should result in an error (see the sketch after this list).
  2. It seems like there's a lot of overlap with the MultiVariantWalker and MultiVariantDataSource classes. IMO, VariantContext implements Feature and therefore it seems redundant to have completely separate classes for Features as well. Is there a way to consolidate the code such that MultiVariantWalker extends FeatureMergingWalker and/or defining a MultiFeatureDataSource extended by MultiVariantDataSource? @droazen any thoughts?
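
To make the first point concrete, here is a rough sketch of the kind of merge-and-collision-check I mean. It uses a hypothetical per-sample count map rather than the real evidence classes, so treat the names as placeholders:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DepthRecordMerger {
    // Merge the per-sample counts of records that all share one interval.
    // Throws if the same sample shows up for that interval in two different inputs.
    static Map<String, Integer> mergeSameInterval( final String intervalKey,
                                                   final List<Map<String, Integer>> countsPerInput ) {
        final Map<String, Integer> merged = new LinkedHashMap<>();
        for ( final Map<String, Integer> counts : countsPerInput ) {
            for ( final Map.Entry<String, Integer> entry : counts.entrySet() ) {
                if ( merged.putIfAbsent(entry.getKey(), entry.getValue()) != null ) {
                    throw new IllegalArgumentException("Duplicate evidence for sample " +
                            entry.getKey() + " at " + intervalKey);
                }
            }
        }
        return merged;
    }
}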


import htsjdk.samtools.SAMSequenceDictionary;
import org.broadinstitute.hellbender.utils.Utils;

import java.util.List;

-public class FeaturesHeader {
+public class SVFeaturesHeader {
Contributor:

Not sure why you needed to move and rename this, although I can see the class isn't used anywhere currently. Maybe it will make more sense as I go along.

Contributor Author:

It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.
But there's no particularly compelling reason to change it, and it can be reverted.

Contributor:

Ok I'm fine with it if it's only used for SV classes

public Set<String> getSampleNames() { return samples; }

private void setDictionaryAndSamples() {
dictionary = getMasterSequenceDictionary();
Contributor:

Why not use getBestAvailableSequenceDictionary() instead of defining special logic here? (Although you would need to modify it to look at the feature headers.)

Contributor Author:

That method pukes if there are multiple dictionaries -- even if they're equivalent. And trying to scrape an incomplete dictionary from the index compounded the problem.
This method allows multiple dictionaries if they're all subsets (with respect to name and order) of the largest dictionary.
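
In outline, the check is something like this (a standalone sketch, not the code as committed):

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.SAMSequenceRecord;

final class DictionaryCompatibility {
    // True if smallDict's contigs all appear in largeDict in the same relative order.
    static boolean isOrderedSubset( final SAMSequenceDictionary smallDict,
                                    final SAMSequenceDictionary largeDict ) {
        int lastIdx = -1;
        for ( final SAMSequenceRecord rec : smallDict.getSequences() ) {
            final int idx = largeDict.getSequenceIndex(rec.getSequenceName());
            if ( idx == -1 || idx <= lastIdx ) {
                return false; // contig missing from the larger dictionary, or out of order
            }
            lastIdx = idx;
        }
        return true;
    }
}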

Contributor:

I see, I think a comment for this function explaining that logic would be helpful. IMO, it's a little confusing to have multiple ways of handling multi-dictionary inputs within the engine. MultiVariantWalker seems to just use the first VCF's dictionary, which also seems arbitrary, and it contains some costly code to check the dictionaries for consistency.

Maybe we should provide a way for subclasses to choose what kind of dictionary check is performed? That way you have both options for both MultiVariantWalker and FeatureMergingWalker. Perhaps the engine folks should weigh in here though.

* To use this walker you need only implement the abstract apply method in a class that declares
* a list of FeatureInputs as an argument.
*/
public abstract class FeatureMergingWalker<F extends Feature> extends WalkerBase {
Contributor:

How about MultiFeatureWalker? Akin to MultiVariantWalker?

Contributor Author:

Are you referring to the name? Seems like a fine suggestion to rename it MultiFeatureWalker, now that I understand MultiVariantWalker better. However, I'd rather not get bogged down in refactoring the MultiVariantWalker for this PR unless you or the engine team think it's important to do so.

Contributor:

Yes here I was just suggesting a name change. I'll also leave it up to the engine team to decide whether they should actually be sharing code.

@droazen self-requested a review March 2, 2022 19:53
@droazen self-assigned this Mar 2, 2022
@tedsharpe (Contributor Author):

I think this is ready for another PR review.

@gatk-bot commented Mar 10, 2022:

Travis reported job failures from build 38071
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 38071.13 logs
unit openjdk8 38071.3 logs

@mwalker174 (Contributor) left a comment:

Thanks @tedsharpe I think this is almost ready! For the FeatureOutputCodec, I think we should err on the side of caution here and do some extra checks to make sure inputs are as expected, even if there is some performance hit. I also have some suggestions on how to refactor a bit to make some of its functionality more re-usable for future tools.

Other than that, there are just some places that would benefit from some documentation, including the tool itself.

Comment on lines 17 to 18
Comparator<F> getSameLocusComparator();
void resolveSameLocusFeatures( PriorityQueue<F> queue, S sink );
Contributor:

Since they're public, can you make these methods safer to use in your implementations? I'd worry about bugs where someone (i.e. me) hands it features from different loci. Throw an error if that happens.

I do think that this functionality could be useful in future tools as well - being able to consume lists of feature files and merge them on the fly could come in handy. Can you move these to SVFeature instead? To be more general-purpose, consume a Collection<SVFeature> input and have an SVFeature output? Seems like most of the implementations are in the feature classes anyway.

It would be nice to have SVFeature or the evidence classes themselves implement comparable more generally, and enforce stable ordering in the output, even if there's a small performance hit.
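
Roughly what I have in mind for the locus check, as a sketch (not tied to the actual class layout in this draft):

import htsjdk.tribble.Feature;
import java.util.PriorityQueue;

final class SameLocusGuard {
    // Throw if any queued feature does not share the locus of the head element.
    static <F extends Feature> void checkSameLocus( final PriorityQueue<F> queue ) {
        final F first = queue.peek();
        if ( first == null ) {
            return;
        }
        for ( final F feature : queue ) {
            if ( !feature.getContig().equals(first.getContig()) ||
                    feature.getStart() != first.getStart() ) {
                throw new IllegalArgumentException("Features from different loci: " +
                        first.getContig() + ":" + first.getStart() + " vs. " +
                        feature.getContig() + ":" + feature.getStart());
            }
        }
    }
}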

Contributor Author:

Moved resolution of same-locus features into a general post-processing companion class (which would be available for other situations). One could hook these up as part of a streaming operation, I think, which would allow your suggested use case.
I really wish that features could implement Comparable. Alas, they require a dictionary to compare. This could be encapsulated in a Comparable, but then you'd have to pass the comparable around (instead of passing a dictionary around), and so you're no further ahead.
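
To make that concrete, the locus ordering has to be expressed as a Comparator that carries the dictionary, along these lines (an illustrative sketch only, not code from this PR):

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.tribble.Feature;
import java.util.Comparator;

// The dictionary isn't part of the Feature, so the ordering can't be a plain
// Comparable on the Feature itself -- it has to live in an external Comparator.
final class DictionaryOrderComparator<F extends Feature> implements Comparator<F> {
    private final SAMSequenceDictionary dictionary;

    DictionaryOrderComparator( final SAMSequenceDictionary dictionary ) {
        this.dictionary = dictionary;
    }

    @Override
    public int compare( final F f1, final F f2 ) {
        final int contigCmp = Integer.compare(dictionary.getSequenceIndex(f1.getContig()),
                                              dictionary.getSequenceIndex(f2.getContig()));
        return contigCmp != 0 ? contigCmp : Integer.compare(f1.getStart(), f2.getStart());
    }
}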

* </pre>
*
* @author Mark Walker &lt;markw@broadinstitute.org&gt;
*/
Contributor:

Going to need tool documentation here - you can base it off the old documentation. Make sure to enumerate the different constraints on each evidence type, and the expected behavior for special cases such as filling in "missing values" for depth evidence.
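
Something with the usual shape, e.g. the following. The wording and argument names are placeholders rather than the final doc:

/**
 * <p>Merges locus-sorted structural-variation evidence files (BAF, split-read,
 * discordant-pair, or depth evidence) from multiple samples into one sorted file.</p>
 *
 * <h3>Inputs</h3>
 * <ul><li>One or more locus-sorted evidence files of a single evidence type</li></ul>
 *
 * <h3>Output</h3>
 * <ul><li>A single merged, locus-sorted evidence file</li></ul>
 *
 * <h3>Usage example</h3>
 * <pre>
 *   gatk MergeSVEvidence -F sample1.baf.txt.gz -F sample2.baf.txt.gz -O merged.baf.txt.gz
 * </pre>
 */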

LocusDepth lastEvidence = queue.poll();
while ( !queue.isEmpty() ) {
final LocusDepth evidence = queue.poll();
if ( comparator.compare(lastEvidence, evidence) == 0 ) {
Contributor:

With this approach I would also worry about bugs where the order is not what we expect - i.e. queue is not built with the Feature's comparator. Can you make this throw an error if this is the case by checking queue.comparator() (and for the other features as well)?
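
Something like this, say (a sketch; the expected comparator would be whatever constant the sort-merger actually defines):

import java.util.Comparator;
import java.util.PriorityQueue;

final class QueueOrderingCheck {
    // Throw if the queue was not constructed with the comparator the merger expects.
    static <T> void requireComparator( final PriorityQueue<T> queue,
                                       final Comparator<T> expectedComparator ) {
        if ( !expectedComparator.equals(queue.comparator()) ) {
            throw new IllegalStateException("PriorityQueue was not built with the expected comparator");
        }
    }
}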


@VisibleForTesting
public final class LocusDepth implements SVFeature {
private final String contig;
private final int position;
private final String sample;
private final byte refCall; // index into nucleotideValues
Contributor:

It just occurred to me that we probably don't need to know the reference base for SV calling. Can always be looked up anyhow.

*/
public Set<String> getSampleNames() { return samples; }

private void setDictionaryAndSamples() {
Contributor:

Can you add some documentation with an overview of the logic here?

return largeDict;
}

public static final class MergingIterator<F extends Feature> implements Iterator<PQEntry<F>> {
Contributor:

Also a quick doc here


import java.util.ArrayList;
import java.util.List;

public final class FeatureOutputCodecFinder {
Contributor:

I think a quick comment here explaining why this is needed would be helpful

if ( count != MISSING_DATA ) {
if ( evCounts[idx] == MISSING_DATA ) {
evCounts[idx] = tmpCounts[idx];
Contributor:

Kind of neat that we can "patch" missing regions if needed. Make sure to document this here and in the tool doc.

Comment on lines 23 to 24
public final static Comparator<DiscordantPairEvidence> comparator =
Comparator.comparing(DiscordantPairEvidence::getSample);
Contributor:

This could always return 0? Unless there's an important reason to sort the output by sample name?

Contributor:

Also I think if you still need these comparators after implementing Comparable, they should be private and renamed to something like sameLocusComparator

@mwalker174 (Contributor):

Thank you @tedsharpe for addressing my comments, I just have two additional requests:

  1. The tool still needs a Doc comment header (e.g., see SelectVariants for an example) to generate documentation for the GATK Tool Index. It can be minimal, with just "Inputs", "Outputs", and "Usage examples" sections.
  2. It would be nice to cut down on repeated code in the EvidenceSortMerger classes - it seems like the methods are all nearly the same, with the exception of the Comparator definition and the inner loop code in resolveSameLocusFeatures(). Could you move the common code up to an abstract SVEvidenceSortMerger class?
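
Roughly the shape I'm picturing for point 2 (a skeleton only; the names and generics are guesses at how the current classes are laid out):

import htsjdk.tribble.Feature;
import java.util.Comparator;
import java.util.PriorityQueue;

abstract class AbstractEvidenceSortMerger<F extends Feature> {
    private final PriorityQueue<F> sameLocusFeatures;

    protected AbstractEvidenceSortMerger( final Comparator<F> sameLocusComparator ) {
        this.sameLocusFeatures = new PriorityQueue<>(sameLocusComparator);
    }

    // Common machinery: buffer features, and flush whenever the locus changes.
    public final void add( final F feature ) {
        final F head = sameLocusFeatures.peek();
        if ( head != null && (!head.getContig().equals(feature.getContig()) ||
                head.getStart() != feature.getStart()) ) {
            resolveSameLocusFeatures(sameLocusFeatures);
            sameLocusFeatures.clear();
        }
        sameLocusFeatures.add(feature);
    }

    public final void close() {
        if ( !sameLocusFeatures.isEmpty() ) {
            resolveSameLocusFeatures(sameLocusFeatures);
            sameLocusFeatures.clear();
        }
    }

    // Subclass-specific: how records sharing a locus get combined or emitted.
    protected abstract void resolveSameLocusFeatures( PriorityQueue<F> features );
}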

@mwalker174 (Contributor) left a comment:

Thank you! Looks good to me. @droazen did you want to review as well?

@droazen (Contributor) left a comment:

@tedsharpe I've posted a review on the changes to engine-level classes -- a couple of requests but nothing major.

* Operations performed just prior to the start of traversal.
*/
@Override
public void onTraversalStart() {
Contributor:

onTraversalStart() should generally be reserved for tool authors to override to perform whatever initialization their tool needs. Initialization for Walker classes themselves should be done by overriding onStartup(), making it final, and calling super.onStartup(); as the first line, as seen in, e.g., FeatureWalker.

One of these days we really need to rename onStartup() to something like initializeTraversal() or initializeEngine() to make this distinction clearer.
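
i.e., the pattern in MultiFeatureWalker would look roughly like this (a fragment of the class under review, sketched from memory rather than the actual diff):

public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase {

    // Walker-level initialization goes in a final onStartup() override...
    @Override
    protected final void onStartup() {
        super.onStartup();          // must be the first call
        setDictionaryAndSamples();  // walker-specific setup
    }

    // ...leaving onTraversalStart() free for tool authors to override.
}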

* To use this walker you need only implement the abstract apply method in a class that declares
* a collection of FeatureInputs as an argument.
*/
public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase {
Contributor:

Can you add an ExampleMultiFeatureWalker in org.broadinstitute.hellbender.tools.examples + an ExampleMultiFeatureWalkerIntegrationTest, modeled after the existing ExampleFeatureWalker and ExampleFeatureWalkerIntegrationTest? We generally try to do this for each new traversal added to GATK both as examples for tool authors and to catch regressions in the Walker classes.

* multiple sources of Features. The input files for each feature must be sorted by locus.
*
* To use this walker you need only implement the abstract apply method in a class that declares
* a collection of FeatureInputs as an argument.
Contributor:

If you take my suggestion below to override onStartup() instead of onTraversalStart(), add to the docs here:

and may optionally implement {@link #onTraversalStart()}, {@link #onTraversalSuccess()}, and/or {@link #closeTool()}.

which is the usual pattern for Walker classes

* @param feature Current Feature being processed.
* @param header Header object for the source from which the feature was drawn (may be null)
*/
public abstract void apply( final F feature, final Object header );
Contributor:

Is there any potential use case for having overlapping reference bases and/or reads as a side input here? If that's outside the scope of this traversal then that's fine, but if it could be potentially useful for future tools consider adding a ReferenceContext and/or ReadsContext to apply(), as seen in FeatureWalker and most of the other Walker classes.

Contributor Author:

Add ReadsContext and ReferenceContext to apply method? OK, I've done so.
[Editorializing of the worst sort: I've done so because consistency seems more important right now than design, but it bugs me. It seems crazy to pass dummy objects that might not even be useful (when unbacked) when we don't know whether we even need them. Seems to me that it would've been much smarter to provide a walker callback method that the apply methods could use if they need this info. Please put it on your list of design annoyances that we'll never have time to fix. Or maybe there's a good justification that has simply escaped me.]

Contributor:

@tedsharpe GATK has always been structured around universal arguments like -R and -I that all tools accept, and that (when provided) automatically populate contextual objects passed in to the tools. The main justification for this design is to not require any extra work for the tool authors when overlapping data from other sources is required, and to have the most common kinds of inputs "wired up" in advance and available for the tool to use as needed. Tools can call the hasReference(), etc., methods from GATKTool, or the hasBackingDataSource() methods on the context objects themselves if they need to query whether a particular kind of contextual data is available.
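
For example, an apply() implementation can guard its use of the side inputs along these lines (a sketch against the apply signature in this PR, not code from the engine itself):

@Override
public void apply( final F feature, final Object header,
                   final ReadsContext readsContext, final ReferenceContext referenceContext ) {
    // The context objects are always non-null; whether they are backed by real data
    // depends on whether -R / -I were supplied on the command line.
    if ( referenceContext.hasBackingDataSource() ) {
        final byte[] refBases = referenceContext.getBases();  // overlapping reference bases
        // ...
    }
    if ( readsContext.hasBackingDataSource() ) {
        for ( final GATKRead read : readsContext ) {          // overlapping reads
            // ...
        }
    }
}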

Contributor Author:

Yeah. I get that. IMHO, it's not really extra work to get the objects you need when you actually need them, and that's better than preparing dummy objects that you might not even want or use that get passed to you every time. I know that's how it's always worked, but it seems silly to me. I revised the code to do things the standard way.

Contributor:

It's a little more discoverable / self-documenting this way -- a reminder that the engine can provide you with all these side inputs if you need them. Plus there is no overhead to having these contextual objects around: they are all implemented in terms of lazy queries that don't happen until you access them.

Contributor Author:

Nope. Comes from the Ministry of Silly Walkers. That's my story, and I'm stickin' to it.

Contributor:

Hah, well, it may not convince you, but here's one last defense: the context objects are all initialized with the interval of the primary record -- since they are "tied" to the primary record in that sense, it makes sense that they should be passed in during the traversal along with the primary record.

}
}

private SAMSequenceDictionary bestDictionary( final SAMSequenceDictionary newDict,
Contributor:

Add a method comment documenting the approach taken for selecting the "best" dictionary

Contributor Author:

Approach for "best" dictionary was documented in setDictionaryAndSamples, but I've rejiggered it to make it less easily missed. (The bestDictionary method is now a static betterDictionary method, which is what was actually going on.)

@@ -473,8 +473,8 @@ final void printCacheStats() {
public SAMSequenceDictionary getSequenceDictionary() {
SAMSequenceDictionary dict = null;
final Object header = getHeader();
-if ( header instanceof FeaturesHeader ) {
-    dict = ((FeaturesHeader)header).getDictionary();
+if ( header instanceof SVFeaturesHeader ) {
Contributor:

What's the rationale behind the FeaturesHeader -> SVFeaturesHeader migration?

Contributor Author:

Why FeaturesHeader --> SVFeaturesHeader? It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.

import java.util.ArrayList;
import java.util.List;

public class MultiFeatureWalkerUnitTest extends CommandLineProgramTest {
Contributor:

Ah, I think this can become the ExampleMultiFeatureWalkerIntegrationTest I requested above if you promote your DummyMultiFeatureWalker into an official ExampleMultiFeatureWalker in org.broadinstitute.hellbender.tools.examples

"-" + StandardArgumentDefinitions.INPUT_SHORT_NAME, largeFileTestDir + "NA12878.alignedHg38.duplicateMarked.baseRealigned.bam",
// no dictionary, no sample names, with a single feature
"-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt"
};
Contributor:

Consider using ArgumentsBuilder to build the command lines for your test cases.
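
e.g., roughly like this inside the integration test class (a sketch only; the exact flag handling and helper names may need adjusting to the test's actual arguments):

@Test
public void testSingleFeatureFile() {
    final ArgumentsBuilder args = new ArgumentsBuilder();
    args.addInput(new File(largeFileTestDir, "NA12878.alignedHg38.duplicateMarked.baseRealigned.bam"));
    args.add(StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt");
    runCommandLine(args);
}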

// no dictionary, no sample names, with a single feature
"-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt"
};
dummy.instanceMain(args);
Contributor:

Use runCommandLine(args) after this becomes ExampleMultiFeatureWalkerIntegrationTest

lastStart = feature.getStart();
}
}
}
Contributor:

Need at least one test case that uses -L <intervals>

@tedsharpe (Contributor Author):

Remainder of review comments addressed directly with no guff.

@gatk-bot commented Apr 4, 2022:

Travis reported job failures from build 38460
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 38460.1 logs
cloud openjdk11 38460.14 logs
unit openjdk11 38460.13 logs

@tedsharpe (Contributor Author):

OK. Passes checks. @droazen One final look? I couldn't figure out how to use runCommandLine, because I needed access to my walker instance. Is this OK, or is there a better way to do it?

@tedsharpe (Contributor Author):

@droazen Can I get a thumbs up on this if I've addressed your suggestions adequately, or further input if not? Thanks.

@droazen (Contributor) commented Apr 6, 2022:

@tedsharpe I'll take a look now!

@droazen (Contributor) left a comment:

@tedsharpe Engine changes look good now -- a few last easy comments, then go ahead and merge once they're addressed and tests pass (assuming of course that the changes in the sv package, which I didn't look at, have been fully reviewed)

}
if ( newIdx <= lastIdx ) {
throw new UserException("Contig " + rec.getContig() +
" not in same order as in larger dictionary");
Contributor:

Since these are exceptions that actual users are likely to trigger, it would be helpful to include the contents of both dictionaries in the error messages. You can do this by throwing a UserException.IncompatibleSequenceDictionaries and passing the two dictionaries in.
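
i.e., roughly the following, assuming the usual message-plus-two-named-dictionaries form of that exception (the dictionary labels are placeholders):

// Include both dictionaries in the error so the user can see the actual mismatch.
throw new UserException.IncompatibleSequenceDictionaries(
        "Contig " + rec.getContig() + " not in same order as in larger dictionary",
        "feature input dictionary", newDict,
        "larger dictionary", largeDict);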

public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase {

SAMSequenceDictionary dictionary;
final Set<String> samples = new TreeSet<>();
Contributor:

Since there are accessor methods for these, can they be private? And relatedly, should the Set of samples be wrapped in a Collections.unmodifiableSet() after initialization to prevent downstream modification?
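
e.g., something like this (a sketch of the fields and the existing accessor, not the actual diff):

public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase {

    private SAMSequenceDictionary dictionary;
    private final Set<String> samples = new TreeSet<>();

    // Hand callers a read-only view so downstream code can't mutate the sample set.
    public Set<String> getSampleNames() { return Collections.unmodifiableSet(samples); }
}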

* internal state as possible.
*
* @param feature Current Feature being processed.
* @param header Header object for the source from which the feature was drawn (may be null)
Contributor:

Add javadoc for the all-important ReadsContext and ReferenceContext args

final ReadsContext readsContext,
final ReferenceContext referenceContext ) {
// We'll just keep track of the Features we see, in the order that we see them.
features.add(feature);
Contributor:

This is a little inconsistent with the rest of the example walkers, which all produce some diagnostic output showing the records that are processed by the traversal. Could you have this tool produce some textual output along the lines of ExampleFeatureWalker in addition to accumulating the Features in memory?

@tedsharpe merged commit dc35e76 into master Apr 7, 2022
@tedsharpe deleted the tws_MergeSVEvidence branch April 7, 2022 20:13
4 participants