Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. #8049

samuelklee · 2022-10-11T16:52:57Z

Ticks off a few straggler issues noted in #7724.

@meganshand mind reviewing? It should hopefully be quick, and we can get it in before @droazen cuts the next release. Note that this shouldn't change behavior in the Ultima pipeline, since the default is still the start-position resource-matching strategy inherited from VQSR, but we might want to explore the effect of choosing another strategy there.


samuelklee · 2022-10-11T16:54:24Z

...g/broadinstitute/hellbender/tools/walkers/vqsr/scalable/LabeledVariantAnnotationsWalker.java

@@ -46,7 +48,7 @@
 * walker, performing the operations:
 *
 *   - nthPassApply(n = 0)
- *      - if variant/alleles pass filters and variant-type/overlapping-resource checks, then:
+ *      - if variant/alleles pass filters and variant-type/resource-match checks, then:


Note that I just changed "overlapping" to "resource-matching" everywhere to distinguish what we're doing here from straightforward genomic overlapping.

samuelklee · 2022-10-11T16:55:11Z

...g/broadinstitute/hellbender/tools/walkers/vqsr/scalable/LabeledVariantAnnotationsWalker.java

@@ -283,7 +304,9 @@ VCFHeader constructVCFHeader(final List<String> sortedLabels) {
                .collect(Collectors.toCollection(TreeSet::new));
        hInfo.add(GATKVCFHeaderLines.getFilterLine(VCFConstants.PASSES_FILTERS_v4));
        final SAMSequenceDictionary sequenceDictionary = getBestAvailableSequenceDictionary();
-        hInfo = VcfUtils.updateHeaderContigLines(hInfo, null, sequenceDictionary, true);
+        if (sequenceDictionary != null) {


This avoids a complaint raised if the VCF is missing contig lines in the header.
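
A minimal sketch of this kind of guard, under stated assumptions: `withContigLines` is a hypothetical stand-in for the header-update utility, and the sequence dictionary is modeled as a plain list of contig names. The remainder of the real `if` block is not shown in the hunk and is not reproduced here.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class HeaderContigSketch {

    /**
     * Hypothetical stand-in for a header-update utility: returns a copy of the
     * header lines with one contig line appended per dictionary entry.
     * If no dictionary is available (null), the header lines are returned
     * unchanged instead of attempting the update.
     */
    static Set<String> withContigLines(final Set<String> headerLines, final List<String> contigNames) {
        final Set<String> updated = new LinkedHashSet<>(headerLines);
        if (contigNames != null) { // the guard: only touch contig lines when a dictionary exists
            for (final String name : contigNames) {
                updated.add("##contig=<ID=" + name + ">");
            }
        }
        return updated;
    }
}
```

Calling `withContigLines(lines, null)` simply returns a copy of `lines`, which mirrors the intent of skipping the update when no dictionary can be found.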

samuelklee · 2022-10-11T16:56:43Z

...g/broadinstitute/hellbender/tools/walkers/vqsr/scalable/LabeledVariantAnnotationsWalker.java

+            switch (resourceMatchingStrategy) {
+                case START_POSITION:
+                    return true;
+                case START_POSITION_AND_GIVEN_REPRESENTATION:


Not sure if these strategies add any noticeable overhead if there are a lot of multiallelics, but I haven't noticed anything out of the ordinary on my runs so far.
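
For readers following along, the three strategies can be illustrated with a small self-contained sketch. Apart from the enum constant names, everything here is an assumption: `SimpleVariant`, `trim`, and this version of `isMatchingVariant` are hypothetical stand-ins that handle only biallelic records. This is not the GATK implementation, which relies on existing library methods for the matching.

```java
final class ResourceMatchingSketch {

    enum ResourceMatchingStrategy {
        START_POSITION,
        START_POSITION_AND_GIVEN_REPRESENTATION,
        START_POSITION_AND_MINIMAL_REPRESENTATION
    }

    /** Hypothetical stand-in for a biallelic variant record (not a GATK/htsjdk class). */
    static final class SimpleVariant {
        final String contig;
        final int start;
        final String ref;
        final String alt;

        SimpleVariant(final String contig, final int start, final String ref, final String alt) {
            this.contig = contig;
            this.start = start;
            this.ref = ref;
            this.alt = alt;
        }

        boolean sameAlleles(final SimpleVariant other) {
            return ref.equals(other.ref) && alt.equals(other.alt);
        }
    }

    /** Trims shared trailing bases, then shared leading bases, keeping at least one base per allele. */
    static SimpleVariant trim(final SimpleVariant v) {
        String ref = v.ref;
        String alt = v.alt;
        int start = v.start;
        while (ref.length() > 1 && alt.length() > 1
                && ref.charAt(ref.length() - 1) == alt.charAt(alt.length() - 1)) {
            ref = ref.substring(0, ref.length() - 1);
            alt = alt.substring(0, alt.length() - 1);
        }
        while (ref.length() > 1 && alt.length() > 1 && ref.charAt(0) == alt.charAt(0)) {
            ref = ref.substring(1);
            alt = alt.substring(1);
            start++; // dropping a leading base shifts the start position
        }
        return new SimpleVariant(v.contig, start, ref, alt);
    }

    static boolean isMatchingVariant(final SimpleVariant variant,
                                     final SimpleVariant resource,
                                     final ResourceMatchingStrategy strategy) {
        // Every strategy requires the same contig and start position.
        if (!variant.contig.equals(resource.contig) || variant.start != resource.start) {
            return false;
        }
        switch (strategy) {
            case START_POSITION:
                return true;
            case START_POSITION_AND_GIVEN_REPRESENTATION:
                return variant.sameAlleles(resource);
            case START_POSITION_AND_MINIMAL_REPRESENTATION: {
                final SimpleVariant a = trim(variant);
                final SimpleVariant b = trim(resource);
                return a.start == b.start && a.sameAlleles(b);
            }
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }
}
```

With a query `chr1:100 AT>A` and a resource record `chr1:100 ATT>AT` (the same deletion padded with an extra base, as can happen at a multiallelic site), this sketch matches under `START_POSITION` and `START_POSITION_AND_MINIMAL_REPRESENTATION` but not under `START_POSITION_AND_GIVEN_REPRESENTATION`.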

meganshand

It would be nice if you could add some tests for this (even a unit test of isMatchingVariant), but you can do that in a separate PR if you're in a hurry to get this in now.

Even START_POSITION_AND_MINIMAL_REPRESENTATION is not as thorough as RTG VCFEval, right? That does a more complete adjustment to resolve different representations? It would be nice to use their code somehow if it makes a big enough difference. Maybe that's easy enough to achieve through pipelining. Sorry, these are musings, not requests. 👍

samuelklee · 2022-10-11T17:51:29Z

Thanks for the lightning-quick review! And good thought to add tests. Adding them to the current suite of exact-match tests should probably suffice, since we rely on existing/library methods (which presumably already have correctness tests) to do the matching. I'll add it to the straggler issues 😆

And yes, I don't think the matching here is as sophisticated as that done by VCFEval, but it's probably good enough for the purposes of identifying training variants. I am actually curious how often the start-position strategy gets us into trouble (e.g., we hit an artifact at a multiallelic site, or annotations at multiallelic sites are somehow distributed differently even if all alleles are real).

…eous minor fixes to new annotation-based filtering tools. (#8049)

* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)
* Adds wdl that tests joint VCF filtering tools (#7932)
* adding filtering wdl
* renaming pipeline
* addressing comments
* added bash
* renaming json
* adding glob to extract for extra files
* changing dollar signs
* small comments
* Added changes for specifying model backend and other tweaks to WDLs and environment.
* Added classes for representing a collection of labeled variant annotations.
* Added interfaces for modeling and scoring backends.
* Added a new suite of tools for variant filtering based on site-level annotations.
* Added integration tests.
* Added test resources and expected results.
* Miscellaneous changes.
* Removed non-ASCII characters.
* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>

* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049)
* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)
* Small changes to JointVCFFiltering WDL
* making default for use_allele_specific_annotations
* addressing comments
* first stab
* wire through WDL changes
* fixed typo
* set model_backend input value
* add gatk_override to JointVcfFiltering call
* typo in indel_annotations
* make model_backend optional
* tabs and spaces
* make all model_backends optional
* use gatk 4.3.0
* no point in changing the table names as this is a POC
* adding new branch to dockstore
* adding in branching logic for classic VQSR vs VQSR-Lite
* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files
* passing classic flag to indel run of CreateFilteringFiles
* Update GvsCreateFilterSet.wdl cleaning up verbiage
* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)
* cleanup up sloppy comment

---------

Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Rebecca Asch <rasch@broadinstitute.org>

Added toggle for selecting resource-matching strategies and miscellan…

7453d04

…eous minor fixes to new annotation-based filtering tools.

samuelklee commented Oct 11, 2022


meganshand approved these changes Oct 11, 2022


samuelklee mentioned this pull request Oct 11, 2022

New tools for annotation-based filtering. #7724

Open

samuelklee merged commit fd78250 into master Oct 11, 2022

samuelklee deleted the sl_lite_overlap branch October 11, 2022 18:39

rsasch pushed a commit that referenced this pull request Oct 17, 2022

Added toggle for selecting resource-matching strategies and miscellan…

78fdd8d

…eous minor fixes to new annotation-based filtering tools. (#8049)

samuelklee mentioned this pull request Feb 2, 2023

Performed a round of ablation on new annotation-based filtering tools. #8131

Merged
