Commit b3d6fec
Added documentation for TrainVariantAnnotationsModel and addressed review comments.
samuelklee committed Aug 8, 2022
1 parent c507f96 commit b3d6fec
Showing 153 changed files with 430 additions and 292 deletions.
ExtractVariantAnnotations.java
@@ -34,19 +34,19 @@
* This tool is intended to be used as the first step in a variant-filtering workflow that supersedes the
* {@link VariantRecalibrator} workflow. This tool extracts site-level annotations, labels, and other relevant metadata
* from variant sites (or alleles, in allele-specific mode) that are or are not present in specified labeled
* resource VCFs (e.g., training or calibration VCFs). The former, present sites are considered labeled; each site
* can have multiple labels. The latter sites are considered unlabeled and can be randomly downsampled using
* reservoir sampling; extraction of these is optional. The outputs of the tool are HDF5 files containing the
* extracted data for labeled and (optional) unlabeled variant sets, as well as a sites-only indexed VCF containing
* the labeled variants.
* resource VCFs (e.g., training or calibration VCFs). Input sites that are present in the resources are considered
* labeled; each site can have multiple labels if it is present in multiple resources. Other input sites that are
* not present in any resources are considered unlabeled and can be randomly sampled using reservoir sampling;
* extraction of these is optional. The outputs of the tool are HDF5 files containing the extracted data for
* labeled and (optional) unlabeled variant sets, as well as a sites-only indexed VCF containing the labeled variants.
* </p>
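A minimal, self-contained sketch of the reservoir sampling (Algorithm R) referred to above, for readers unfamiliar with the technique; the class and method names are made up for illustration and this is not the tool's actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch of reservoir sampling (Algorithm R), not GATK code:
// after n items have been seen, each one is retained with probability maxSize / n.
final class ReservoirSamplingSketch<T> {
    private final List<T> reservoir;
    private final int maxSize;
    private final Random rng;
    private long numSeen = 0;

    ReservoirSamplingSketch(final int maxSize, final long seed) {
        this.maxSize = maxSize;
        this.reservoir = new ArrayList<>(maxSize);
        this.rng = new Random(seed);
    }

    void add(final T item) {
        numSeen++;
        if (reservoir.size() < maxSize) {
            reservoir.add(item);                                 // fill the reservoir first
        } else {
            final long j = (long) (rng.nextDouble() * numSeen);  // uniform index in [0, numSeen)
            if (j < maxSize) {
                reservoir.set((int) j, item);                    // replace a random slot
            }
        }
    }

    List<T> getSample() {
        return reservoir;
    }
}

A single pass over the unlabeled sites therefore yields a uniform random sample of at most maxSize of them without holding all of them in memory.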
*
* <p>
* The extracted sets can be provided as input to the {@link TrainVariantAnnotationsModel} tool
* to produce an annotation-based model for scoring variant calls. This model can in turn be provided
* along with a VCF file to the {@link ScoreVariantAnnotations} tool, which assigns a score to each call
* (with a lower score indicating that a call is more likely to be an artifact and should perhaps be filtered).
* Each score can also be converted to a corresponding sensitivity to a calibration set, if the latter is available.
* Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.
* </p>
*
* <p>
@@ -87,7 +87,7 @@
* provided resources.
* </li>
* <li>
* (Optional) Maximum number of unlabeled variants (or alleles) to randomly sample with reservoir downsampling.
* (Optional) Maximum number of unlabeled variants (or alleles) to randomly sample with reservoir sampling.
* If nonzero, annotations will also be extracted from unlabeled sites (i.e., those that are not present
* in the labeled resources).
* </li>
@@ -160,7 +160,8 @@
* Extract annotations from training/calibration SNP/INDEL sites, producing the outputs
* 1) {@code extract.annot.hdf5}, 2) {@code extract.vcf.gz}, and 3) {@code extract.vcf.gz.tbi}.
* The HDF5 file can then be provided to {@link TrainVariantAnnotationsModel}
* to train a model using a positive-only approach.
* to train a model using a positive-only approach. Note that the {@value MODE_LONG_NAME} arguments are made
* explicit here, although both SNP and INDEL modes are selected by default.
*
* <pre>
* gatk ExtractVariantAnnotations \
@@ -184,6 +185,8 @@
* 1) {@code extract.annot.hdf5}, 2) {@code extract.unlabeled.annot.hdf5}, 3) {@code extract.vcf.gz},
* and 4) {@code extract.vcf.gz.tbi}. The HDF5 files can then be provided to {@link TrainVariantAnnotationsModel}
* to train a model using a positive-negative approach (similar to that used in {@link VariantRecalibrator}).
* Note that the {@value MODE_LONG_NAME} arguments are made explicit here, although both SNP and INDEL modes are
* selected by default.
*
* <pre>
* gatk ExtractVariantAnnotations \
@@ -207,7 +210,8 @@
* unlabeled sites, producing the outputs 1) {@code extract.unlabeled.annot.hdf5},
* 2) {@code extract.vcf.gz} (which will contain no records), and 3) {@code extract.vcf.gz.tbi}.
* This random sample cannot be used by {@link TrainVariantAnnotationsModel}, but may still be useful for
* exploratory analyses.
* exploratory analyses. Note that the {@value MODE_LONG_NAME} arguments are made explicit here, although both
* SNP and INDEL modes are selected by default.
*
* <pre>
* gatk ExtractVariantAnnotations \
@@ -244,7 +248,10 @@ public final class ExtractVariantAnnotations extends LabeledVariantAnnotationsWa
fullName = MAXIMUM_NUMBER_OF_UNLABELED_VARIANTS_LONG_NAME,
doc = "Maximum number of unlabeled variants to extract. " +
"If greater than zero, reservoir sampling will be used to randomly sample this number " +
"of sites from input sites that are not present in the specified resources.",
"of sites from input sites that are not present in the specified resources. " +
"Choice of this number should be guided by considerations for training the negative model in " +
"TrainVariantAnnotationsModel; users may wish to choose a number that is comparable to the " +
"expected size of the labeled training set or that is compatible with available memory resources.",
minValue = 0)
private int maximumNumberOfUnlabeledVariants = 0;

@@ -266,16 +273,16 @@ public void afterOnTraversalStart() {
}
if (!resourceLabels.contains(LabeledVariantAnnotationsData.CALIBRATION_LABEL)) {
logger.warn("No calibration set found! If you are using the downstream TrainVariantAnnotationsModel and ScoreVariantAnnotations tools " +
"and wish to convert scores to sensitivity to a calibration set of variants, " +
"and wish to convert scores to sensitivity with respect to a calibration set of variants, " +
"provide sets of known polymorphic loci marked with the calibration=true feature input tag. " +
"For example, --resource:hapmap,calibration=true hapmap.vcf");
}

rng = RandomGeneratorFactory.createRandomGenerator(new Random(reservoirSamplingRandomSeed));
unlabeledDataReservoir = maximumNumberOfUnlabeledVariants == 0
? null
: new LabeledVariantAnnotationsData(annotationNames, resourceLabels, useASAnnotations, maximumNumberOfUnlabeledVariants);
}
: new LabeledVariantAnnotationsData(annotationNames, resourceLabels, useASAnnotations, maximumNumberOfUnlabeledVariants); // we pass resourceLabels here so that both labeled and unlabeled
} // HDF5 files will have the same directory structure

@Override
protected void nthPassApply(final VariantContext variant,
@@ -345,7 +352,8 @@ private static void setExtractedVariantInData(final LabeledVariantAnnotationsDat
private void writeUnlabeledAnnotationsToHDF5() {
final File outputUnlabeledAnnotationsFile = new File(outputPrefix + UNLABELED_TAG + ANNOTATIONS_HDF5_SUFFIX);
if (unlabeledDataReservoir.size() == 0) {
throw new GATKException("No unlabeled variants were present in the input VCF.");
throw new GATKException(String.format("No unlabeled variants were present in the input VCF. " +
"Consider setting the %s argument to 0.", MAXIMUM_NUMBER_OF_UNLABELED_VARIANTS_LONG_NAME));
}
for (final VariantType variantType : variantTypesToExtract) {
logger.info(String.format("Extracted unlabeled annotations for %d variants of type %s.",
LabeledVariantAnnotationsWalker.java
@@ -13,8 +13,6 @@
import org.apache.commons.collections4.ListUtils;
import org.apache.commons.lang3.tuple.Triple;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.engine.FeatureContext;
import org.broadinstitute.hellbender.engine.FeatureInput;
@@ -28,7 +26,6 @@
import org.broadinstitute.hellbender.utils.variant.GATKVCFHeaderLines;
import org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils;
import org.broadinstitute.hellbender.utils.variant.VcfUtils;
import picard.cmdline.programgroups.VariantFilteringProgramGroup;

import java.io.File;
import java.util.ArrayList;
@@ -83,15 +80,8 @@
* This results in the following output:
*
* - an HDF5 file, as above
* - a VCF file, containing the input variants, with labels and scores appended for those passing variant-type checks TODO + calibration-sensitivity scores + filters applied?
* - a VCF file, containing the input variants, with labels, scores, and filters appended/applied for those passing variant-type checks
*/
@CommandLineProgramProperties(
// TODO
summary = "",
oneLineSummary = "",
programGroup = VariantFilteringProgramGroup.class
)
@DocumentedFeature
public abstract class LabeledVariantAnnotationsWalker extends MultiplePassVariantWalker {

public static final String MODE_LONG_NAME = "mode";
@@ -204,7 +194,7 @@ public void onTraversalStart() {
resourceLabels.addAll(trackResourceLabels);
logger.info( String.format("Found %s track: labels = %s", resource.getName(), trackResourceLabels));
}
resourceLabels.forEach(String::intern);
resourceLabels.forEach(String::intern); // TODO evaluate if this affects memory usage and remove if not needed

if (resourceLabels.contains(LabeledVariantAnnotationsData.SNP_LABEL)) {
throw new UserException.BadInput(String.format("The resource label \"%s\" is reserved for labeling variant types.",
@@ -233,7 +223,6 @@ public Object onTraversalSuccess() {
return null;
}

// TODO maybe clean up all this Triple and metadata business with a class?
static void addExtractedVariantToData(final LabeledVariantAnnotationsData data,
final VariantContext variant,
final List<Triple<List<Allele>, VariantType, TreeSet<String>>> metadata) {
@@ -289,13 +278,13 @@ void writeExtractedVariantToVCF(final VariantContext vc,
// modified from VQSR code
// TODO we're just writing a standard sites-only VCF here, maybe there's a nicer way to do this?
VCFHeader constructVCFHeader(final List<String> sortedLabels) {
Set<VCFHeaderLine> hInfo = getDefaultToolVCFHeaderLines();
hInfo.addAll(sortedLabels.stream()
Set<VCFHeaderLine> hInfo = sortedLabels.stream()
.map(l -> new VCFInfoHeaderLine(l, 1, VCFHeaderLineType.Flag, String.format(RESOURCE_LABEL_INFO_HEADER_LINE_FORMAT_STRING, l)))
.collect(Collectors.toList()));
.collect(Collectors.toCollection(TreeSet::new));
hInfo.add(GATKVCFHeaderLines.getFilterLine(VCFConstants.PASSES_FILTERS_v4));
final SAMSequenceDictionary sequenceDictionary = getBestAvailableSequenceDictionary();
hInfo = VcfUtils.updateHeaderContigLines(hInfo, null, sequenceDictionary, true);
hInfo.addAll(getDefaultToolVCFHeaderLines());
return new VCFHeader(hInfo);
}
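For readers unfamiliar with the htsjdk calls in the hunk above, here is a standalone sketch of the same pattern of turning resource labels into Flag INFO header lines and assembling a header; the labels and description text are made up and this is not GATK code:

import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFHeaderLine;
import htsjdk.variant.vcf.VCFHeaderLineType;
import htsjdk.variant.vcf.VCFInfoHeaderLine;

// Standalone illustration: one Flag INFO header line per resource label,
// collected into a TreeSet so the lines come out in sorted order.
public final class LabelHeaderSketch {
    public static void main(final String[] args) {
        final Set<VCFHeaderLine> headerLines = Stream.of("training", "calibration")  // made-up labels
                .map(l -> new VCFInfoHeaderLine(l, 1, VCFHeaderLineType.Flag,
                        String.format("This site was labeled as %s according to resources", l)))
                .collect(Collectors.toCollection(TreeSet::new));
        final VCFHeader header = new VCFHeader(headerLines);
        header.getMetaDataInSortedOrder().forEach(System.out::println);
    }
}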

@@ -328,8 +317,8 @@ final List<Triple<List<Allele>, VariantType, TreeSet<String>>> extractVariantMet
// corresponding to alt alleles that pass variant-type and overlapping-resource checks
return vc.getAlternateAlleles().stream()
.filter(a -> !GATKVCFConstants.isSpanningDeletion(a))
.filter(a -> variantTypesToExtract.contains(VariantType.getVariantType(vc, a)))
.map(a -> Triple.of(Collections.singletonList(a), VariantType.getVariantType(vc, a),
.filter(a -> variantTypesToExtract.contains(VariantType.getAlleleSpecificVariantType(vc, a)))
.map(a -> Triple.of(Collections.singletonList(a), VariantType.getAlleleSpecificVariantType(vc, a),
findOverlappingResourceLabels(vc, vc.getReference(), a, featureContext)))
.filter(t -> isExtractUnlabeled || !t.getRight().isEmpty())
.collect(Collectors.toList());
ScoreVariantAnnotations.java
@@ -21,7 +21,6 @@
import org.broadinstitute.hellbender.exceptions.GATKException;
import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils;
import org.broadinstitute.hellbender.tools.copynumber.utils.HDF5Utils;
import org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator;
import org.broadinstitute.hellbender.tools.walkers.vqsr.scalable.data.LabeledVariantAnnotationsData;
import org.broadinstitute.hellbender.tools.walkers.vqsr.scalable.data.VariantType;
@@ -56,15 +55,15 @@
* This tool is intended to be used as the last step in a variant-filtering workflow that supersedes the
* {@link VariantRecalibrator} workflow. Using a previously trained model produced by {@link TrainVariantAnnotationsModel},
* this tool assigns a score to each call (with a lower score indicating that a call is more likely to be an artifact).
* Each score can also be converted to a corresponding sensitivity to a calibration set, if the latter is available.
* Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.
* Each VCF record can also be annotated with additional resource labels and/or hard filtered based on its
* calibration-set sensitivity, if desired.
* </p>
*
* <p>
* Note that annotations and metadata are collected in memory during traversal until they are written to HDF5 files
* upon completion of the traversal. Memory requirements thus roughly scale linearly with both the number of sites
* scored and the number of annotations. For large callsets, this tool may be run in parallel over separate
* upon completion of the traversal. Memory and disk requirements thus roughly scale linearly with both the number
* of sites scored and the number of annotations. For large callsets, this tool may be run in parallel over separate
* genomic shards using the {@value StandardArgumentDefinitions#INTERVALS_LONG_NAME} argument as usual.
* </p>
*
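As a very rough, illustrative back-of-the-envelope calculation of that linear scaling (assuming annotation values are held as 8-byte doubles; the site and annotation counts are made up, and real usage will be higher because of per-site metadata, JVM overhead, and HDF5 buffering):

// Rough, illustrative estimate of the raw annotation-matrix footprint only.
public final class AnnotationFootprintEstimate {
    public static void main(final String[] args) {
        final long numSites = 50_000_000L;   // hypothetical number of sites to score
        final long numAnnotations = 6L;      // hypothetical number of annotations per site
        final long bytes = numSites * numAnnotations * Double.BYTES;
        System.out.printf("~%.1f GB for raw annotation values alone%n", bytes / 1e9);
    }
}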
@@ -180,6 +179,8 @@
* produced by {@link ExtractVariantAnnotations}. Records will be filtered according to the values provided to the
* {@value SNP_CALIBRATION_SENSITIVITY_THRESHOLD_LONG_NAME} and {@value INDEL_CALIBRATION_SENSITIVITY_THRESHOLD_LONG_NAME}
* arguments; the values below are only meant to be illustrative and should be set as appropriate for a given analysis.
* Note that the {@value MODE_LONG_NAME} arguments are made explicit here, although both SNP and INDEL modes are
* selected by default.
*
* <pre>
* gatk ScoreVariantAnnotations \
@@ -245,7 +246,7 @@
* See documentation in the modeling and scoring interfaces ({@link VariantAnnotationsModel} and
* {@link VariantAnnotationsScorer}, respectively), as well as the default Python IsolationForest implementation at
* {@link PythonSklearnVariantAnnotationsScorer} and
* org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
* src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
* </p>
*
* DEVELOPER NOTE: See documentation in {@link LabeledVariantAnnotationsWalker}.
@@ -282,7 +283,8 @@ public class ScoreVariantAnnotations extends LabeledVariantAnnotationsWalker {
public static final String SCORES_HDF5_SUFFIX = ".scores.hdf5";

@Argument(
fullName = MODEL_PREFIX_LONG_NAME)
fullName = MODEL_PREFIX_LONG_NAME,
doc = "Prefix for model files. This should be identical to the output prefix specified in TrainVariantAnnotationsModel." )
private String modelPrefix;

@Argument(
@@ -595,22 +597,22 @@ private String formatDouble(final double x) {
@Override
VCFHeader constructVCFHeader(final List<String> sortedLabels) {
final VCFHeader inputHeader = getHeaderForVariants();
final Set<VCFHeaderLine> inputHeaders = inputHeader.getMetaDataInSortedOrder();
final Set<VCFHeaderLine> sortedInputHeaderMetaData = inputHeader.getMetaDataInSortedOrder();

final Set<VCFHeaderLine> hInfo = new HashSet<>(inputHeaders);
final Set<VCFHeaderLine> hInfo = new HashSet<>(sortedInputHeaderMetaData);
hInfo.add(new VCFInfoHeaderLine(scoreKey, 1, VCFHeaderLineType.Float,
"Score according to the model applied by ScoreVariantAnnotations"));
hInfo.add(new VCFInfoHeaderLine(calibrationSensitivityKey, 1, VCFHeaderLineType.Float,
String.format("Calibration sensitivity corresponding to the value of %s", scoreKey)));
hInfo.add(new VCFFilterHeaderLine(lowScoreFilterName, "Low score (corresponding to high calibration sensitivity)"));

hInfo.addAll(getDefaultToolVCFHeaderLines());
if (snpKey != null) {
hInfo.add(new VCFInfoHeaderLine(snpKey, 1, VCFHeaderLineType.Flag, "This site was considered a SNP during filtering"));
}
hInfo.addAll(sortedLabels.stream()
.map(l -> new VCFInfoHeaderLine(l, 1, VCFHeaderLineType.Flag, String.format(RESOURCE_LABEL_INFO_HEADER_LINE_FORMAT_STRING, l)))
.collect(Collectors.toList()));
hInfo.addAll(getDefaultToolVCFHeaderLines());

return new VCFHeader(hInfo, inputHeader.getGenotypeSamples());
}