Commit b3d6fec
Added documentation for TrainVariantAnnotationsModel and addressed review comments.
samuelklee committed Aug 8, 2022
1 parent c507f96 commit b3d6fec
Showing 153 changed files with 430 additions and 292 deletions.
ExtractVariantAnnotations.java
@@ -34,19 +34,19 @@
* This tool is intended to be used as the first step in a variant-filtering workflow that supersedes the
* {@link VariantRecalibrator} workflow. This tool extracts site-level annotations, labels, and other relevant metadata
* from variant sites (or alleles, in allele-specific mode) that are or are not present in specified labeled
* resource VCFs (e.g., training or calibration VCFs). The former, present sites are considered labeled; each site
* can have multiple labels. The latter sites are considered unlabeled and can be randomly downsampled using
* reservoir sampling; extraction of these is optional. The outputs of the tool are HDF5 files containing the
* extracted data for labeled and (optional) unlabeled variant sets, as well as a sites-only indexed VCF containing
* the labeled variants.
* resource VCFs (e.g., training or calibration VCFs). Input sites that are present in the resources are considered
* labeled; each site can have multiple labels if it is present in multiple resources. Other input sites that are
* not present in any resources are considered unlabeled and can be randomly sampled using reservoir sampling;
* extraction of these is optional. The outputs of the tool are HDF5 files containing the extracted data for
* labeled and (optional) unlabeled variant sets, as well as a sites-only indexed VCF containing the labeled variants.
* </p>
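A minimal, self-contained sketch of the reservoir sampling (Algorithm R) referred to above, for readers unfamiliar with the technique; the class and method names are made up for illustration and this is not the tool's actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch of reservoir sampling (Algorithm R), not GATK code:
// after n items have been seen, each one is retained with probability maxSize / n.
final class ReservoirSamplingSketch<T> {
    private final List<T> reservoir;
    private final int maxSize;
    private final Random rng;
    private long numSeen = 0;

    ReservoirSamplingSketch(final int maxSize, final long seed) {
        this.maxSize = maxSize;
        this.reservoir = new ArrayList<>(maxSize);
        this.rng = new Random(seed);
    }

    void add(final T item) {
        numSeen++;
        if (reservoir.size() < maxSize) {
            reservoir.add(item);                                 // fill the reservoir first
        } else {
            final long j = (long) (rng.nextDouble() * numSeen);  // uniform index in [0, numSeen)
            if (j < maxSize) {
                reservoir.set((int) j, item);                    // replace a random slot
            }
        }
    }

    List<T> getSample() {
        return reservoir;
    }
}

A single pass over the unlabeled sites therefore yields a uniform random sample of at most maxSize of them without holding all of them in memory.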
*
* <p>
* The extracted sets can be provided as input to the {@link TrainVariantAnnotationsModel} tool
* to produce an annotation-based model for scoring variant calls. This model can in turn be provided
* along with a VCF file to the {@link ScoreVariantAnnotations} tool, which assigns a score to each call
* (with a lower score indicating that a call is more likely to be an artifact and should perhaps be filtered).
* Each score can also be converted to a corresponding sensitivity to a calibration set, if the latter is available.
* Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.
* </p>
*
* <p>
@@ -87,7 +87,7 @@
* provided resources.
* </li>
* <li>
* (Optional) Maximum number of unlabeled variants (or alleles) to randomly sample with reservoir downsampling.
* (Optional) Maximum number of unlabeled variants (or alleles) to randomly sample with reservoir sampling.
* If nonzero, annotations will also be extracted from unlabeled sites (i.e., those that are not present
* in the labeled resources).
* </li>
@@ -160,7 +160,8 @@
* Extract annotations from training/calibration SNP/INDEL sites, producing the outputs
* 1) {@code extract.annot.hdf5}, 2) {@code extract.vcf.gz}, and 3) {@code extract.vcf.gz.tbi}.
* The HDF5 file can then be provided to {@link TrainVariantAnnotationsModel}
* to train a model using a positive-only approach.
* to train a model using a positive-only approach. Note that the {@value MODE_LONG_NAME} arguments are made
* explicit here, although both SNP and INDEL modes are selected by default.
*
* <pre>
* gatk ExtractVariantAnnotations \
@@ -184,6 +185,8 @@
* 1) {@code extract.annot.hdf5}, 2) {@code extract.unlabeled.annot.hdf5}, 3) {@code extract.vcf.gz},
* and 4) {@code extract.vcf.gz.tbi}. The HDF5 files can then be provided to {@link TrainVariantAnnotationsModel}
* to train a model using a positive-negative approach (similar to that used in {@link VariantRecalibrator}).
* Note that the {@value MODE_LONG_NAME} arguments are made explicit here, although both SNP and INDEL modes are
* selected by default.
*
* <pre>
* gatk ExtractVariantAnnotations \
@@ -207,7 +210,8 @@
* unlabeled sites, producing the outputs 1) {@code extract.unlabeled.annot.hdf5},
* 2) {@code extract.vcf.gz} (which will contain no records), and 3) {@code extract.vcf.gz.tbi}.
* This random sample cannot be used by {@link TrainVariantAnnotationsModel}, but may still be useful for
* exploratory analyses.
* exploratory analyses. Note that the {@value MODE_LONG_NAME} arguments are made explicit here, although both
* SNP and INDEL modes are selected by default.
*
* <pre>
* gatk ExtractVariantAnnotations \
@@ -244,7 +248,10 @@ public final class ExtractVariantAnnotations extends LabeledVariantAnnotationsWa
fullName = MAXIMUM_NUMBER_OF_UNLABELED_VARIANTS_LONG_NAME,
doc = "Maximum number of unlabeled variants to extract. " +
"If greater than zero, reservoir sampling will be used to randomly sample this number " +
"of sites from input sites that are not present in the specified resources.",
"of sites from input sites that are not present in the specified resources. " +
"Choice of this number should be guided by considerations for training the negative model in " +
"TrainVariantAnnotationsModel; users may wish to choose a number that is comparable to the " +
"expected size of the labeled training set or that is compatible with available memory resources.",
minValue = 0)
private int maximumNumberOfUnlabeledVariants = 0;

@@ -266,16 +273,16 @@ public void afterOnTraversalStart() {
}
if (!resourceLabels.contains(LabeledVariantAnnotationsData.CALIBRATION_LABEL)) {
logger.warn("No calibration set found! If you are using the downstream TrainVariantAnnotationsModel and ScoreVariantAnnotations tools " +
"and wish to convert scores to sensitivity to a calibration set of variants, " +
"and wish to convert scores to sensitivity with respect to a calibration set of variants, " +
"provide sets of known polymorphic loci marked with the calibration=true feature input tag. " +
"For example, --resource:hapmap,calibration=true hapmap.vcf");
}

rng = RandomGeneratorFactory.createRandomGenerator(new Random(reservoirSamplingRandomSeed));
unlabeledDataReservoir = maximumNumberOfUnlabeledVariants == 0
? null
: new LabeledVariantAnnotationsData(annotationNames, resourceLabels, useASAnnotations, maximumNumberOfUnlabeledVariants);
}
: new LabeledVariantAnnotationsData(annotationNames, resourceLabels, useASAnnotations, maximumNumberOfUnlabeledVariants); // we pass resourceLabels here so that both labeled and unlabeled
} // HDF5 files will have the same directory structure

@Override
protected void nthPassApply(final VariantContext variant,
@@ -345,7 +352,8 @@ private static void setExtractedVariantInData(final LabeledVariantAnnotationsDat
private void writeUnlabeledAnnotationsToHDF5() {
final File outputUnlabeledAnnotationsFile = new File(outputPrefix + UNLABELED_TAG + ANNOTATIONS_HDF5_SUFFIX);
if (unlabeledDataReservoir.size() == 0) {
throw new GATKException("No unlabeled variants were present in the input VCF.");
throw new GATKException(String.format("No unlabeled variants were present in the input VCF. " +
"Consider setting the %s argument to 0.", MAXIMUM_NUMBER_OF_UNLABELED_VARIANTS_LONG_NAME));
}
for (final VariantType variantType : variantTypesToExtract) {
logger.info(String.format("Extracted unlabeled annotations for %d variants of type %s.",
LabeledVariantAnnotationsWalker.java
@@ -13,8 +13,6 @@
import org.apache.commons.collections4.ListUtils;
import org.apache.commons.lang3.tuple.Triple;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.engine.FeatureContext;
import org.broadinstitute.hellbender.engine.FeatureInput;
@@ -28,7 +26,6 @@
import org.broadinstitute.hellbender.utils.variant.GATKVCFHeaderLines;
import org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils;
import org.broadinstitute.hellbender.utils.variant.VcfUtils;
import picard.cmdline.programgroups.VariantFilteringProgramGroup;

import java.io.File;
import java.util.ArrayList;
@@ -83,15 +80,8 @@
* This results in the following output:
*
* - an HDF5 file, as above
* - a VCF file, containing the input variants, with labels and scores appended for those passing variant-type checks TODO + calibration-sensitivity scores + filters applied?
* - a VCF file, containing the input variants, with labels, scores, and filters appended/applied for those passing variant-type checks
*/
@CommandLineProgramProperties(
// TODO
summary = "",
oneLineSummary = "",
programGroup = VariantFilteringProgramGroup.class
)
@DocumentedFeature
public abstract class LabeledVariantAnnotationsWalker extends MultiplePassVariantWalker {

public static final String MODE_LONG_NAME = "mode";
@@ -204,7 +194,7 @@ public void onTraversalStart() {
resourceLabels.addAll(trackResourceLabels);
logger.info( String.format("Found %s track: labels = %s", resource.getName(), trackResourceLabels));
}
resourceLabels.forEach(String::intern);
resourceLabels.forEach(String::intern); // TODO evaluate if this affects memory usage and remove if not needed

if (resourceLabels.contains(LabeledVariantAnnotationsData.SNP_LABEL)) {
throw new UserException.BadInput(String.format("The resource label \"%s\" is reserved for labeling variant types.",
@@ -233,7 +223,6 @@ public Object onTraversalSuccess() {
return null;
}

// TODO maybe clean up all this Triple and metadata business with a class?
static void addExtractedVariantToData(final LabeledVariantAnnotationsData data,
final VariantContext variant,
final List<Triple<List<Allele>, VariantType, TreeSet<String>>> metadata) {
@@ -289,13 +278,13 @@ void writeExtractedVariantToVCF(final VariantContext vc,
// modified from VQSR code
// TODO we're just writing a standard sites-only VCF here, maybe there's a nicer way to do this?
VCFHeader constructVCFHeader(final List<String> sortedLabels) {
Set<VCFHeaderLine> hInfo = getDefaultToolVCFHeaderLines();
hInfo.addAll(sortedLabels.stream()
Set<VCFHeaderLine> hInfo = sortedLabels.stream()
.map(l -> new VCFInfoHeaderLine(l, 1, VCFHeaderLineType.Flag, String.format(RESOURCE_LABEL_INFO_HEADER_LINE_FORMAT_STRING, l)))
.collect(Collectors.toList()));
.collect(Collectors.toCollection(TreeSet::new));
hInfo.add(GATKVCFHeaderLines.getFilterLine(VCFConstants.PASSES_FILTERS_v4));
final SAMSequenceDictionary sequenceDictionary = getBestAvailableSequenceDictionary();
hInfo = VcfUtils.updateHeaderContigLines(hInfo, null, sequenceDictionary, true);
hInfo.addAll(getDefaultToolVCFHeaderLines());
return new VCFHeader(hInfo);
}
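For readers unfamiliar with the htsjdk calls in the hunk above, here is a standalone sketch of the same pattern of turning resource labels into Flag INFO header lines and assembling a header; the labels and description text are made up and this is not GATK code:

import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFHeaderLine;
import htsjdk.variant.vcf.VCFHeaderLineType;
import htsjdk.variant.vcf.VCFInfoHeaderLine;

// Standalone illustration: one Flag INFO header line per resource label,
// collected into a TreeSet so the lines come out in sorted order.
public final class LabelHeaderSketch {
    public static void main(final String[] args) {
        final Set<VCFHeaderLine> headerLines = Stream.of("training", "calibration")  // made-up labels
                .map(l -> new VCFInfoHeaderLine(l, 1, VCFHeaderLineType.Flag,
                        String.format("This site was labeled as %s according to resources", l)))
                .collect(Collectors.toCollection(TreeSet::new));
        final VCFHeader header = new VCFHeader(headerLines);
        header.getMetaDataInSortedOrder().forEach(System.out::println);
    }
}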

@@ -328,8 +317,8 @@ final List<Triple<List<Allele>, VariantType, TreeSet<String>>> extractVariantMet
// corresponding to alt alleles that pass variant-type and overlapping-resource checks
return vc.getAlternateAlleles().stream()
.filter(a -> !GATKVCFConstants.isSpanningDeletion(a))
.filter(a -> variantTypesToExtract.contains(VariantType.getVariantType(vc, a)))
.map(a -> Triple.of(Collections.singletonList(a), VariantType.getVariantType(vc, a),
.filter(a -> variantTypesToExtract.contains(VariantType.getAlleleSpecificVariantType(vc, a)))
.map(a -> Triple.of(Collections.singletonList(a), VariantType.getAlleleSpecificVariantType(vc, a),
findOverlappingResourceLabels(vc, vc.getReference(), a, featureContext)))
.filter(t -> isExtractUnlabeled || !t.getRight().isEmpty())
.collect(Collectors.toList());
ScoreVariantAnnotations.java
@@ -21,7 +21,6 @@
import org.broadinstitute.hellbender.exceptions.GATKException;
import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils;
import org.broadinstitute.hellbender.tools.copynumber.utils.HDF5Utils;
import org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator;
import org.broadinstitute.hellbender.tools.walkers.vqsr.scalable.data.LabeledVariantAnnotationsData;
import org.broadinstitute.hellbender.tools.walkers.vqsr.scalable.data.VariantType;
@@ -56,15 +55,15 @@
* This tool is intended to be used as the last step in a variant-filtering workflow that supersedes the
* {@link VariantRecalibrator} workflow. Using a previously trained model produced by {@link TrainVariantAnnotationsModel},
* this tool assigns a score to each call (with a lower score indicating that a call is more likely to be an artifact).
* Each score can also be converted to a corresponding sensitivity to a calibration set, if the latter is available.
* Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.
* Each VCF record can also be annotated with additional resource labels and/or hard filtered based on its
* calibration-set sensitivity, if desired.
* </p>
*
* <p>
* Note that annotations and metadata are collected in memory during traversal until they are written to HDF5 files
* upon completion of the traversal. Memory requirements thus roughly scale linearly with both the number of sites
* scored and the number of annotations. For large callsets, this tool may be run in parallel over separate
* upon completion of the traversal. Memory and disk requirements thus roughly scale linearly with both the number
* of sites scored and the number of annotations. For large callsets, this tool may be run in parallel over separate
* genomic shards using the {@value StandardArgumentDefinitions#INTERVALS_LONG_NAME} argument as usual.
* </p>
*
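As a very rough, illustrative back-of-the-envelope calculation of that linear scaling (assuming annotation values are held as 8-byte doubles; the site and annotation counts are made up, and real usage will be higher because of per-site metadata, JVM overhead, and HDF5 buffering):

// Rough, illustrative estimate of the raw annotation-matrix footprint only.
public final class AnnotationFootprintEstimate {
    public static void main(final String[] args) {
        final long numSites = 50_000_000L;   // hypothetical number of sites to score
        final long numAnnotations = 6L;      // hypothetical number of annotations per site
        final long bytes = numSites * numAnnotations * Double.BYTES;
        System.out.printf("~%.1f GB for raw annotation values alone%n", bytes / 1e9);
    }
}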
@@ -180,6 +179,8 @@
* produced by {@link ExtractVariantAnnotations}. Records will be filtered according to the values provided to the
* {@value SNP_CALIBRATION_SENSITIVITY_THRESHOLD_LONG_NAME} and {@value INDEL_CALIBRATION_SENSITIVITY_THRESHOLD_LONG_NAME}
* arguments; the values below are only meant to be illustrative and should be set as appropriate for a given analysis.
* Note that the {@value MODE_LONG_NAME} arguments are made explicit here, although both SNP and INDEL modes are
* selected by default.
*
* <pre>
* gatk ScoreVariantAnnotations \
@@ -245,7 +246,7 @@
* See documentation in the modeling and scoring interfaces ({@link VariantAnnotationsModel} and
* {@link VariantAnnotationsScorer}, respectively), as well as the default Python IsolationForest implementation at
* {@link PythonSklearnVariantAnnotationsScorer} and
* org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
* src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
* </p>
*
* DEVELOPER NOTE: See documentation in {@link LabeledVariantAnnotationsWalker}.
@@ -282,7 +283,8 @@ public class ScoreVariantAnnotations extends LabeledVariantAnnotationsWalker {
public static final String SCORES_HDF5_SUFFIX = ".scores.hdf5";

@Argument(
fullName = MODEL_PREFIX_LONG_NAME)
fullName = MODEL_PREFIX_LONG_NAME,
doc = "Prefix for model files. This should be identical to the output prefix specified in TrainVariantAnnotationsModel." )
private String modelPrefix;

@Argument(
@@ -595,22 +597,22 @@ private String formatDouble(final double x) {
@Override
VCFHeader constructVCFHeader(final List<String> sortedLabels) {
final VCFHeader inputHeader = getHeaderForVariants();
final Set<VCFHeaderLine> inputHeaders = inputHeader.getMetaDataInSortedOrder();
final Set<VCFHeaderLine> sortedInputHeaderMetaData = inputHeader.getMetaDataInSortedOrder();

final Set<VCFHeaderLine> hInfo = new HashSet<>(inputHeaders);
final Set<VCFHeaderLine> hInfo = new HashSet<>(sortedInputHeaderMetaData);
hInfo.add(new VCFInfoHeaderLine(scoreKey, 1, VCFHeaderLineType.Float,
"Score according to the model applied by ScoreVariantAnnotations"));
hInfo.add(new VCFInfoHeaderLine(calibrationSensitivityKey, 1, VCFHeaderLineType.Float,
String.format("Calibration sensitivity corresponding to the value of %s", scoreKey)));
hInfo.add(new VCFFilterHeaderLine(lowScoreFilterName, "Low score (corresponding to high calibration sensitivity)"));

hInfo.addAll(getDefaultToolVCFHeaderLines());
if (snpKey != null) {
hInfo.add(new VCFInfoHeaderLine(snpKey, 1, VCFHeaderLineType.Flag, "This site was considered a SNP during filtering"));
}
hInfo.addAll(sortedLabels.stream()
.map(l -> new VCFInfoHeaderLine(l, 1, VCFHeaderLineType.Flag, String.format(RESOURCE_LABEL_INFO_HEADER_LINE_FORMAT_STRING, l)))
.collect(Collectors.toList()));
hInfo.addAll(getDefaultToolVCFHeaderLines());

return new VCFHeader(hInfo, inputHeader.getGenotypeSamples());
}