Releases: broadinstitute/gatk
4.6.1.0
Download release: gatk-4.6.1.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.6.1.0 release:
- Modernized the aging Conda environment with up-to-date Python dependencies. All the Python tools have been updated accordingly. This will enable easier integration of new machine learning tools.
- If you use Python tools outside of the docker image, you must rebuild your conda environment for this release.
- CNNScoreVariants has been replaced by NVScoreVariants, a rewritten and modernized version. The Python code for this tool was written by members of NVIDIA Genomics Research.
- Thank you Babak Zamirai, Ankit Sethia, Mehrzad Samadi, George Vacek and the whole NVIDIA genomics team!
- This GATK blog post has more of the story from when we first made the tool available for testing.
- New Funcotator argument --prefer-mane-transcripts, which improves transcript selection and lays groundwork for upcoming improvements.
- New argument --variant-output-filtering, which lets you restrict output variants based on the input intervals. This replaces and improves on --only-output-calls-starting-in-interval and works with SelectVariants and other VariantWalkers. This is useful to prevent duplicating variants when splitting an input VCF into multiple shards.
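To illustrate why interval-based output filtering matters when sharding, here is a minimal Python sketch. It is not GATK code: the Span type and the overlaps/starts_in helpers are hypothetical names, and the "starts in interval" rule is only inferred from the name of the older --only-output-calls-starting-in-interval argument. It shows a deletion that spans a shard boundary being picked up by both shards under a plain overlap rule, but by only one shard under a start-position rule.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A genomic span (hypothetical helper type, 1-based inclusive coordinates)."""
    contig: str
    start: int
    end: int


def overlaps(variant: Span, interval: Span) -> bool:
    # True when the variant shares at least one base with the interval.
    return (variant.contig == interval.contig
            and variant.start <= interval.end
            and interval.start <= variant.end)


def starts_in(variant: Span, interval: Span) -> bool:
    # True only when the variant's first base lies inside the interval.
    return (variant.contig == interval.contig
            and interval.start <= variant.start <= interval.end)


# Two adjacent shards and a deletion that crosses the boundary between them.
shards = [Span("chr1", 1, 100), Span("chr1", 101, 200)]
deletion = Span("chr1", 98, 103)

shards_by_overlap = [s for s in shards if overlaps(deletion, s)]
shards_by_start = [s for s in shards if starts_in(deletion, s)]
print(len(shards_by_overlap), len(shards_by_start))  # prints "2 1"
```

With the overlap rule the deletion would be written twice once the shards are merged; with the start-position rule it is written exactly once.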
Full list of changes:
-
CNNScoreVariants -> NVScoreVariants (#8004, #9010, #9009)
- CNNScoreVariants has been replaced by NVScoreVariants. Scripts that use it should be updated to use NVScoreVariants instead.
- The training tools (CNNVariantTrain, CNNVariantWriteTensors) have been removed. If you need to retrain the model for your data type, you should continue to use GATK 4.6.0.0. New training tools are in development to work alongside NVScoreVariants and will be added in subsequent releases.
-
New Tools
-
Joint Calling GVS
- Adds QD and AS_QD emission from VariantAnnotator on GVS input (#8978)
-
GenomicsDB
- Switch to logging a warning instead of an exception for intervals in query that were not part of GenomicsDBImport (#8987)
-
Funcotator
- Added a '--prefer-mane-transcripts' mode that enforces MANE_Select tagged Gencode transcripts where possible (#9012)
-
SV Calling
- Handle CTX_PP/QQ and CTX_PQ/QP CPX_TYPE values in SVConcordance (#8885)
- Complex SV intervals support by @mwalker174 (#8521)
- Require both overlap and breakend proximity for depth-only SV clustering (#8962)
-
Flow Based Calling
- Modified HaplotypeBasedVariantRecaller to support non-flow reads (#8896)
- FlowFeatureMapper: X_FILTERED_COUNT semantics adjusted and documented more accurately (#8894)
- Changes to flow arguments in haplotype caller from Picard (see Picard release notes)
-
Miscellaneous Features
- Added a check for whether files can be created and executed within the configured tmp-dir (#8951)
-
Documentation
- Clarify in the README which git lfs files are required to build GATK (#8914)
- Add docs about citing GATK (#8947)
- Update Mutect2.java Documentation (#8999)
- Add more detailed conda setup instructions to the GATK README (#9001)
- Added small warning messages advising users not to feed any GVCF files to these tools (#9008)
-
Refactoring
- Swapped mito mode in Mutect to use the mode argument utils (#8986)
-
Tests
-
Dependencies
Updating dependencies to make use of modern frameworks with fewer vulnerabilities was a focus of this release.
-
Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. (#8561)
-
Rebuild gatk-base docker image (3.3.1) in order to pull in recent patches (#9005)
-
Updates to java build and dependencies (#8998, #9006, #9016)
- Update to Gradle 8.10.2
- Improvements to build.gradle to make use of features like consuming published Bills of Materials (BOMs)
- Update many direct and transitive java dependencies to fix security vulnerabilities.
- Update Htsjdk 4.1.1 to 4.1.3
- Update Picard 3.2.0 to 3.3.0
- Update hdf5-java-bindings to version 1.2.0-hdf5_2.11.0 (#8908)
-
4.6.0.0
Download release: gatk-4.6.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.6.0.0 release:
- We've fixed a serious CRAM writing bug that affects GATK versions 4.3 through 4.5 and Picard versions 2.27.3 through 3.1.1. This bug can, in limited cases, lead to reads with an incorrect base sequence being written. See this comment to GATK issue 8768 and the full release notes below for more details on what conditions trigger the bug.
- To help users detect whether their CRAM files are affected, we've released a CRAM scanning tool called CRAMIssue8768Detector that can detect whether a particular CRAM file is affected by this bug. If you suspect that some of your CRAM files may have been affected, please run this tool on them for confirmation!
- By overwhelming popular demand, we've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0. This reverts the change described in our article GenotypeGVCFs and the death of the dot.
- We intend to publish a new article shortly to replace that older article with further details on this change. When we do so, we'll link to it from here.
- The Mutect2 germline resource can now have split multiallelic format
- Added an --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily
- We've fixed a number of issues with HTTP support, mainly affecting the loading of side inputs such as indices over HTTP
- Reduced the number of layers in the GATK docker image to help users running into docker quota issues
Full list of changes:
-
Important CRAM writing bug fix and detection tool
- We've updated to HTSJDK 4.1.1 and Picard 3.2.0 (#8900), which fix a serious bug in the CRAM writing code first reported in GATK issue 8768
- This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0.
- This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.
- The bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:
- At least one read is mapped to the very first base of a reference contig
- The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig
- When both of these conditions are met, the resulting CRAM file may have corrupt containers associated with that contig containing reads with an incorrect sequence.
- Since many common references such as hg38 have N's at the very beginning of the autosomes and X/Y, many pipelines will not be affected by this bug. However, users of a telomere-to-telomere reference, users doing mitochondrial calling, and users with reads aligned to the alt sequences will want to scan their CRAM files for possible corruption.
- The other mitigating circumstance is that when a CRAM is affected, the signal will be overwhelmingly obvious, with the mismatch rate typically jumping from sub-1% to 80-90% for the affected regions, making it likely to be caught by standard QC processes.
- We've released a CRAM scanning tool called CRAMIssue8768Detector (#8819) that can detect whether a particular CRAM file is affected by this bug. If you suspect that some of your CRAM files may have been affected, please run this tool on them for confirmation!
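The two trigger conditions for the CRAM bug can be approximated with a short Python sketch. This is a rough triage heuristic only, not a substitute for CRAMIssue8768Detector: the function name contigs_at_risk is hypothetical, reads are modelled as plain (contig, start) tuples, and "more than one container" is approximated as "more than 10,000 reads on the contig", using the container size quoted above.

```python
from collections import Counter

# Container size quoted in the release notes; treated here as an assumption
# about how reads are grouped, not as a property read from the CRAM file.
READS_PER_CONTAINER = 10_000


def contigs_at_risk(reads):
    """Return contigs that satisfy both trigger conditions.

    reads: iterable of (contig, start) tuples, with 1-based start positions.
    """
    reads_per_contig = Counter()
    has_read_at_first_base = set()
    for contig, start in reads:
        reads_per_contig[contig] += 1
        if start == 1:
            # Condition 1: a read mapped to the very first base of the contig.
            has_read_at_first_base.add(contig)
    # Condition 2 (approximation): enough reads to need more than one container.
    return sorted(
        contig for contig in has_read_at_first_base
        if reads_per_contig[contig] > READS_PER_CONTAINER
    )


# chrM has a read at position 1 and more than 10,000 reads in total;
# chr1 has many reads but none at its first base.
example_reads = (
    [("chrM", 1)]
    + [("chrM", 500)] * 10_000
    + [("chr1", 10_001)] * 20_000
)
print(contigs_at_risk(example_reads))  # prints "['chrM']"
```

A contig flagged by this kind of check is only a candidate; the scanning tool above is what confirms whether the file is actually affected.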
-
Joint Calling
- We've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0 (#8715) (#8741) (#8759)
- This reverts the change described in our article GenotypeGVCFs and the death of the dot
- Fix for GenotypeGVCFs with mixed ploidy sites (#8862)
- Fix for GnarlyGenotyper when PLs are null (#8878)
- Fixed bug in ReblockGVCF when removing annotations (#8870)
- Enable ReblockGVCF to subset AS annotations that aren't "raw" (pipe-delimited) (#8771)
- Remove header lines in ReblockGVCF when we remove FORMAT annotations (#8895)
- ReblockGVCF: Add malaria spanning deletion exception regression test with fix (#8802)
- Restore some GnarlyGenotyper tests (#8893)
-
HaplotypeCaller
- Fix for long deletions that overhang into the assembly window causing exceptions in HaplotypeCaller (#8731)
-
Mutect2
- The Mutect2 germline resource can now have split multiallelic format (#8837)
- Make the Mutect2 haplotype and clustered events filters smarter about germline events (#8717)
- Added the DragSTR model to the Mutect2 WDL (#8716)
- Improvements to Mutect2's Permutect training data mode (#8663)
- Bigger Permutect tensors and Permutect test datasets can be annotated with truth VCF (#8836)
- Mutect2 WDL and GetSampleName can handle multiple sample names in BAM headers (#8859)
- Permutect dataset engine outputs contig and read group indices, not names (#8860)
- Normal artifact LOD is now defined without the extra minus sign (#8668)
CNV Calling
- Fixed the GT header in PostprocessGermlineCNVCalls's --output-genotyped-intervals output (#8621)
-
SV Calling
-
Flow-based Calling
-
Notable Enhancements
- Added an --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily (#8724)
- Inverted SoftClippedReadFilter to conform to the standard filtering logic (#8888)
- Reduced the number of docker layers in the GATK image from 44 to 16 (#8808)
- VariantFiltration: added a --mask-description argument to write custom mask filter description in VCF header (#8831)
- GatherVcfsCloud is no longer beta (#8680)
-
Miscellaneous Changes
- GetPileupSummaries now uses the standard MappingQualityReadFilter instead of a custom --min-mapping-quality argument (#8781)
- Funcotator: suppress a log message about b37 contigs when not doing b37/hg19 conversion (#8758)
- Output the new image name at the end of a successful cloud docker build (#8627)
- Exclude the test folder from code coverage calculations (#8744)
- Removed deprecated genomes in the cloud docker image that was causing CNN WDL test failures (#8891)
- Re-commit large test files as lfs stubs (#8769)
- Standardize test results directory between normal/docker tests (#8718)
- Improve failure message in VariantContextTestUtils (#8725)
- Update the setup_cloud github action (#8651)
- Parameterize the logging frequency for ProgressLogger in GatherVcfsCloud (#8662)
-
Documentation
- Updated the README to include list of popular software included in docker image (#8745)
-
Dependencies
- Updated HTSJDK to 4.1.1, which fixes the CRAM writing bug described above (#8900)
- Updated Picard to 3.2.0, which fixes the CRAM writing bug described above (#8900)
- Updated GenomicsDB to 1.5.3, which supports M1 Macs and switches no-call representation back to ./. (#8710) (#8759)
- Updated http-nio to 1.1.1, which fixes several URL-handling bugs with HTTP support (#8889)
- Updated several miscellaneous dependencies to fix security vulnerabilities (#8898)
4.5.0.0
Download release: gatk-4.5.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.5.0.0 release:
- HaplotypeCaller now supports custom ploidy regions that can be specified via a new --ploidy-regions argument, overriding the global -ploidy setting
- The default SmithWaterman implementation for HaplotypeCaller and Mutect2 is now the hardware-accelerated version, resulting in a significant speedup
- Funcotator has a new datasource release that brings in the latest version of Gencode and several other key data sources
- We've updated our dependencies and our docker environment to greatly cut down on known security vulnerabilities
- We've greatly improved support for http/https inputs in GATK-native tools (though most Picard tools bundled with GATK do not yet support it)
- We've ported some additional DRAGEN features to HaplotypeCaller that bring us closer to functional equivalence with DRAGEN v3.7.8
- GenomicsDBImport now has support for Azure storage az:// URIs
- GnarlyGenotyper now has haploid support
- Lots of important bug fixes, including a fix for a bug in the Intel GKL that could cause output files to intermittently fail to be compressed properly
Full list of changes:
-
HaplotypeCaller
- HaplotypeCaller now supports custom ploidy regions (#8609)
- Added a new argument to HaplotypeCaller called --ploidy-regions which allows the user to input a .bed or .interval_list with the "name" column equal to a positive integer for the ploidy to use when calling variants in that region
- The main use case is for calling haploid variants outside the PAR for XY individuals as required by the VCF spec, but this provides a much more flexible interface for other similar niche applications, like genotyping individuals with other known aneuploidies
- The global -ploidy flag will still provide the background default (or the built-in ploidy of 2 for humans), but the user-supplied values will supersede these in overlapping regions
- Changed the SmithWaterman implementation to default to FASTEST_AVAILABLE (#8485)
- Fixed a bug in pileup calling mode relating to the number of haplotypes (#8489)
- Huge simplification of genotyping likelihoods calculations -- no change in output (#6351)
- Be explicit about when variants are biallelic (#8332)
- Fixed debug log severity for read threading assembler messages (#8419)
- Fixed issue with visibility of the --dont-use-softclipped-bases argument (#8271)
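The way --ploidy-regions values take precedence over the global -ploidy default can be sketched in a few lines of Python. This is an illustration of the lookup rule described above, not the HaplotypeCaller implementation: parse_ploidy_bed_line and ploidy_at are hypothetical names, the input is assumed to be a four-column tab-separated BED line with the ploidy in the name column, and the chrX coordinates are made up for the example rather than real PAR boundaries.

```python
def parse_ploidy_bed_line(line):
    """Parse one BED line whose 4th ("name") column holds the ploidy.

    Returns (contig, start, end, ploidy) with 1-based inclusive coordinates.
    """
    contig, start, end, name = line.rstrip("\n").split("\t")[:4]
    ploidy = int(name)
    if ploidy <= 0:
        raise ValueError(f"ploidy must be a positive integer, got {name!r}")
    # BED intervals are 0-based and half-open; convert to 1-based inclusive.
    return contig, int(start) + 1, int(end), ploidy


def ploidy_at(contig, pos, regions, default_ploidy=2):
    """Return the region ploidy covering pos, else the global default."""
    for region_contig, start, end, ploidy in regions:
        if region_contig == contig and start <= pos <= end:
            return ploidy
    return default_ploidy


# Hypothetical haploid region on chrX for an XY sample.
regions = [parse_ploidy_bed_line("chrX\t5000\t90000\t1")]

print(ploidy_at("chrX", 6000, regions))  # inside the region: prints "1"
print(ploidy_at("chr1", 6000, regions))  # no region: prints "2"
```

The linear scan keeps the sketch short; a real implementation handling many regions would more likely use an interval index.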
-
Mutect2
- Added a --base-qual-correction-factor to allow a scale factor to be provided to modify the base qualities reported by the sequencer and used in the Mutect2 substitution error model (#8447)
- Set to zero to turn off the error model changes introduced in GATK 4.1.9.0
- Fixed a bug in FilterMutectCalls for GVCFs (#8458)
- When using GVCFs with Mutect2 (for example with the Mitochondria mode), in the filtering step ADs for symbolic alleles are set to 0 so that they don't contribute to the overall AD. There was an off-by-one error that removed the alt allele AD rather than the <NON_REF> allele AD. This led to NaNs and errors when a site had no ref reads (for example a GT of [ref,alt,<NON_REF>] and AD of [0,300,0] would accidentally be changed to an AD of [0,0,0] if the alt index was removed instead of the <NON_REF> index).
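The FilterMutectCalls off-by-one error described above can be reproduced in miniature. The following Python sketch is an assumption about the shape of the bug, not the actual GATK code: both function names are hypothetical, and the buggy variant simply zeroes the index just before the symbolic allele. It uses the [ref,alt,<NON_REF>] / [0,300,0] example from the notes.

```python
import math

NON_REF = "<NON_REF>"


def zero_symbolic_ad(alleles, ad):
    """Intended behaviour: zero the AD of the symbolic allele only."""
    return [0 if allele == NON_REF else depth
            for allele, depth in zip(alleles, ad)]


def zero_symbolic_ad_off_by_one(alleles, ad):
    """Illustrative bug: zero the entry one index before the symbolic allele."""
    out = list(ad)
    for i, allele in enumerate(alleles):
        if allele == NON_REF:
            out[i - 1] = 0  # wrong index: hits the alt allele instead
    return out


def alt_fraction(ad, alt_index=1):
    """Alt allele fraction; NaN when the total depth is zero."""
    total = sum(ad)
    return ad[alt_index] / total if total else float("nan")


alleles = ["ref", "alt", NON_REF]
ad = [0, 300, 0]

fixed = zero_symbolic_ad(alleles, ad)
buggy = zero_symbolic_ad_off_by_one(alleles, ad)
print(fixed, alt_fraction(fixed))  # prints "[0, 300, 0] 1.0"
print(buggy, alt_fraction(buggy))  # prints "[0, 0, 0] nan"
```

When the site has no ref reads, wiping the alt depth leaves a total depth of zero, which is where the NaNs come from.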
-
DRAGEN-GATK
- Added implementations of the "columnwise detection" and "PDHMM" (partially-determined HMM) features from DRAGEN to bring us much closer to functional equivalence with DRAGEN v3.7.8 (#8083)
- Development work to prepare the way for the final missing DRAGEN 3.7.8 feature, "joint detection":
- Graph method for PDHMM event groups that unifies finding/merging and overlap/mutual exclusion (#8366)
- Rewrote haplotype construction methods in PartiallyDeterminedHaplotypeComputationEngine (#8367)
- More refactoring in PartiallyDeterminedHaplotypeComputationEngine and preparing for joint detection (#8492)
- Innocuous housekeeping changes in the partially-determined haplotypes code (#8361)
- Clarify cryptic bitwise operations in the partially-determined haplotype EventGroup subclass (#8400)
-
Joint Calling
- Added haploid support to GnarlyGenotyper (#7750)
- Fix to allow GenotypeGVCFs to properly handle events not in minimal representation (#8567)
- ReblockGVCF: added a --keep-site-filters argument to keep site-level filters (#8304) (#8308)
- ReblockGVCF: added a --add-site-filters-to-genotype argument to move site-level filters to genotype-level filters (#8484)
- ReblockGVCF: added a --format-annotations-to-remove argument to specify format-level annotations to remove from all genotypes in final GVCF (#8411)
- ReblockGVCF: added a check to make sure the input VCF is a GVCF rather than a single sample VCF (#8411)
- Improved an error message in GnarlyGenotyper (#8270)
- Added a mergeWithRemapping() method in ReferenceConfidenceVariantContextMerger to perform allele remapping prior to genotyping (#8318)
- GVS (Genomic Variant Store) development:
-
GenomicsDB
-
Funcotator
- New data source release V1.8 (#8512)
- Updated Gencode to version 43, and also updated COSMIC, Clinvar, and several other datasources to their latest versions
- The data sources are now split by reference into separate hg19 and hg38 bundles to cut down on size
- Fixed support for newer Gencode GTF versions by making the GencodeGTFField parsing more permissive (#8351)
- Fixed Funcotator VCF output renderer to correctly preserve B37 contig names on output for B37 aligned files (#8539)
- Fix bug in VCF comparison code that causes Funcotator to crash with certain datasources (#8445)
- Connected the splice site window size to CLI parameters (#8463)
- Allow LocatableXsvFuncotationFactory to read gzipped files (#8363)
-
CNV Calling
-
SV Calling
- Added support for breakend replacement alleles in SVCluster (#8408)
- Implements allele collapsing for "breakend replacement" BND alleles, as described in section 5.4 of the VCFv4.2 spec
- Size similarity linkage and bug fixes for SV matching tools (#8257)
- Added size similarity criterion to the SVConcordance and SVCluster tools. This is particularly useful for accurately matching smaller SVs that have a high degree of breakpoint uncertainty, in which case reciprocal overlap does not work well. PESR/mixed variant types must have size similarity, reciprocal overlap, and breakend window criteria met. Depth-only variants may have either size similarity + reciprocal overlap OR breakend window criteria met (or both).
- Updated SV split-read strand validation and clustering (#8378)
- Adds some flexibility to the allowed split-read strand annotations on SV records:
- Allow INS -+ strands
- Allow INV null strands
- When clustering, only require that strands match for INV/BND records
- Sample set and annotation improvements for SVConcordance (#8211)
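The linkage rule for size similarity, reciprocal overlap, and breakend window can be written out as a small Python sketch. The boolean combination follows the description above; everything else is an assumption made for illustration. In particular, the function names are hypothetical, size similarity is taken as the ratio of the shorter to the longer length, reciprocal overlap as the smaller of the two overlap fractions, and the thresholds are arbitrary.

```python
def length(sv):
    start, end = sv
    return end - start + 1


def size_similarity(a, b):
    # Assumed definition: shorter length divided by longer length.
    return min(length(a), length(b)) / max(length(a), length(b))


def reciprocal_overlap(a, b):
    # Assumed definition: overlap as a fraction of each SV, taking the minimum.
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return min(overlap / length(a), overlap / length(b))


def within_breakend_window(a, b, window):
    # Both breakpoints must lie within `window` bases of their counterparts.
    return abs(a[0] - b[0]) <= window and abs(a[1] - b[1]) <= window


def is_linked(a, b, depth_only, min_size_sim, min_overlap, window):
    size_ok = size_similarity(a, b) >= min_size_sim
    overlap_ok = reciprocal_overlap(a, b) >= min_overlap
    breakend_ok = within_breakend_window(a, b, window)
    if depth_only:
        # Depth-only: (size similarity AND reciprocal overlap) OR breakend window.
        return (size_ok and overlap_ok) or breakend_ok
    # PESR/mixed: all three criteria must hold.
    return size_ok and overlap_ok and breakend_ok


# Two same-sized SVs on one contig, offset by 100 bp: 90% reciprocal overlap,
# but the breakpoints are farther apart than the 50 bp window.
sv_a, sv_b = (1000, 1999), (1100, 2099)
thresholds = dict(min_size_sim=0.5, min_overlap=0.8, window=50)

print(is_linked(sv_a, sv_b, depth_only=True, **thresholds))   # prints "True"
print(is_linked(sv_a, sv_b, depth_only=False, **thresholds))  # prints "False"
```

The same pair links as depth-only calls but not as PESR/mixed calls, because only the latter require the breakend window to be satisfied as well.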
-
Mitochondrial pipeline
-
Flow-based Calling
- New/updated flow-based read tools (#8579)
- Added a new GroundTruthScorer tool to score reads against a reference/ground truth
- Updated FlowFeatureMapper
- Created an AddFlowBaseQuality tool that writes reads from flow-based SAM/BAM/CRAM files that pass criteria to a new file while adding a base-quality attribute (BQ) (#8235)
- Added an experimental tool FlowPairHMMAlignReadsToHaplotypes that aligns flow-based reads to a set of haplotypes / templates (#8305)
- Fixed an issue with reads that contain the tp tag sometimes being incorrectly identified as flow-based (#8337)
- Minor changes and fixes to flow-based annotations (#8442)
- Removed a line in FlowBasedAnnotation that contained a bug and thus was meaningless (#8421)
- Additional annotation in FeatureMap (#8347)
- Removed unnecessary flow-based argument and option (#8342)
- GroundTruthScorer doc update (#8597)
- Removed unnecessary and buggy validation check (#8580)
-
Notable Enhancements
- Major security fixes in our dependencies and docker environment
- Greatly improved HTTP support (#8611)
- Updated the http-nio library and made tweaks to HTSJDK to make it available in more places. The new version of http-nio should provide much more reliable access to http(s) file paths. This is supported by all methods accessing Paths, and includes SAM/BAM/CRAM and VCF/Feature file...
4.4.0.0
Download release: gatk-4.4.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.4.0.0 release:
- We've moved to Java 17, the latest long-term support (LTS) Java release, for building and running GATK! Previously we required Java 8, which is now end-of-life.
- Newer non-LTS Java releases such as Java 18 or Java 19 may work as well, but since they are untested by us we only officially support running with Java 17.
- Significant enhancements to SelectVariants, including arguments to enable GVCF filtering support and to work with genotype fields more easily.
- A new tool, SVConcordance, that calculates SV genotype concordance between an "evaluation" VCF and a "truth" VCF
- Bug fixes and enhancements to the support for the Ultima Genomics flow-based sequencing platform introduced in GATK 4.3.0.0
Full list of changes:
-
Flow-based Variant Calling
- FlowFeatureMapper: added surrounding-median-quality-size feature (#8222)
- Removed hardcoded limit on max homopolymer call (#8088)
- Fixed bug in dynamic read disqualification (#8171)
- Fixed a bug in the parsing of the T0 tag (#8185)
- Updated flow-based calling Mutect2 parameters to make them consistent with the HaplotypeCaller parameters (#8186)
-
SelectVariants
- Enabled GVCF type filtering support in SelectVariants (#7193)
- Added an optional argument --ignore-non-ref-in-types to support correct handling of VariantContexts that contain a NON_REF allele. This is necessary because every variant in a GVCF file would otherwise be assigned the type MIXED, which makes it impossible to filter for e.g. SNPs.
- Note that this only enables correct handling of GVCF input. The filtered output files are VCF (not GVCF) files, since reference blocks are not extended when a variant is filtered out.
- SelectVariants: added new arguments for controlling genotype JEXL filtering (#8092)
- -select-genotype: with this new genotype-specific JEXL argument, we support easily filtering by genotype fields with expressions like 'GQ > 0', where the behavior in the multi-sample case is 'GQ > 0' in at least one sample. It's still possible to manually access genotype fields using the old -select argument and expressions such as vc.getGenotype('NA12878').getGQ() > 0.
- --apply-jexl-filters-first: This flag is provided to allow the user to do JEXL filtering before subsetting the format fields, in particular the case where the filtering is done on INFO fields only, which may improve speed when working with a large cohort VCF that contains genotypes for thousands of samples.
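The "at least one sample" behaviour of -select-genotype in the multi-sample case can be modelled in a few lines of Python. This is a sketch of the selection semantics only: it does not evaluate JEXL, the function name passes_select_genotype is hypothetical, and genotypes are represented as plain dictionaries keyed by sample name. NA12891 is an arbitrary second sample name used for the example.

```python
def passes_select_genotype(genotypes, predicate):
    """Keep the site if the predicate holds for at least one sample.

    genotypes: mapping of sample name to a dict of genotype fields.
    predicate: callable applied to each sample's genotype fields.
    """
    return any(predicate(fields) for fields in genotypes.values())


def gq_positive(fields):
    # Stand-in for the expression 'GQ > 0'.
    return fields["GQ"] > 0


site_kept = {"NA12878": {"GQ": 0}, "NA12891": {"GQ": 35}}
site_dropped = {"NA12878": {"GQ": 0}, "NA12891": {"GQ": 0}}

print(passes_select_genotype(site_kept, gq_positive))     # prints "True"
print(passes_select_genotype(site_dropped, gq_positive))  # prints "False"
```

Requiring the condition in every sample instead would correspond to replacing any with all, which is not the behaviour described above.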
-
SV Calling
-
Notable Enhancements
- GenotypeGVCFs: added a --keep-specific-combined-raw-annotation argument to keep specified raw annotations (#7996)
- VariantAnnotator now warns instead of failing when the variant contains too many alleles (#8075)
- Read filters now output total reads processed in addition to the number of reads filtered (#7947)
- Added GenomicsDB arguments to the CreateSomaticPanelOfNormals tool (#6746)
- Added a DeprecatedFeature annotation and a process for officially marking GATK tools as deprecated (#8100)
- Prevent tool close() methods from hiding underlying errors (#7764)
-
Bug Fixes
- Fixed issue causing VariantRecalibrator to sometimes fail if user provided duplicate -an options (#8227)
- ReblockGVCF: remove A, R, and G length attributes when ReblockGVCF subsets an allele (#8209)
- Previously if an input gVCF had allele length, reference length, or genotype length annotations in the FORMAT field, ReblockGVCF would not remove all of them at sites where an allele was dropped. This makes the output gVCF invalid since the annotation length no longer matches the length described in the header at those sites. Now we fix up F1R2, F2R1, and AF annotations and remove any other annotations that are not already handled that are defined as A, R, or G length in the header.
- Fixed a gCNV bug that breaks the inference when only 2 intervals are provided (#8180)
- Fixed NPE from uninitialized logger in GenotypingEngine (#8159)
- Fixed asynchronous Python exception propagation in StreamingPythonExecutor/CNNScoreVariants (#7402)
- Fixed issue in ShiftFasta where the interval list output was never written (#8070)
- Bugfix for the type of some output files in the somatic CNV WDL (#6735) (#8130)
- MergeAnnotatedRegions now requires a reference as asserted in its documentation (#8067)
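To see why the ReblockGVCF fix above matters, recall that the VCF header declares A, R, and G length annotations as one value per alt allele, one per allele including the reference, and one per possible genotype. The short Python sketch below computes those counts; the function name expected_value_count is hypothetical, and the code only illustrates the arithmetic, not what ReblockGVCF does internally.

```python
from math import comb


def expected_value_count(number, n_alt, ploidy=2):
    """Number of values a field must carry for a given VCF Number code."""
    n_alleles = n_alt + 1  # alt alleles plus the reference
    if number == "A":
        return n_alt
    if number == "R":
        return n_alleles
    if number == "G":
        # Unordered genotypes: multisets of size `ploidy` drawn from the alleles.
        return comb(n_alleles + ploidy - 1, ploidy)
    raise ValueError(f"unsupported Number code: {number!r}")


# A diploid site before and after one of two alt alleles is dropped.
before = {code: expected_value_count(code, n_alt=2) for code in "ARG"}
after = {code: expected_value_count(code, n_alt=1) for code in "ARG"}
print(before)  # prints "{'A': 2, 'R': 3, 'G': 6}"
print(after)   # prints "{'A': 1, 'R': 2, 'G': 3}"
```

An annotation left at its old length after an allele is dropped (for example six G-length values where three are now expected) no longer matches the header, which is the invalid output described above.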
-
Miscellaneous Changes
- Deprecated an untested VariantRecalibrator argument and an old ReblockGVCF argument that produced invalid GVCFs (#8140)
- Removed old GnarlyGenotyper code with a diploid assumption to prepare for adding haploid support to GnarlyGenotyper (#8140)
- ReblockGVCF: add error message for when tree-score-threshold is set but the TREE_SCORE annotation is not present (#8218)
- TransferReadTags: allow empty unaligned bams as input (#8198)
- Refactored JointVcfFiltering WDL and expanded tests. (#8074)
- Updated the carrot github action workflow to the most recent version, which supports using #carrot_pr to trigger branch vs master comparison runs (#8084)
- Replaced uses of File.createTempFile() with IOUtils.createTempFile() to ensure that temp files are deleted on shutdown (#6780)
- Don't require python just to instantiate the CNNScoreVariants tool classes. (#8128)
- Made several Funcotator methods and fields protected so it is easier to extend the tool (#8124) (#8166)
- Test for presence of ack result message and simplify ProcessControllerAckResult API (#7816)
- Fixed the path reported by the gatkbot when there are test failures (#8069)
- Fixed incorrect boolean value in DirichletAlleleDepthAndFractionIntegrationTest (#7963)
- Removed two ancient and unused HaplotypeCaller test files that are no longer needed (#7634)
- Added scattered gCNV case WDL to dockstore file (#8217)
-
Documentation
- Updated instructions for installing Java in the README (#8089)
- Added documentation on OMP_NUM_THREADS and MKL_NUM_THREADS to GermlineCNVCaller and DetermineGermlineContigPloidy (#8223)
- Improvements to PileupDetectionArgumentCollection documentation (#8050)
- Fixed typo in documentation for VariantAnnotator (#8145)
-
Dependencies
- Moved to Java 17, the latest LTS Java release, for building/running GATK (#8035)
- Updated Gradle to 7.5.1 (#8098)
- Updated the GATK base docker image to 3.0.0 (#8228)
- Updated HTSJDK to 3.0.5 (#8035)
- Updated Picard to 3.0.0 (#8035)
- Updated Barclay to 5.0.0 (#8035)
- Updated GenomicsDB to 1.4.4 (#7978)
- Updated Spark to 3.3.1 (#8035)
- Updated Hadoop to 3.3.1. (#8102)
- Require commons-text 1.10.0 to fix a security vulnerability (#8071)
4.3.0.0
Download release: gatk-4.3.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.3.0.0 release:
- Support for the Ultima Genomics flow-based sequencing platform
- A next-generation suite of tools for variant filtration based on site-level annotation, intended to eventually supersede the older VariantRecalibrator workflow
- CompareReferences and CheckReferenceCompatibility: new tools for comparing and checking compatibility with genomic references
- Support in HaplotypeCaller/Mutect2 for supplementing the variants discovered in local assembly with variants discovered via a pileup-based approach
Full list of changes:
-
Support for the Ultima Genomics flow-based sequencing platform (#7876)
- Added a new --flow-mode argument to HaplotypeCaller which better supports flow-based calling
- Added a new Haplotype Filtering step after assembly which removes suspicious haplotypes from the genotyper
- Added two new likelihoods models, FlowBasedHMM and the FlowBasedAlignmentLikelihoodEngine
- Added a new --flow-mode argument to Mutect2 which better supports flow-based calling
- Added support for uncertain read end-positions in MarkDuplicatesSpark
- Added a new tool FlowFeatureMapper for quick heuristic calling of bams for diagnostics
- Added a new tool GroundTruthReadsBuilder to generate ground truth files for Basecalling
- Added a new diagnostic tool HaplotypeBasedVariantRecaller for recalling VCF files using the HaplotypeCallerEngine
- Added a new tool breaking up CRAM files by their blocks, SplitCram
- Added a new read interface called FlowBasedRead that manages the new features for FlowBased data
- Added a number of flow-specific read filters
- Added a number of flow-specific variant annotations
- Added support for read annotation-clipping as part of clipreads and GATKRead
- Added a new PartialReadsWalker that supports terminating before traversal is finished
-
Next-generation suite of tools for variant filtration based on site-level annotations (#7954) (#8049)
- This tool suite is intended to eventually supersede the older VariantRecalibrator workflow
- The new tools include:
- ExtractVariantAnnotations: extracts site-level variant annotations, labels, and other metadata from a VCF file to HDF5 files
- TrainVariantAnnotationsModel: trains a model for scoring variant calls based on site-level annotations
- ScoreVariantAnnotations: scores variant calls in a VCF file based on site-level annotations using a previously trained model
-
New Reference Comparison Tools
- CompareReferences: a new tool for analyzing the differences between references at both the dictionary and the base level (#7930) (#7987) (#7973)
- In its default mode, this tool uses the reference dictionaries to generate an MD5-keyed table comparing the specified references, and does an analysis to summarize the differences between the references provided.
- Comparisons are made against a "primary" reference, specified with the -R argument. Subsequent references to be compared may be specified using the --references-to-compare argument.
- A supplementary table keyed by sequence name can be displayed using the --display-sequences-by-name argument; to display only sequence names for which the references are not consistent, run with the --display-only-differing-sequences argument as well.
- MD5s can be recalculated from the actual sequence when missing from the dictionary
- When run with --base-comparison FULL_ALIGNMENT, the tool performs full-sequence alignment on the differing reference sequences to produce a VCF with SNPs and Indels. However, this mode ignores IUPAC / N bases.
- Running with --base-comparison FIND_SNPS_ONLY finds single-base differences between differing reference sequences of the same length. This mode can handle IUPAC / N bases correctly, but not indels.
- To perform the full-sequence alignment, GATK now packages a distribution of MUMmer for x86_64 Mac and Linux, which can be invoked from within the GATK using the new MummerExecutor class.
- CheckReferenceCompatibility: a new tool to check a BAM/CRAM/VCF for compatibility against a set of references (#7959) (#7973)
- This tool generates a table analyzing the compatibility of a BAM/CRAM/VCF input file against provided references.
- The tool works to compare BAM/CRAMs (specified using the -I argument) as well as VCFs (specified using the -V argument) against provided reference(s), specified using the --references-to-compare argument.
- When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length.
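The compatibility rule described for CheckReferenceCompatibility, and the idea of recomputing an MD5 from the sequence itself, can be sketched in Python as follows. This is a simplified model rather than the tool's implementation: SequenceRecord, sequence_md5, and records_match are hypothetical names, and the MD5 is computed over the upper-cased bases, following the convention the SAM specification uses for the M5 tag.

```python
import hashlib
from typing import NamedTuple, Optional


class SequenceRecord(NamedTuple):
    """One sequence dictionary entry (hypothetical helper type)."""
    name: str
    length: int
    md5: Optional[str] = None


def sequence_md5(bases: str) -> str:
    # MD5 over upper-cased bases, per the SAM M5 convention.
    return hashlib.md5(bases.upper().encode("ascii")).hexdigest()


def records_match(a: SequenceRecord, b: SequenceRecord) -> bool:
    """Compare name and length always, and MD5 only when both are present."""
    if a.name != b.name or a.length != b.length:
        return False
    if a.md5 is not None and b.md5 is not None:
        return a.md5 == b.md5
    return True


reference = SequenceRecord("chrM", 8, sequence_md5("ACGTACGT"))
same_bases = SequenceRecord("chrM", 8, sequence_md5("acgtacgt"))
other_bases = SequenceRecord("chrM", 8, sequence_md5("ACGTACGA"))
no_md5 = SequenceRecord("chrM", 8)

print(records_match(reference, same_bases))   # prints "True"
print(records_match(reference, other_bases))  # prints "False"
print(records_match(reference, no_md5))       # prints "True"
```

The last case shows the weaker guarantee when MD5s are missing: two sequences with the same name and length are treated as compatible even though their bases were never compared.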
-
HaplotypeCaller/Mutect2
- Added an optional "Pileup Detection" step to Mutect2 and HaplotypeCaller before assembly that supplements the variants from local assembly with variants that show up in the pileups (#7432)
- Fixed a Mutect2 IndexOutOfBoundException with germline resource (#7979)
- Mutect3 dataset enhancements: optional truth VCF for labels, seq error likelihood annotation (#7975)
- Added Mutect3 dataset generation to the Mutect2 WDL (#7992)
- GetPileupSummaries now streams its output rather than storing it in memory (#7664)
- Fixed a rare edge case in the AdaptiveChainPruner where the Java PriorityQueue is undefined for tied elements (#7851)
-
SV Calling
- `CondenseDepthEvidence`: a new tool that combines adjacent intervals in DepthEvidence files (#7926)
- `LocusDepthtoBAF`: a new tool that merges locus-sorted LocusDepth evidence files, calculates the bi-allelic frequency (BAF) for each sample and site, and writes these values as a BafEvidence output file (#7776)
- `PrintReadCounts`: a new tool that prints (and optionally subsets) a read depth (DepthEvidence) file or a counts file as one or more (for multi-sample DepthEvidence files) counts files for CNV determination (#8015)
- `CollectSVEvidence`: fixed a bug where trailing SNP sites and depth intervals without read coverage were being omitted from the output (#8045)
- `CollectSVEvidence`: added read depth generation and raw-counts output (#8015)
- Improved `PrintSVEvidence` performance by tweaking the `MultiFeatureWalker` traversal (#7869)
- Fixes related to `BafEvidence` (bi-allelic frequency of a sample at some locus) (#7861)
- Fixed a bug where the end coordinate was being incorrectly compared when sorting discordant read pair evidence (#7835)
- Sort output from `SVClusterEngine` (#7779)
- Remove abandoned SV filtering project and unneeded build dependency (#7950)
-
CNV Calling
-
GenomicsDB
- `GenomicsDBImport`: added the ability to specify explicit index locations via the sample name map file (#7967)
  - Each line in the sample name map file may now optionally contain a third column with the path/URI to the index. This is useful when the index is not in the same location as the corresponding GVCF.
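To make the optional third column concrete, here is a small parsing sketch in Python. It assumes the map file is tab-delimited with one sample per line; the function name `parse_sample_map` and the paths in the usage example are hypothetical, and this is not the parser GATK itself uses.

```python
def parse_sample_map(text):
    """Parse sample name map text into (sample, gvcf, index_or_None) tuples."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(
                f"line {line_no}: expected 2 or 3 tab-separated columns, "
                f"got {len(fields)}"
            )
        sample, gvcf = fields[0], fields[1]
        # The third column, when present, points at the index explicitly.
        index = fields[2] if len(fields) == 3 else None
        entries.append((sample, gvcf, index))
    return entries


example = (
    "sampleA\tgs://bucket-a/sampleA.g.vcf.gz\n"
    "sampleB\tgs://bucket-a/sampleB.g.vcf.gz\tgs://bucket-b/sampleB.g.vcf.gz.tbi\n"
)
for entry in parse_sample_map(example):
    print(entry)
```

In the example, `sampleA` relies on the index sitting next to its GVCF, while `sampleB` names an index stored in a different location.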
-
Bug Fixes
- Fixed an issue where we weren't properly merging AD values when combining GVCFs and no PLs were present (#7836)
- Fixed a bug in `ReblockGVCF` that could cause the first position on a contig to be dropped (#8028)
- Fixed an allele-ordering issue in the allele-specific annotation code (#7585)
- `VariantRecalibrator`: type change int -> long to prevent tranche novel variant count overflow (#7864)
- Fixed an issue with tabix index generation (#7858)
- Fixed a bug in `SiteDepthCodec` (#7910)
-
Miscellaneous Changes
- `VariantsToTable` now includes all fields when none are specified (#7911)
- `SelectVariants` now warns the user about poor performance when the sample names in the VCF header are unsorted (#7887)
- `VariantRecalibrator` now has a `--dont-run-rscript` argument to disable execution of its R script but still output the actual R script file (#7900)
- Added some generic read tag/expression filters for use on numeric tags (#7746)
- Replaced Travis CI with GitHub Actions for our continuous testing (#7754)
- Switched over to GitHub Actions for building our nightly docker image (#7775)
- Created a new `build_docker_remote.sh` script for building the docker image remotely with Google Cloud Build (#7951)
- Added an argument mode manager for group arguments and a demonstration of how it might be used in `HaplotypeCaller` `--dragen-mode` (#7745)
- Added unit tests for the `Utils.concat()` methods (#7918)
- Added a test to validate WDLs in the scripts directory. (#7826)
- Added a `use_allele_specific_annotation` arg and fixed task with empty input in the `JointVcfFiltering` WDL (#8027)
- Fixed an issue in the GATK stats script in which the first day's downloads on a new release were set to 0 (#7794)
- Fixed a typo in the Dockerfile that broke git lfs pull (#7806)
- Removed unused code in the `utils.solver` package (#7922)
- Corrected the time for GATK nightly build cron jobs (#7784)
- Disabled the red "X" from failing `CodeCov` builds and de...
4.2.6.1
Download release: gatk-4.2.6.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.6.1 release:
This release contains a single bug fix for `GenotypeGVCFs` to fix an erroneous `IllegalStateException` ("No likelihood sum exceeded zero -- method was called for variant data with no variant information.") in the edge case where unnormalized PLs are present at monomorphic sites.
4.2.6.0
Download release: gatk-4.2.6.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.6.0 release:
-
Important bug fixes for the joint calling tools (GenotypeGVCFs / GenomicsDB)
- GATK 4.2.5.0 contained two joint genotyping bugs that are now fixed in GATK 4.2.6.0:
  - `GenotypeGVCFs` can throw NullPointerExceptions in some cases with many alternate alleles.
  - The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
- If you are running these tools in 4.2.5.0 we strongly recommend updating to 4.2.6.0
-
Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when the `--gcs-project-for-requester-pays` argument was specified
- If you continue to encounter problems accessing requester pays Google Cloud Storage buckets in 4.2.6.0, please let us know by filing a GitHub issue!
-
Two new tools for the Structural Variation calling pipeline: `SVAnnotate` and `PrintSVEvidence`
-
Some fixes to genotype-given-alleles mode in `HaplotypeCaller` and `Mutect2`
Full list of changes:
-
Joint Calling (GenotypeGVCFs / GenomicsDB)
- GATK 4.2.5.0 contained two joint genotyping bugs which are now fixed in 4.2.6.0:
  - `GenotypeGVCFs` can throw NullPointerExceptions in some cases with many alternate alleles.
    - Fixed in:
      - Fix for `NullPointerException` when GenomicsDB has more ALT alleles than specified maximum and many GQ0 hom-ref genotypes allow variants to pass the QUAL filter (#7738)
  - The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
    - Fixed in:
      - Fix multi-allelic QUAL calculation and restore some missing ALT annotation data in `ReblockGVCFs` (#7670)
- Mention acceptable compressed VCF file extensions in `GenomicsDBImport` error message (#7692)
-
SV Calling
- Added a new tool `SVAnnotate` (#7431)
  - `SVAnnotate` adds functional annotations for SVs called by `GATK-SV` (#7431)
- Added a new tool `PrintSVEvidence` (#7695)
  - `PrintSVEvidence` is a tool that can merge any number of files containing one of five types of evidence of structural variation. It is also capable of subsetting regions or samples. It is used to merge evidence from a cohort in the `GATK-SV` pipeline.
- Added start/end coordinate validation to `SVCallRecord` (#7714)
-
HaplotypeCaller / Mutect2
- Fixed an edge case in `HaplotypeCaller` where filtered alleles in the vicinity of forced-calling alleles could result in empty calls (#7740)
  - This affects users who run genotype given alleles mode in non-GVCF mode
- Fixed a bug in `HaplotypeCaller` and `Mutect2` where force-calling alleles were lost upon trimming; allele injection now takes place after trimming (#7679)
- Added a debug `--pair-hmm-results-file` argument that dumps the exact inputs/outputs of the PairHMM to a file (#7660)
- Some changes to `Mutect2` to support the future `Mutect3` (#7663)
  - Added training data for the Mutect3 normal artifact filter
  - Output tensors for Mutect3 as plain text rather than VCF
-
RNA Tools
- `TransferReadTags`: a new tool that transfers a read tag from an unaligned BAM to the matching aligned BAM (#7739).
  - This tool allows us to retrieve read tags that get lost when converting a SAM file to FASTQs, then back to SAM (which is necessary if e.g. running fastp to clip adapter bases before alignment).
- `PostProcessReadsForRSEM`: a new tool that re-orders and filters reads before running RSEM, which has stringent requirements on the input SAM (https://github.com/deweylab/RSEM) (#7752).
-
Funcotator
- Added custom `VariantClassification` severity ordering. (#7673)
  - Users can now customize the severity ratings of the various `VariantClassifications` using the new `--custom-variant-classification-order` argument
- Added logging statements to the b37 conversion process explaining why the automatic b37 conversion does or does not take place on the input VCFs (#7760)
-
VariantRecalibrator
- Added regularization to covariance in GMM maximization step to fix convergence issues in `VariantRecalibrator` (#7709)
  - This makes the tool more robust in cases where annotations are highly correlated
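Why highly correlated annotations cause trouble, and why regularization helps, can be seen in a two-annotation toy case. The sketch below is a plain-Python illustration only: adding a small constant to the diagonal is one common form of covariance regularization and is an assumption here, as is the `eps` value. The release note does not state which form `VariantRecalibrator` uses.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def regularize(cov, eps=1e-6):
    """Return a copy of cov with eps added to each diagonal entry (assumed form)."""
    n = len(cov)
    return [[cov[i][j] + (eps if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Two perfectly correlated annotations give a singular covariance matrix,
# so the Gaussian density (which needs the inverse) is undefined.
singular = [[1.0, 1.0], [1.0, 1.0]]
print(det2(singular))              # determinant is zero
print(det2(regularize(singular)))  # small but strictly positive
```

Once the determinant is strictly positive the matrix can be inverted, so the maximization step no longer breaks down on degenerate inputs.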
-
Bug Fixes
- Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when `--gcs-project-for-requester-pays` was specified (#7700) (#7730)
- Fix for the `PossibleDeNovo` annotation to work without Genotype Likelihoods (#7662)
  - `PossibleDeNovo` previously checked each trio's genotype (including parent hom ref genotypes) for likelihoods even though it doesn't actually use the PLs. The PLs can get dropped if GVCFs are reblocked, which means the annotation no longer worked as expected. This changes the check to look for GQs instead of PLs, as the GQs are used as part of the annotation.
- Fixed a bug with the `--mate-too-distant-length` argument in `MateDistantReadFilter` not being configurable (#7701)
-
GATK Engine
-
Miscellaneous Changes
- Added back the `jcenter` repository resolver to our gradle build, fixing a "Could not find biz.k11i:xgboost-predictor:0.3.0" error when building GATK from source (#7665)
- We now properly update the `latest` tag in the `broadinstitute/gatk-nightly` Dockerhub repo (#7703)
- The docker build now only does a `git lfs pull` on `src/main/resources/large` (#7727)
- Install git lfs with --force in the `Dockerfile` (#7682)
- Fix WDL generation for `MultiVariantWalkers` by adding a companion index to the `MultiVariantWalker` input variant arg (#7689)
- Added a Google Apps Script to automatically update GATK release stats. (#7637)
- Updated the GATK stats script to be more universally usable (#7759)
- Added `JointCallExomeCNVs` to `.dockstore.yml` and included a note in the WDL (#7719)
-
Documentation
- Corrected the docs for the `--heterozygosity` argument in the `GenotypeCalculationArgumentCollection` (#7661)
-
Dependencies
4.2.5.0
Download release: gatk-4.2.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.5.0 release:
-
Fixed a `GenotypeGVCFs` `IllegalStateException` error reported by multiple users in #7639
-
Added a new tool `SVCluster` that clusters structural variants based on coordinates, event type, and supporting algorithms.
Full list of changes:
-
Joint Calling (GenotypeGVCFs / GenomicsDB)
- Fixed an `IllegalStateException` in `GenotypeGVCFs` arising from GenomicsDB output with too many alts and no likelihoods, and also added a `--genomicsdb-max-alternate-alleles` argument that is separate from the `--max-alternate-alleles` argument used by `GenotypeGVCFs` (#7655)
  - This fixes the `GenotypeGVCFs` error reported in #7639
  - The new `--genomicsdb-max-alternate-alleles` argument is required to be at least one greater than the `--max-alternate-alleles` argument, to account for the NON_REF allele.
- `ReblockGVCF`: fixed an edge case where hom-ref "variant" records with no data had wrong-sized PLs and didn't merge with adjacent blocks (#7644)
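The constraint between the two alternate-allele limits amounts to a simple check. The Python sketch below only mirrors the rule as stated in the notes above; the function name and error message are hypothetical and do not reproduce GATK's own validation code or wording.

```python
def validate_max_alt_alleles(max_alternate_alleles,
                             genomicsdb_max_alternate_alleles):
    """Raise ValueError unless the GenomicsDB limit leaves room for NON_REF."""
    required = max_alternate_alleles + 1  # one extra slot for the NON_REF allele
    if genomicsdb_max_alternate_alleles < required:
        raise ValueError(
            "--genomicsdb-max-alternate-alleles "
            f"({genomicsdb_max_alternate_alleles}) must be at least {required}, "
            f"one greater than --max-alternate-alleles ({max_alternate_alleles})"
        )


validate_max_alt_alleles(6, 7)      # accepted: 7 >= 6 + 1
# validate_max_alt_alleles(6, 6)    # would raise ValueError
```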
-
SV Calling
- Added a new tool `SVCluster` that clusters structural variants based on coordinates, event type, and supporting algorithms. (#7541)
  - Primary use cases include:
    - Clustering SVs produced by multiple callers, based on interval overlap, breakpoint proximity, and sample overlap.
    - Merging multiple SV VCFs with disjoint sets of samples and/or variants.
    - Defragmentation of copy number variants produced with depth-based callers.
-
Mutect2
-
GATK Engine
- Added a new read filter, `ExcessiveEndClippedReadFilter` (#7638)
  - This filter will keep reads that have fewer than the specified number of clipped bases on either end.
  - Designed with long reads in mind, and as a result has a default value of 1000.
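The filter rule can be sketched from a read's CIGAR string. This Python illustration is not the GATK implementation: it assumes both soft clips (`S`) and hard clips (`H`) count toward the total, which the real filter may not do, and the function names are hypothetical. The default of 1000 and the "fewer than" comparison come from the notes above.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_clip_lengths(cigar):
    """Return (left, right) clipped base counts at the two ends of a CIGAR."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]

    def clipped(run):
        total = 0
        for length, op in run:
            if op not in "SH":  # stop at the first non-clipping operator
                break
            total += length
        return total

    return clipped(ops), clipped(reversed(ops))


def keep_read(cigar, max_clipped_bases=1000):
    """Keep a read only if both ends have fewer clipped bases than the limit."""
    left, right = end_clip_lengths(cigar)
    return left < max_clipped_bases and right < max_clipped_bases
```

Under these assumptions a long read with CIGAR `1200S5000M` is dropped, while `999S5000M10S` is kept.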
4.2.4.1 the log4j strikes back
Download release: gatk-4.2.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.4.1 release:
- Fix more newly discovered log4j2 vulnerabilities. Now that people are paying attention they are finding all sorts of things.
Full list of changes:
-
Build System
- Upgrade our build from Gradle 5.6 to the newest 7.3.2 (#7609)
- This fixes some gradle bugs which were blocking development
-
GenomicsDB
-
Miscellaneous Changes
-
Dependencies
4.2.4.0 the log4shell edition
Download release: gatk-4.2.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.4.0 release:
- Fix a major security bug due to log4j vulnerability. (CVE-2021-44228)
- Improvement to calculation of ExcessHet in joint genotyping. (GenotypeGVCFs, GnarlyGenotyper, ExcessHet).
Full list of changes:
-
Funcotator
- Aligned the Funcotator checkIfAlreadyAnnotated test with the Funcotator engine code. (#7555)
-
GenotypeGVCFs / ExcessHet
- Removed undocumented mid-p correction to p-values in exact test of Hardy-Weinberg equilibrium and updated corresponding tests. We now report the same value as ExcHet in bcftools. Note that previous values of 3.0103 (corresponding to mid-p values of 0.5) will now be 0.0000. (#7394)
- Updated expected ExcessHet values in integration test resources and added an update toggle to GnarlyGenotyperIntegrationTest.
- Updated ExcessHet documentation.
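The two values quoted in the ExcessHet note follow from Phred scaling of the p-value. The short Python check below assumes the annotation is reported as -10 * log10(p); the function name is hypothetical.

```python
import math


def phred_scaled(p):
    """Phred-scale a p-value in (0, 1], returning a non-negative float."""
    value = -10.0 * math.log10(p)
    return value if value > 0.0 else 0.0  # avoid returning negative zero


print(f"{phred_scaled(0.5):.4f}")  # old mid-p value of 0.5
print(f"{phred_scaled(1.0):.4f}")  # uncorrected p-value of 1.0
```

A mid-p value of 0.5 scales to 3.0103, while the corresponding uncorrected p-value of 1.0 scales to 0.0000, which matches the change described above.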
-
Miscellaneous Changes
-
Documentation
-
Dependencies
- Updated log4j from version 2.13.1 to 2.16.0 to patch CVE-2021-44228 (#7605)