Releases: etal/cnvkit
Version 0.7.1
This is primarily a bugfix release. Many more unit test cases were added to the automated test suite. Code coverage is now monitored at Codecov (thanks @stevepeak).
export nexus-basic:
- New optional argument -v/--vcf extracts SNV b-allele frequencies from the given VCF file, matches them to the bins in the .cnr file, and prints an additional "baf" column in the output table. These allele frequencies can then be viewed in Nexus Copy Number, similar to a SNP array.
call:
- Fixed a bug in the threshold method where the copy number of haploid chromosomes was twice what it should be. The clonal method already handled these chromosomes properly. (#49)
reference:
- Handle blank/empty antitarget BED and coverage (.cnn) files. This was a regression in v0.7.0 relative to earlier releases. (#51)
- When calculating GC and RepeatMasker values, catch invalid BED ranges that extend beyond the length of the chromosome and raise an informative error. This would error before, too (in ngfrills.faidx), but the message would be baffling.
fix:
- Catch duplicated target ranges, e.g. the exact same bait labeled with two different gene names, and report those ranges in the error message. The target command's --split option should usually fix these, but sometimes it's not used.
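A check along these lines can be sketched in Python. This is a minimal illustration, not CNVkit's actual code: the function name find_duplicate_ranges and the (chromosome, start, end, gene) tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_ranges(targets):
    """Return {(chrom, start, end): [gene names]} for ranges seen more than once.

    `targets` is an iterable of (chrom, start, end, gene) tuples; this layout
    is hypothetical and chosen only for the example.
    """
    genes_by_range = defaultdict(list)
    for chrom, start, end, gene in targets:
        genes_by_range[(chrom, start, end)].append(gene)
    return {rng: genes for rng, genes in genes_by_range.items() if len(genes) > 1}


targets = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 100, 200, "GENE_B"),  # same bait, different gene label
    ("chr1", 300, 400, "GENE_C"),
]
duplicates = find_duplicate_ranges(targets)
if duplicates:
    # Name the offending ranges so the user can see what to fix.
    report = "; ".join(
        f"{chrom}:{start}-{end} ({', '.join(genes)})"
        for (chrom, start, end), genes in sorted(duplicates.items())
    )
    print("Duplicated target ranges:", report)
```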
Version 0.7.0
CNVkit now depends on pandas, SciPy, and PyVCF. The internals were largely rewritten, so please report any bugs or other regressions you find.
Documentation is much improved.
export:
- VCF format is supported (#5, #41). The generated VCFs are compatible with many third-party tools, including development versions of MetaSV. (Thanks @chapmanb)
- Removed the "freebayes" sub-command; use "export bed" instead.
segment:
- The names of genes (or other targeted loci) covered by each segment are now included in the output .cns file.
- The p-value or q-value threshold (depending on the method) can now be specified with -t/--threshold.
- The "haar" method works properly now (#6). This segmentation algorithm is implemented in Python and does not require R to run. It is a bit faster than CBS, but not as accurate.
loh:
- Plot variant allele frequencies (VAFs) as their actual values, 0 to 1, instead of the mirrored b-allele frequency (0.5 to 1). Draw segment mean allele frequencies separately above and below 0.5. This matches how the equivalent SNP array data are typically viewed.
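The difference between the two representations can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, and computing one mean for the variants above 0.5 and one for those below is one plausible reading of "separately above and below 0.5", not necessarily what the plotting code does.

```python
def mirrored_baf(vaf):
    """Fold an allele frequency in [0, 1] onto [0.5, 1] (the older display)."""
    return 0.5 + abs(vaf - 0.5)


def split_segment_means(vafs):
    """Mean allele frequency of the variants above and below 0.5.

    Returns (mean_above, mean_below); either is None if no variant falls on
    that side. Variants at exactly 0.5 are ignored in this sketch.
    """
    above = [v for v in vafs if v > 0.5]
    below = [v for v in vafs if v < 0.5]
    mean_above = sum(above) / len(above) if above else None
    mean_below = sum(below) / len(below) if below else None
    return mean_above, mean_below


segment_vafs = [0.2, 0.3, 0.7, 0.8]
folded = [mirrored_baf(v) for v in segment_vafs]
mean_above, mean_below = split_segment_means(segment_vafs)
```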
antitarget:
- Generate off-target bins for all chromosomes present in the "access" BED file, not just those where targeted regions occur. (#37)
coverage:
- A minimum read mapping quality (MAPQ) value can now be specified with -q/--min-mapq. The default value is 0, i.e. reads are no longer excluded for low MAPQ or ambiguous mapping location. This should generally improve calling accuracy and avoid some spurious deletion calls.
Version 0.6.1
Small fixes in segmentation, affecting the output of segment and preventing crashes in segmetrics:
- Exclude fewer low-coverage bins from segmentation (using a lower minimum coverage threshold).
- In case the first or last bins on a chromosome were excluded from segmentation, adjust the first and last segments on each chromosome so that their endpoints match the first and last bins.
- If no bins on a chromosome passed the coverage filter, instead of omitting the chromosome from segmentation output, generate a single segment covering the full chromosome, with segment log2 ratio 0.0. (So, all chromosomes in the .cnr file will be present in the .cns file, too.)
Version 0.6.0
Added two new commands, call and segmetrics, and a new export format, BED.
segmetrics:
- Calculates summary statistics of the residual bin-level log2 ratio estimates from the segment means, similar to the existing metrics command, but for each segment individually. Results are output in the same format as the CNVkit segmentation file (.cns), with the stat names and calculated values printed in the "gene" column.
- Supported stats:
  - standard deviation, median absolute deviation, inter-quartile range, Tukey's biweight midvariance (as in metrics);
  - confidence interval, estimated by bootstrap;
  - prediction interval, estimated by the range between the 2.5-97.5 percentiles of bin-level log2 ratio values within the segment.
- Thanks to @mjafin for suggesting this feature (#28).
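As a rough illustration of these per-segment statistics, here is a minimal standard-library sketch. It is not CNVkit's implementation: the function names, the number of bootstrap resamples, the 95% interval width for the confidence interval, and the linear-interpolation percentile are all assumptions, and Tukey's biweight midvariance is left out for brevity.

```python
import random
import statistics


def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def segment_stats(bin_log2, n_boot=1000, seed=0):
    """Summary statistics for the bins of one segment (needs at least 2 bins)."""
    seg_mean = statistics.fmean(bin_log2)
    # Residuals of the bin-level values from the segment mean.
    residuals = [x - seg_mean for x in bin_log2]
    med = statistics.median(residuals)
    # Bootstrap: resample bins with replacement and recompute the mean.
    rng = random.Random(seed)
    boot_means = [
        statistics.fmean(rng.choices(bin_log2, k=len(bin_log2)))
        for _ in range(n_boot)
    ]
    return {
        "mean": seg_mean,
        "stdev": statistics.stdev(residuals),
        "mad": statistics.median([abs(r - med) for r in residuals]),
        "iqr": percentile(residuals, 75) - percentile(residuals, 25),
        # Confidence interval of the segment mean, from the bootstrap means.
        "ci": (percentile(boot_means, 2.5), percentile(boot_means, 97.5)),
        # Prediction interval: 2.5-97.5 percentile range of the bin values.
        "pi": (percentile(bin_log2, 2.5), percentile(bin_log2, 97.5)),
    }


stats = segment_stats([-0.2, -0.1, 0.0, 0.1, 0.2])
```

The confidence interval describes uncertainty in the segment mean, so it narrows as a segment gains bins, while the prediction interval describes the spread of individual bins and does not.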
call:
- Given segmented log2 ratio estimates (.cns file), round the copy ratio estimates to integer values using either:
  - A list of threshold log2 values for each copy number state, or
  - Some algebra, given known tumor cell fraction and normal ploidy. (This was previously available through the export freebayes command, see below.)
- The output is another .cns file, where the values in the log2 column are still log2-transformed, but represent integers in log2 scale -- e.g. a neutral diploid state is represented as "0.0", not the integer 2. These output files are still compatible with the other CNVkit commands that accept .cns files, and can be plotted the same way.
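The two rounding approaches, and the "integers in log2 scale" output, can be sketched as below. This is a hedged illustration rather than the code behind the call command: the function names, the example threshold values, the floor used for zero copies, and the simple two-population mixture model (tumor cells plus normal cells at the normal ploidy) are all assumptions.

```python
import bisect
import math


def call_by_thresholds(log2_ratio, thresholds=(-1.1, -0.4, 0.3, 0.7)):
    """Copy number i is assigned when log2_ratio <= thresholds[i].

    Values above every threshold get len(thresholds) copies. The default
    thresholds are illustrative only.
    """
    return bisect.bisect_left(thresholds, log2_ratio)


def call_by_purity(log2_ratio, purity, ploidy=2):
    """Solve for the tumor copy number in a tumor/normal mixture.

    Assumed model: observed copies = purity * tumor + (1 - purity) * ploidy.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    observed = ploidy * 2 ** log2_ratio
    tumor = (observed - (1 - purity) * ploidy) / purity
    return max(int(round(tumor)), 0)


def copies_to_log2(copies, ploidy=2, floor=-10.0):
    """Express an integer copy number on the log2 ratio scale.

    `floor` is a hypothetical placeholder for zero copies, where the true
    log2 ratio would be negative infinity.
    """
    if copies <= 0:
        return floor
    return math.log2(copies / ploidy)


# A single-copy loss in a sample with 50% tumor cells has an observed ratio of
# (0.5 * 1 + 0.5 * 2) / 2 = 0.75; purity correction recovers 1 copy.
tumor_copies = call_by_purity(math.log2(0.75), purity=0.5)
rescaled_log2 = copies_to_log2(tumor_copies)
```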
export bed:
- New bed format supporting the same features as export freebayes that were not moved into the call command (see above). The output BED file is still compatible with the FreeBayes --cnv-map option. In addition, export bed has the new option --show-neutral to also output neutral-CN segments/regions, in addition to the CNV regions output by default.
- The export freebayes sub-command is deprecated but still available in this release; it will be removed in the next release. This command supported the tumor-purity adjustment now implemented in the call command. The recommended approach is to instead run call first on each .cns file, and then export bed on all the adjusted .cns files to get an equivalent BED file compatible with the FreeBayes --cnv-map option.
Smaller changes:
- gainloss: Reduced the default log2 ratio threshold from 0.5 to 0.2.
- import-picard: Use the un-normalized mean coverage instead of the normalized coverage of each target as the log2 coverage values in the output .cnn file. This matches the output of the coverage command; CNVkit normalizes coverages later in the pipeline.
- Some internal refactoring. Please report any bugs, real or perceived, on our GitHub issue tracker.
Version 0.5.1
Bug fixes for two edge cases in whole genome analyses (thanks @chapmanb):
- reference: Merging target and antitarget .cnn files where antitargets are empty
- diagram: Avoid trying to plot segments over the start or end of chromosomes
Version 0.5.0
This release includes a variety of improvements to CNVkit's calling accuracy and robustness. All CNVkit files built with previous versions will continue to work with this version, but for best results, I recommend rebuilding your reference.cnn file(s) from the targetcoverage.cnn and antitargetcoverage.cnn files.
coverage:
- Output target/antitarget coverage (.cnn) files are no longer median-centered. Read depths in each bin are still log2-scaled, but the observed read depth can now be easily recovered from .cnn files.
reference, fix:
- Include a "flat pseudocount" in addition to the given normals, making paired tumor-normal calling much more robust and accurate.
- Perform bias corrections on the input normal samples before calculating the average and spread of log2 values.
fix:
- Do bias corrections before subtracting the reference, instead of after, because the reference already includes bias corrections now.
- In addition to weighting bins by spread (which can only be observed with a pooled reference), also weight by bin size and deviation of reference log2 values in each bin from the global median. So, useful bin weights are now derived from "flat" and single-normal-sample references, too.
segment:
- Recalculate CBS segment means using bin weights (in the R library this is simply the mean, arguably a bug).
- Set CBS segment start/end positions to match the underlying bin start/end positions.
- Improved centromere detection -- only exclude one "large gap", if any, from each chromosome.
- Tuned CBS calling parameters to improve accuracy (see benchmarks in the repo etal/cnvkit-examples).
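To make the first point concrete, here is a minimal sketch of a weighted segment mean next to the plain mean. The function name and the example weights are hypothetical; how CNVkit derives bin weights is described in the fix notes for this release and is not reproduced here.

```python
def weighted_mean(values, weights):
    """Mean of `values` where each bin contributes in proportion to its weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


bin_log2 = [0.0, 0.0, 0.0, 1.0]
bin_weights = [1.0, 1.0, 1.0, 0.2]  # the noisy last bin is down-weighted

plain = sum(bin_log2) / len(bin_log2)
weighted = weighted_mean(bin_log2, bin_weights)
```

With these numbers the unreliable bin pulls the plain mean to 0.25, while the weighted mean stays much closer to the value of the well-supported bins.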
diagram:
- Label genes using the same criteria as the gainloss command: if segments are given, use the segment value at each gene, otherwise calculate the weighted average of bin-level log2 values within each gene.
- New option -m/--min-probes to match gainloss.
- Guess gender from chrX more reliably, so that the same gender is called from the bin-level (.cnr) and segmented (.cns) values given.
scatter, loh:
- When plotting allele frequencies from a VCF, if segments are given (.cns), also apply those segments to allele frequencies to show LOH regions that match CNVs.
- Skip somatic variants identified in a VCF, and try to retain only germline variants, when plotting LOH. (This is not very well standardized across callers, so please watch for bad behavior from callers other than FreeBayes and MuTect, and let me know about it!)
- scatter only: Added options --y-min, --y-max to set y-axis limits on the plot.
- Removed the deprecated -r option. Use -c instead.
The long-deprecated cbs command has been removed. Use segment instead.
Bugs in parsing and writing empty and 1-line VCF, BED and CNVkit files, and other VCF quirks, have now been fixed (Thanks @chapmanb!)
Version 0.4.1
New features:
scatter command:
- Option -c can now take coordinate ranges like -r, so -r is deprecated and will be removed in the next release.
genome2access.py script:
- New -x option to exclude additional regions. Added a new file "data/access-5k-mappable.hg19.bed", which was generated with this option to exclude the ENCODE "Duke" and "DAC" low-mappability regions.
Also:
- Improved the help/usage messages for several commands. Added a "version" command that prints the current CNVkit version. (Thanks @HenrikBengtsson)
- Tuned CBS calling parameters to improve segmentation accuracy according to some benchmarks.
- Sped up a few slow functions identified by profiling. In particular, metrics is much faster now.
- Fixed bugs/incompatibilities in plotting commands and cleaned up the source code (Thanks @chapmanb and @roryk)
CNVkit can now be obtained and run as a Docker container:
https://registry.hub.docker.com/u/etal/cnvkit/
Version 0.4.0
New features:
- Plotting (scatter and loh commands):
  - Support VCFs from more callers, including MuTect, VarScan and FreeBayes. Support multi-sample VCFs; the sample in the VCF can be selected by name with the -i option, and will also be shown as the plot title. Thanks to Brad Chapman (@chapmanb) for this contribution. (#11)
  - Enable highlighting of selected regions other than genes using the -r and -w options. The plot title (sample ID) can also be specified with -i/--sample-id. Thanks to Brad Chapman (@chapmanb) for this contribution. (#9)
  - New -l/--range-list option to plot a BED file of regions, each in its own plot, and combine the generated plots into a single multi-page PDF file. Thanks to Rory Kirchner (@roryk) for this contribution. (#21)
- FreeBayes export format can now handle multiple samples (.cns files).
Changes:
- Renamed --male-normal option to --male-reference (but kept -y alias) in all commands that had it.
- export options: Specify sample name with -i/--sample-id option instead of -n.
- scatter plotting command: added --min-variant-depth option to match loh. (#10)
- The loh plot command does not attempt significance testing anymore; we're working on a better solution. (#10, #18)
Bug fixes:
- Handle empty BED/region/interval_list files, so that an empty "antitarget" file can be used when analyzing WGS or targeted amplicon capture datasets. (#19)
- Ignore "." labels for genes, the same way we already ignore "-" labels, for better interoperability with BEDtools. Thanks to Brad Chapman (@chapmanb) for this contribution. (#12)
- Accept "sample.bai" as index for "sample.bam". (#8)
- SEG import: The option --from-log10 now works to convert log10 ratio values to log2 scale.
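The conversion itself is a change of logarithm base, log2(x) = log10(x) * log2(10). A minimal sketch, with a hypothetical function name:

```python
import math


def log10_to_log2(value):
    """Convert a log10 ratio to the equivalent log2 ratio."""
    return value * math.log2(10)


# A two-fold gain is about 0.301 in log10 scale and 1.0 in log2 scale.
two_fold = log10_to_log2(math.log10(2))
```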
Documentation has also improved substantially, including the installation instructions. The built-in help text for each command now shows default values for each option, where applicable.
v0.3.3
Release v0.3.0:
- Enable batch to be run without specifying tumor samples, in order to only create a reference.
- Copy ratios are now re-centered at the smoothed "mode" (peak density) rather than median, for better behavior on samples with many large-scale losses.
- Minor fixes and improvements to several safety checks in response to feedback from users.
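The idea behind mode-centering can be sketched with a small Gaussian kernel density estimate. This is an assumption-laden illustration, not CNVkit's code: the function names, the fixed bandwidth of 0.1, and the grid search for the density peak are all choices made for the example.

```python
import math
import statistics


def smoothed_mode(values, bandwidth=0.1, grid_size=512):
    """Location of the peak of a Gaussian kernel density estimate."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    step = (hi - lo) / (grid_size - 1)
    best_x, best_density = lo, -1.0
    for i in range(grid_size):
        x = lo + i * step
        # Unnormalized density: the constant factor does not move the peak.
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def center_at_mode(values, **kwargs):
    """Shift values so the smoothed mode sits at 0."""
    mode = smoothed_mode(values, **kwargs)
    return [v - mode for v in values]


# Many bins with losses of varying depth, plus a tight cluster of neutral bins.
losses = [-1.2, -1.05, -0.9, -0.75, -0.6, -0.45, -0.3]
neutral = [-0.02, -0.01, 0.0, 0.0, 0.01, 0.02]
log2_ratios = losses + neutral

median = statistics.median(log2_ratios)
mode = smoothed_mode(log2_ratios)
centered = center_at_mode(log2_ratios)
```

In this toy sample more than half of the bins are lost, so the median lands on a loss value and centering on it would make the neutral bins look like gains. The density peak stays with the tight neutral cluster.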