Motivation: Due to the decreasing costs of genome sequencing researchers are increasingly assessing the genome information of non-model organisms using Next Generation Sequencing technologies. An important question to answer when assessing a new genome sequence is: What is the ploidy level of the organism under study?
Results: We have developed ploidyNGS
, a model-free, open source tool to visualize and explore ploidy levels in a newly sequenced genome, exploiting short read data. We tested ploidyNGS
using both simulated and real NGS data of the model yeast Saccharomyces cerevisiae. ploidyNGS
allows the identification of the ploidy level of a newly sequenced genome in a visual way. We have applied ploidyNGS
to a wide range of different genome ploidy and heterozigosity levels, as well as to a range of sequencing depths.
Availability and implementation: ploidyNGS
is available under the GNU General Public License (GPL) at https://github.com/diriano/ploidyNGS. ploidyNGS
is implemented in Python and R.
- python >=3
- pysam >=0.19.1
- biopython >=1.3.9
- R and ggplot2
cd ~
git clone https://github.com/diriano/ploidyNGS.git
Make sure pip is installed. If not, then please check your distribution's documentation and install it. In Ubuntu 20.04.4 LTS you can do:
sudo apt install python-pip
Install the virtualenv module:
pip install --upgrade pip
pip install virtualenv
If you do not want, or you cannot, install these two modules globally, you could run:
pip install --user --upgrade pip
pip install --user virtualenv
Go to the ploidyNGS folder, create and activate a new virtual environment
cd ploidyNGS
virtualenv .venv
source .venv/bin/activate
Install dependencies:
Use pip
and the requirements.txt
file:
pip install -r requirements.txt
At this point you can deactive the python virtual environment:
deactivate
In Ubuntu 16.04 LTS:
sudo apt install r-base
sudo apt install r-cran-ggplot2
In order to use ploidyNGS
please start by activating the python virtual environment that you created before:
cd ~
cd ploidyNGS
source .venv/bin/activate
And then run:
./ploidyNGS.py -o diploidTest -b test_data/simulatedDiploidGenome/Ploidy2.bowtie2.sorted.bam
This should print the following in your screen:
###############################################################
## This is ploidyNGS version v3.0.0
## Current date and time: Sun Feb 26 12:26:05 2017
###############################################################
No index available for pileup. Creating an index...
Getting the number of mapped reads from BAM
460400
And should generate the files:
- diploidTest_depth100.tab
- diploidTest_depth100.tab.PloidyNGS.pdf
The PDF file should have a histogram identical to this https://github.com/diriano/ploidyNGS/tree/master/images/diploidTest_depth100.tab.PloidyNGS.png
After running ploidyNGS
do not forget to deactivate your python virtual environment:
deactivate
Active the python virtual environment, created above, before using ploidyNGS
:
cd ~/ploidyNGS/
source .venv/bin/activate
The switch -h
will give you access to the full help page:
./ploidyNGS.py -h
###############################################################
## This is ploidyNGS version v3.0.0
## Current date and time: Sun Feb 26 12:26:05 2017
###############################################################
usage: ploidyNGS.py [-h] [-v] -o file -b mappingGenome.bam [-m 0.95 default)]
[-d 100 (default] [-g] [-c {15,25,50,100}]
ploidyNGS: Visual exploration of ploidy levels
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-o file, --out file Base name for a TAB file that will keep the allele
counts
-b mappingGenome.bam, --bam mappingGenome.bam
BAM file used to get allele frequencies
-m 0.95 (default), --max_allele_freq 0.95 (default)
Fraction of the maximum allele frequency (float betwen
0 and 1, default: 0.95)
-d 100 (default), --max_depth 100 (default)
Max number of reads kepth at each position in the
reference genome (integer, default: 100)
-u 0 (default), --min_cov 0 (default)
Minimum coverage required to use a position (integer, default: 0)
-g, --guess_ploidy Try to guess ploidy level by comparison with simulated
data
-c {15,25,50,100}, --coverage_guess_ploidy {15,25,50,100}
This parameter will be automatically set based on the
coverage of your BAM file, however you can overrride
it.
-u,--min_cov Minimum coverage required to use a position
(integer, default: 0)
There are two required parameters: the input BAM file (-b
or --bam
) and a string (-o
or --out
) that will be used to generate the output files.
Additional parameters help you control some of the behavior of ploidyNGS
:
-m
or--max_allele_freq
: Float between 0 and 1. Default 0.95. Ignore positions where the frequencey of the most abundant allele is higher or equal thanmax_allele_freq
-d
or--max_depth
: Integer. Default 100. Maximum sequencing depth to consider, e.g., if d=100, then only the first 100 mapped reads will be examined.-g
pr--guess_ploidy
: Boolean. If present,ploidyNGS
will try to guess the ploidy level of your data by comparing your allele frequency distribution to our simulated data, and computing the Kolmogorov-Smirnov distance, this will generate a table with the suffix: .ks-distance.PloidyNGS.tbl, alsoploidyNGS
will report on the screen the ploidy of the most similar distribution in the simulated data.-c
or--coverage_guess_ploidy
: One of 15, 25, 50, 100. When you select the option --guess_ploidy your sample will be compared with simulated datasets of a given coverage.ploidyNGS.py
will automatically select which coverage value to use base on the actual observed average coverage in your sample. You can override this behaviour by using --coverage_guess_ploidy and one of the values 15, 25, 50 or 100, ie., the 4 subsets of simulated data that we provide, that correspond to different sequencing depths/coverages.
We have prepared four test datasets, representing ploidy levels from 1 to 4, that you can use to try ploidyNGS
. These data are available in the folder test_data:
- HaploidGenome
- simulatedDiploidGenome
- simulatedTetraploidGenome
- simulatedTriploidGenome
ploidyNGS.py
only requires a BAM and a base name for the output files. We will show you how to use it using two of the pre-generated test datasets:
First using the HaploidGenome data:
cd ~/ploidyNGS
source .venv/bin/activate
mkdir myTest
./ploidyNGS.py --out myTest/DataTestPloidy1 --bam test_data/HaploidGenome/Ploidy1.bowtie2.sorted.bam
deactivate
The script outputs two files:
DataTestPloidy2.tab_depth100.tab
, the table with the frequencies of allele proportions in heterozygous sites. It has the four columns: Sequence name, Position, Type of allele (First, Second, Thirds and Fourth) ordered by abundance, and the frequencing of the allele.DataTestPloidy2.tab_depth100.tab.PloidyNGS.pdf
, the histogram of allele frequencies. This is your main tool to decide on the ploidy level of yur organism.
Compare your file DataTestPloidy1.tab_depth100.tab.PloidyNGS.pdf it should be pretty similar to test_data/ploidyNGS_results/DataTestPloidy1.tab.ExplorePloidy.png
What you should see in that histogram is one peak around 95% for the series labelled First, which is the most common allele, this means that the most frequent was represented by 95% or more of the reads in most positions, which is compatible with a haploid genome. The series labelled Second, should have a peak around 5%, this means that in hereromorphic positions, the second most frequent allele have around 5% of the reads, these most likely represent sequencing errors.
In case you want ploidyNGs to try to guess the ploidy level in your sample, you should use the argument --guess_ploidy, see:
cd ~/ploidyNGS
source .venv/bin/activate
mkdir myTest
./ploidyNGS.py --guess_ploidy --out myTest/DataTestPloidy1_guessPloidy --bam test_data/HaploidGenome/Ploidy1.bowtie2.sorted.bam
deactivate
This will print the following in your screen:
###############################################################
## This is ploidyNGS version v3.1.3
## Current date and time: Tue Jun 28 14:52:25 2022
###############################################################
BAM index present... OK!
Number of mapped reads from BAM: 206062
Observed average coverage: 65.00
Coverage used for guessing ploidy: 50
After comparing your data with our simulated dataset
and computing the Kolmogorov-Smirnov distance,
the closest ploidy to yours is 1
You will also find the file myTest/DataTestPloidy1_guessPloidy_depth100.ks-distance.PloidyNGS.tbl
, that has a table with the comparison of your allele frequency distribution to all the simulated datasets with 100x coverage (this is the default coverage for comparison, you can change it with the argument --coverage_guess_ploidy).
Now, let's have a look into a diploid dataset:
cd ~/ploidyNGS
source .venv/bin/activate
./ploidyNGS.py --out myTest/DataTestPloidy2.tab --bam test_data/simulatedDiploidGenome/Ploidy2.bowtie2.sorted.bam --guess_ploidy
deactivate
Compare your file DataTestPloidy2.tab_depth100.tab.PloidyNGS.pdf it should be pretty similar to test_data/ploidyNGS_results/DataTestPloidy2.tab.ExplorePloidy.png
You should see two peaks in the histogram, both located around 50%. This means that both the most frequent and second most frequence alleles, each, have around 50% of the reads in the heteromorphic positions. This result is compatible with a diploid genome.
Try running ploidyNGS for the simulated triploid and tetraploid datasets. Read the following section for more details on how to interpret the histogram generated by ploidyNGS
.
The original idea with ploidyNGS
was to have a simple, model-free, visualization of the frequency of alternative alleles on a newly sequenced genome, using short reads next generation sequencing. Other software had tried to implement proper ploidy level estimation, but this usually requires complex models, that could be very sensitive to deviation of their assumptions (See for instance: Bao et al., 2014; Margarido & Heckerman, 2015; Yu et al., 2014). One of our main goals was to provide a model-free visualization of the polymorphisms found in a newly sequenced genome, and let the user decide about the ploidy level, based on visual comparisons with simulated data also provided by ploidyNGS
. We understand that inferring ploidy level could be a valuable feature for end-users, and we have implemented a simple routine to compare the profile generated from the user’s data to the sets of simulated profile distributions for different ploidy and heterozygosity levels and sequencing coverages. In this way ploidyNGS
can try to guess the ploidy level of the user provided data. However, there are several factors impacting on the polymorphism profile distribution, and the ultimate decision about ploidy level relies on the inspection of the ploidyNGS
generated graph by the user, and not ploidyNGS
's guess, this is highlighted in the output of ploidyNGS
. The routine that we have implemented in ploidyNGS computes the Kolmogorov-Smirnov distance between pairs of distributions (user vs simulated) and suggests the ploidy of the distribution with the smallest distance.
For the interpretation of the resulting plots you can either use our simulation data (https://github.com/diriano/ploidyNGS/blob/master/simulation/results/results_ploidy_simulations.html) or the following table. Briefly, in ploidyNGS
's histograms, the position of the peaks on the X-axis is directly related to the ploidy level of the organism under study. For instance, if you observe peaks for the two most frequent alleles close to 50%, you have a diploid organism (see DataTestPloidy2.tab.ExplorePloidy.pdf
). The values on the X-axis represent the fraction of reads representing each of the 4 possible alleles.
Ploidy level | Second most frequent allele (expected proportion) | Most frequence allele (expected proportion) |
---|---|---|
Diploid | 50 | 50 |
Triploid | 33.33 | 66.67 |
Tetraploid | 25 | 75 |
Tetraploid | 50 | 50 |
Pentaploid | 20 | 80 |
Pentaploid | 40 | 60 |
Hexaploid | 16.67 | 83.33 |
Hexaploid | 50 | 50 |
Hexaploid | 33.33 | 66.67 |
Heptaploid | 28.57 | 71.43 |
Heptaploid | 42.86 | 57.14 |
Heptaploid | 14.29 | 85.71 |
When using the parameter --guess_ploidy, ploidyNGS.py
will try to guess the ploidy in your sample by comparing the pattern in which first and second most frequent alleles appear to the same patterns of simulated data of known ploidy. For this we have generated simulated data at different coverages, i.e., 15, 25, 50 and 100. ploidyNGS.py
will select the closest coverage to yours, in the following way:
- If the average coverage in your sample is smaller than 20, your data will be compared against datasets of coverage 15.
- If you coverage is >=20 and < 37.5, then it will compared against datasets of coverage 25.
- If you coverage is >=37.5 and < 75, then it will compared against datasets of coverage 50.
- If you coverage is >=75, then it will compared against datasets of coverage 100.
The Kolmogorov-Smirnov will be computed between pairs of datasets, i.e., you dataset vs one of the simulated dataset of a given coverage and ploidy. The comparison with the smallest distance will be reported as the possible ploidy of your sample. We advise that you always look at the PDF generated by ploidyNGS.py
, and do not rely only on the guess that we provide. If you find any inconsistencies please contact us.
ploidyNGS
was developed to assess ploidy level in small genomes, of a few tens Mbp. If your genome is larger than that we recommend that you:
a) look at individual chromosomes. For this you can use bamtools, check: https://www.biostars.org/p/46327/.
b) look at a region of a single chromosome.
c) Look at regions of at least 1Mbp.
The running time and memory usage of ploidyNGS
depends on the number of reads in the BAM file, i.e., the sequencing depth. We have seen that a sequencing depth of 100x is enough to get a good idea of the ploidy level. So if you have sequenced your genome to a larger depth, please sub-sample you data before creating the BAM file for ploidyNGS
. Alternatively you could use the parameter -d
in ploidyNGS
to look at the d
first mapped reads, however pysam will still load the full BAM file.
Margarido, G. R. A. and Heckerman, D. (2015). ConPADE: Genome assembly ploidy estimation from next-generation sequencing data. PLoS Comput. Biol., 11(4), e1004229.
Bao, L., Pu, M., and Messer, K. (2014). AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from nextgeneration sequencing data. Bioinformatics, 30(8), 1056–1063.
Yu, Z., Liu, Y., Shen, Y.,Wang, M., and Li, A. (2014). CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data. Bioinformatics, 30(18), 2576–2583.