The ID16S pipeline reconstructs the composition of bacterial species from 16S amplicon sequences. Inferences are derived from BLAST homology searches against the NCBI 16S ribosomal RNA (rRNA) database. Raw and normalized results based on rRNA copy variation (calculated using a custom database built from NCBI RefSeq bacterial genomes) are calculated, then plotted with matplotlib. For a good overview of rRNA copy variation, please see this excellent review by Lavrinienko et al.
The ID16S pipeline was tested on Nanopore 1D reads obtained with the 16S Barcoding Kit (SQK-RAB204). Identification accuracy parallels that of the Nanopore sequencing reads. This pipeline should work on all 16S datasets but full length 16S amplicons are preferable.
Note that for large datasets, BLAST homology searches will take a while to complete, even with the megablast algorithm. People interested in faster tools should look at Kraken2. The later is based on kmers and is much faster than BLAST approaches but produces a lower recall with Nanopore reads due to their lower acccuracy. For an excellent comparison of the recall rate from nanopore reads with BLAST and Kraken2, see this paper by Pearman et al.
- Perl 5 and its PerlIO::gzip module
- Python 3 and matplotlib
- BLAST+
Installing ID16S
The ID16S pipeline can be downloaded from GitHub with git
:
git clone https://github.com/PombertLab/ID16S.git && cd ID16S
The ID16S pipeline expects two environment variables: $ID16S_HOME
and $ID16S_DB
. To setup ID16S and its environment variables:
DB=~/ID16S_DB ## Replaced by desired database path
CFG=~/ID16S.sh ## Replaced by desired configuration file
setup_ID16S.pl \
-d $DB \
-c $CFG
source $CFG
Downloading databases
A total of 3 NCBI datasets are required for the ID16S pipeline:
- The NCBI 16S ribosomal RNA database - 16S_ribosomal_RNA.tar.gz
- The NCBI Taxonomy database - taxdb.tar.gz
- The NCBI Taxonomy dumps - taxdump.tar.gz
IDS16 also uses a custom database that normalizes rRNA copy numbers to account for differences in rRNA gene copies accross lineages. This normalization database is generated from the reference genomes found in NCBI RefSeq. The ID16S pipeline comes with a precompiled normalization database (last updated May 30th 2024). If desired, an updated version can be created from the command line (see below).
The required databases can be downloaded as follows:
## To download the databases to the folder set by the $ID16S_DB variable
download_ID16S_dbs.pl -d
## To download the databases to a specified folder
download_ID16S_dbs.pl -d -o ./ID16S_DB ## Replace by desired location
The -c (--create)
flag can be used to create a new, up-to-date rRNA normalization database. However, note that creating this database anew will take time and a substantial amount of disk space (as of May 30th 2024: 40,853 RefSeq genomes; 125GB Gb download; 410G Gb unpacked).
To download the NCBI databases and create a new, up-to-date rRNA normalization database (to a specified directory):
OUTDIR=~/ID16S_DB ## Replace by desired destination directory
download_ID16S_dbs.pl \
--download \
--outdir $OUTDIR \
--create
The ID16S pipeline consists of a few simple steps:
- It converts FASTQ files to FASTA format with fastq2fasta.pl.
- It performs homology searches against the NCBI 16S rRNA database with megablast.pl.
- It summarizes the taxonomic composition of the datasets with taxid_dist.pl.
- It plots the species composition inferred within each sample with matplotlib.
The ID16S pipeline can be run via its run_ID16S.pl master script. To run ID16S, provide run_ID16S.pl with the desired FASTA/Q files and output directory:
$OUTDIR=~/RESULTS ## Replace dy desired output directory
run_ID16S.pl \
-fa *.fasta \
-fq *.fastq.gz \
-o $OUTDIR
Options for run_ID16S.pl are:
-o (--outdir) Output directory [Default = ./ID16S]
-fa (--fasta) FASTA files to run
-fq (--fastq) FASTQ files to convert then run
-hd (--headcrop) Remove the first X nucleotides from 5' end of FASTQ sequences ## Useful for Nanopore data
-m (--min_length) Minimum read length to keep from FASTQ files [Default: 1000]
-d (--db) Path to ID16S databases [Default = \$ID16S_DB]
-v (--version) Show script version
--replot Replot figures with different parameters (skip homology searches)
# BLAST
-k (--tasks) megablast, dc-megablast, blastn [default = megablast]
-t (--threads) CPUs to use [default = 10]
-cu (--culling) Culling limit [default = 10]
-h (--hits) Number of hits to return [Default = 1]
-pe (--p_evalue) Preliminary e-value cutoff for BLAST results [Default = 1e-05]
# Output
-r (--ranks) Output files by taxonomic ranks [Default: species genus family order class]
-fe (--f_evalue) Final e-value cutoff for BLAST results [Default = 1e-75]
-co (--concat) Concatenate all results into a single file [Default: off]
# Plots
--height Figure height in inches [Default: 10]
--width Figure width in inches [Default: 10]
--fontsize Figure fontsize [Default: 12]
--legend Legend label [Default: NonNormalized]
--cutoff Hide taxons below specified threshold [Default = 0.01]
--format Output formats: png, eps, jpeg, pdf, svg, and/or tiff [Default: png svg]
Two small 16S nanopore amplicon samples are provided in the Example/
subdirectory as GZIP-compressed FASTQ files. Sample 1 is a Mycoplasma sample whereas sample 2 is from Staphylococcus.
To reconstruct the 16S amplicon composition within each sample with ID16S:
OUTDIR=~/RESULTS ## Replace by desired destination directory
run_ID16S.pl \
-fq $ID16S_HOME/Example/*.fastq.gz \
-o $OUTDIR
The content of the results directory will look like this:
ls -la $OUTDIR
drwxr-xr-x 2 jpombert jpombert 4096 May 31 11:05 BLAST
drwxr-xr-x 2 jpombert jpombert 4096 May 31 11:05 FASTA
drwxr-xr-x 2 jpombert jpombert 4096 May 31 11:05 NonNormalized
drwxr-xr-x 3 jpombert jpombert 4096 May 31 11:06 Normalized
drwxr-xr-x 4 jpombert jpombert 4096 May 31 11:06 Plots
The contents of the subdirectories are:
- BLAST: Results from BLAST homology searches
- FASTA: FASTA files and/or FASTQ files (converted to FASTA) entered from the command line prompt
- NonNormalized: Raw rRNA counts and distributions
- Normalized: rRNA counts and distributions normalized for rRNA copy number variations
- Plots: Matplotlib distibution plots (at various taxonomic ranks) in PNG/SVG format for each sample investigated
The non-normalized results can be found in the NonNormalized/
directory, and will look similar to:
head -n 8 $OUTDIR/NonNormalized/sample_1*
==> RESULTS/NonNormalized/sample_1.fasta.megablast.family <==
family TaxID Number Percent (total = 814)
Metamycoplasmataceae 2895623 814 100.00%
==> RESULTS/NonNormalized/sample_1.fasta.megablast.genus <==
genus TaxID Number Percent (total = 814)
Mycoplasmopsis 2767358 806 99.02%
Metamycoplasma 2895509 8 0.98%
==> RESULTS/NonNormalized/sample_1.fasta.megablast.order <==
order TaxID Number Percent (total = 814)
Mycoplasmoidales 2790996 814 100.00%
==> RESULTS/NonNormalized/sample_1.fasta.megablast.species <==
species TaxID Number Percent (total = 814)
Mycoplasmopsis fermentans 2115 663 81.45%
Mycoplasmopsis arginini 2094 126 15.48%
Mycoplasmopsis caviae 55603 13 1.60%
Metamycoplasma canadense 29554 2 0.25%
Metamycoplasma gateae 35769 2 0.25%
Metamycoplasma auris 51363 2 0.25%
Mycoplasmopsis hyopharyngis 29558 2 0.25%
head -n 8 $OUTDIR/NonNormalized/sample_2*
==> RESULTS/NonNormalized/sample_2.fasta.megablast.class <==
class TaxID Number Percent (total = 751)
Bacilli 91061 751 100.00%
==> RESULTS/NonNormalized/sample_2.fasta.megablast.family <==
family TaxID Number Percent (total = 751)
Staphylococcaceae 90964 740 98.54%
Bacillaceae 186817 8 1.07%
Aerococcaceae 186827 1 0.13%
Enterococcaceae 81852 1 0.13%
Listeriaceae 186820 1 0.13%
==> RESULTS/NonNormalized/sample_2.fasta.megablast.genus <==
genus TaxID Number Percent (total = 751)
Staphylococcus 1279 738 98.27%
Bacillus 1386 4 0.53%
Aerococcus 1375 1 0.13%
Oceanobacillus 182709 1 0.13%
Anaerobacillus 704093 1 0.13%
Sutcliffiella 2837511 1 0.13%
Vagococcus 2737 1 0.13%
==> RESULTS/NonNormalized/sample_2.fasta.megablast.order <==
order TaxID Number Percent (total = 751)
Bacillales 1385 749 99.73%
Lactobacillales 186826 2 0.27%
==> RESULTS/NonNormalized/sample_2.fasta.megablast.species <==
species TaxID Number Percent (total = 751)
Staphylococcus lloydii 2781774 581 77.36%
Staphylococcus durrellii 2781773 86 11.45%
Staphylococcus kloosii 29384 29 3.86%
Staphylococcus hominis 1290 12 1.60%
Staphylococcus succinus 61015 6 0.80%
Staphylococcus edaphicus 1955013 3 0.40%
Staphylococcus taiwanensis 2750012 2 0.27%
The normalized results can be found in the Normalized/
directory, and will look similar to:
head -n 8 $OUTDIR/Normalized/sample_1*
==> RESULTS/Normalized/sample_1_fasta_megablast_family_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Metamycoplasmataceae 2895623 family 100.00 100.00 0.00
==> RESULTS/Normalized/sample_1_fasta_megablast_genus_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Mycoplasmopsis 2767358 genus 99.02 99.02 0.00
Metamycoplasma 2895509 genus 0.98 0.98 0.00
==> RESULTS/Normalized/sample_1_fasta_megablast_order_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Mycoplasmoidales 2790996 order 100.00 100.00 0.00
==> RESULTS/Normalized/sample_1_fasta_megablast_species_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Mycoplasmopsis fermentans 2115 species 81.45 89.78 8.33
Mycoplasmopsis arginini 2094 genus 15.48 8.53 -6.95
Mycoplasmopsis caviae 55603 genus 1.60 0.88 -0.72
Metamycoplasma gateae 35769 genus 0.25 0.14 -0.11
Metamycoplasma canadense 29554 genus 0.25 0.14 -0.11
Mycoplasmopsis hyopharyngis 29558 genus 0.25 0.14 -0.11
Metamycoplasma auris 51363 genus 0.25 0.14 -0.11
head -n 8 $OUTDIR/Normalized/sample_2*
==> RESULTS/Normalized/sample_2_fasta_megablast_class_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Bacilli 91061 class 100.00 100.00 0.00
==> RESULTS/Normalized/sample_2_fasta_megablast_family_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Staphylococcaceae 90964 family 98.54 98.49 -0.05
Bacillaceae 186817 family 1.07 0.93 -0.14
Aerococcaceae 186827 family 0.13 0.23 0.10
Listeriaceae 186820 family 0.13 0.19 0.06
Enterococcaceae 81852 family 0.13 0.16 0.03
==> RESULTS/Normalized/sample_2_fasta_megablast_genus_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Staphylococcus 1279 genus 98.27 98.09 -0.18
Bacillus 1386 genus 0.53 0.47 -0.06
Aerococcus 1375 genus 0.13 0.23 0.10
Salinicoccus 45669 genus 0.13 0.23 0.10
Nosocomiicoccus 489909 genus 0.13 0.23 0.10
Listeria 1637 genus 0.13 0.19 0.06
Oceanobacillus 182709 genus 0.13 0.13 0.00
==> RESULTS/Normalized/sample_2_fasta_megablast_order_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Bacillales 1385 order 99.73 99.34 -0.39
Lactobacillales 186826 order 0.27 0.66 0.39
==> RESULTS/Normalized/sample_2_fasta_megablast_species_Normalized_Microbiome_Composition.tsv <==
###Organism Name TaxID Taxo Level Non-normalized % of sample Normalized % of sample Delta
Staphylococcus lloydii 2781774 genus 77.36 77.09 -0.27
Staphylococcus durrellii 2781773 genus 11.45 11.41 -0.04
Staphylococcus kloosii 29384 genus 3.86 3.85 -0.01
Staphylococcus hominis 1290 species 1.60 1.59 -0.01
Staphylococcus succinus 61015 genus 0.80 0.80 0.00
Staphylococcus edaphicus 1955013 genus 0.40 0.40 0.00
Staphylococcus carnosus 1281 species 0.27 0.37 0.10
In the plots below, the relative proportions within the samples at each taxonomic rank, i.e. species, genus, family, and so forth, are plotted from the normalized (shown as boxes) and non-normalized (blue dots) rRNA counts.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PMID: 2231712 DOI: 10.1016/S0022-2836(05)80360-2.
Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019 Nov 28;20(1):257. PMID: 31779668 PMCID: PMC6883579 DOI: 10.1186/s13059-019-1891-0.
Pearman WS, Freed NE, Silander OK Testing the advantages and disadvantages of short- and long- read eukaryotic metagenomics using simulated reads BMC Bioinformatics. 2020 May 29;21(1):220. PMID: 32471343 PMCID: PMC7257156 DOI: 10.1186/s12859-020-3528-4.
Lavrinienko A, Jernfors T, Koskimäki JJ, Pirttilä AM, Watts PC. Does Intraspecific Variation in rDNA Copy Number Affect Analysis of Microbial Communities? Trends Miccrobiol. 2021 Jan;29(1):19-27. PMID: 32593503 DOI: 10.1016/j.tim.2020.05.019.