Skip to content

This pipeline reconstructs the composition of species from 16S amplicon sequencing datasets.

License

Notifications You must be signed in to change notification settings

PombertLab/ID16S

Repository files navigation

ID16S – 16S amplicon identification

DOI

Table of contents

Introduction

The ID16S pipeline reconstructs the composition of bacterial species from 16S amplicon sequences. Inferences are derived from BLAST homology searches against the NCBI 16S ribosomal RNA (rRNA) database. Raw and normalized results based on rRNA copy variation (calculated using a custom database built from NCBI RefSeq bacterial genomes) are calculated, then plotted with matplotlib. For a good overview of rRNA copy variation, please see this excellent review by Lavrinienko et al.

The ID16S pipeline was tested on Nanopore 1D reads obtained with the 16S Barcoding Kit (SQK-RAB204). Identification accuracy parallels that of the Nanopore sequencing reads. This pipeline should work on all 16S datasets but full length 16S amplicons are preferable.

Note that for large datasets, BLAST homology searches will take a while to complete, even with the megablast algorithm. People interested in faster tools should look at Kraken2. The later is based on kmers and is much faster than BLAST approaches but produces a lower recall with Nanopore reads due to their lower acccuracy. For an excellent comparison of the recall rate from nanopore reads with BLAST and Kraken2, see this paper by Pearman et al.

Dependencies

Getting started

Installing ID16S

The ID16S pipeline can be downloaded from GitHub with git:

git clone https://github.com/PombertLab/ID16S.git && cd ID16S

The ID16S pipeline expects two environment variables: $ID16S_HOME and $ID16S_DB. To setup ID16S and its environment variables:

DB=~/ID16S_DB           ## Replaced by desired database path
CFG=~/ID16S.sh          ## Replaced by desired configuration file

setup_ID16S.pl \
  -d $DB \
  -c $CFG

source $CFG

Downloading databases

A total of 3 NCBI datasets are required for the ID16S pipeline:

  1. The NCBI 16S ribosomal RNA database - 16S_ribosomal_RNA.tar.gz
  2. The NCBI Taxonomy database - taxdb.tar.gz
  3. The NCBI Taxonomy dumps - taxdump.tar.gz

IDS16 also uses a custom database that normalizes rRNA copy numbers to account for differences in rRNA gene copies accross lineages. This normalization database is generated from the reference genomes found in NCBI RefSeq. The ID16S pipeline comes with a precompiled normalization database (last updated May 30th 2024). If desired, an updated version can be created from the command line (see below).

The required databases can be downloaded as follows:

## To download the databases to the folder set by the $ID16S_DB variable
download_ID16S_dbs.pl -d

## To download the databases to a specified folder
download_ID16S_dbs.pl -d -o ./ID16S_DB    ## Replace by desired location

The -c (--create) flag can be used to create a new, up-to-date rRNA normalization database. However, note that creating this database anew will take time and a substantial amount of disk space (as of May 30th 2024: 40,853 RefSeq genomes; 125GB Gb download; 410G Gb unpacked).

To download the NCBI databases and create a new, up-to-date rRNA normalization database (to a specified directory):

OUTDIR=~/ID16S_DB                ## Replace by desired destination directory

download_ID16S_dbs.pl \
  --download \
  --outdir $OUTDIR \
  --create

Running ID16S

The ID16S pipeline consists of a few simple steps:

  1. It converts FASTQ files to FASTA format with fastq2fasta.pl.
  2. It performs homology searches against the NCBI 16S rRNA database with megablast.pl.
  3. It summarizes the taxonomic composition of the datasets with taxid_dist.pl.
  4. It plots the species composition inferred within each sample with matplotlib.

The ID16S pipeline can be run via its run_ID16S.pl master script. To run ID16S, provide run_ID16S.pl with the desired FASTA/Q files and output directory:

$OUTDIR=~/RESULTS                ## Replace dy desired output directory

run_ID16S.pl \
  -fa *.fasta \
  -fq *.fastq.gz \
  -o $OUTDIR

Options for run_ID16S.pl are:

-o  (--outdir)      Output directory [Default = ./ID16S]
-fa (--fasta)       FASTA files to run
-fq (--fastq)       FASTQ files to convert then run
-hd (--headcrop)    Remove the first X nucleotides from 5' end of FASTQ sequences ## Useful for Nanopore data
-m  (--min_length)  Minimum read length to keep from FASTQ files [Default: 1000]
-d  (--db)          Path to ID16S databases [Default = \$ID16S_DB]
-v  (--version)     Show script version
--replot            Replot figures with different parameters (skip homology searches)

# BLAST
-k  (--tasks)       megablast, dc-megablast, blastn [default = megablast]
-t  (--threads)     CPUs to use [default = 10]
-cu (--culling)     Culling limit [default = 10]
-h  (--hits)        Number of hits to return [Default = 1]
-pe (--p_evalue)    Preliminary e-value cutoff for BLAST results [Default = 1e-05]

# Output
-r  (--ranks)       Output files by taxonomic ranks [Default: species genus family order class]
-fe (--f_evalue)    Final e-value cutoff for BLAST results [Default = 1e-75]
-co (--concat)      Concatenate all results into a single file [Default: off]

# Plots
--height            Figure height in inches [Default: 10]
--width             Figure width in inches [Default: 10]
--fontsize          Figure fontsize [Default: 12]
--legend            Legend label [Default: NonNormalized]
--cutoff            Hide taxons below specified threshold [Default = 0.01]
--format            Output formats: png, eps, jpeg, pdf, svg, and/or tiff [Default: png svg]

Running ID16S - Example

Two small 16S nanopore amplicon samples are provided in the Example/ subdirectory as GZIP-compressed FASTQ files. Sample 1 is a Mycoplasma sample whereas sample 2 is from Staphylococcus.

To reconstruct the 16S amplicon composition within each sample with ID16S:

OUTDIR=~/RESULTS          ## Replace by desired destination directory

run_ID16S.pl \
  -fq $ID16S_HOME/Example/*.fastq.gz \
  -o $OUTDIR

The content of the results directory will look like this:

ls -la $OUTDIR

drwxr-xr-x 2 jpombert jpombert 4096 May 31 11:05 BLAST
drwxr-xr-x 2 jpombert jpombert 4096 May 31 11:05 FASTA
drwxr-xr-x 2 jpombert jpombert 4096 May 31 11:05 NonNormalized
drwxr-xr-x 3 jpombert jpombert 4096 May 31 11:06 Normalized
drwxr-xr-x 4 jpombert jpombert 4096 May 31 11:06 Plots

The contents of the subdirectories are:

  • BLAST: Results from BLAST homology searches
  • FASTA: FASTA files and/or FASTQ files (converted to FASTA) entered from the command line prompt
  • NonNormalized: Raw rRNA counts and distributions
  • Normalized: rRNA counts and distributions normalized for rRNA copy number variations
  • Plots: Matplotlib distibution plots (at various taxonomic ranks) in PNG/SVG format for each sample investigated

The non-normalized results can be found in the NonNormalized/ directory, and will look similar to:

head -n 8 $OUTDIR/NonNormalized/sample_1*

==> RESULTS/NonNormalized/sample_1.fasta.megablast.family <==
family  TaxID   Number  Percent (total = 814)
Metamycoplasmataceae    2895623 814     100.00%

==> RESULTS/NonNormalized/sample_1.fasta.megablast.genus <==
genus   TaxID   Number  Percent (total = 814)
Mycoplasmopsis  2767358 806     99.02%
Metamycoplasma  2895509 8       0.98%

==> RESULTS/NonNormalized/sample_1.fasta.megablast.order <==
order   TaxID   Number  Percent (total = 814)
Mycoplasmoidales        2790996 814     100.00%

==> RESULTS/NonNormalized/sample_1.fasta.megablast.species <==
species TaxID   Number  Percent (total = 814)
Mycoplasmopsis fermentans       2115    663     81.45%
Mycoplasmopsis arginini 2094    126     15.48%
Mycoplasmopsis caviae   55603   13      1.60%
Metamycoplasma canadense        29554   2       0.25%
Metamycoplasma gateae   35769   2       0.25%
Metamycoplasma auris    51363   2       0.25%
Mycoplasmopsis hyopharyngis     29558   2       0.25%
head -n 8 $OUTDIR/NonNormalized/sample_2*

==> RESULTS/NonNormalized/sample_2.fasta.megablast.class <==
class   TaxID   Number  Percent (total = 751)
Bacilli 91061   751     100.00%

==> RESULTS/NonNormalized/sample_2.fasta.megablast.family <==
family  TaxID   Number  Percent (total = 751)
Staphylococcaceae       90964   740     98.54%
Bacillaceae     186817  8       1.07%
Aerococcaceae   186827  1       0.13%
Enterococcaceae 81852   1       0.13%
Listeriaceae    186820  1       0.13%

==> RESULTS/NonNormalized/sample_2.fasta.megablast.genus <==
genus   TaxID   Number  Percent (total = 751)
Staphylococcus  1279    738     98.27%
Bacillus        1386    4       0.53%
Aerococcus      1375    1       0.13%
Oceanobacillus  182709  1       0.13%
Anaerobacillus  704093  1       0.13%
Sutcliffiella   2837511 1       0.13%
Vagococcus      2737    1       0.13%

==> RESULTS/NonNormalized/sample_2.fasta.megablast.order <==
order   TaxID   Number  Percent (total = 751)
Bacillales      1385    749     99.73%
Lactobacillales 186826  2       0.27%

==> RESULTS/NonNormalized/sample_2.fasta.megablast.species <==
species TaxID   Number  Percent (total = 751)
Staphylococcus lloydii  2781774 581     77.36%
Staphylococcus durrellii        2781773 86      11.45%
Staphylococcus kloosii  29384   29      3.86%
Staphylococcus hominis  1290    12      1.60%
Staphylococcus succinus 61015   6       0.80%
Staphylococcus edaphicus        1955013 3       0.40%
Staphylococcus taiwanensis      2750012 2       0.27%

The normalized results can be found in the Normalized/ directory, and will look similar to:

head -n 8 $OUTDIR/Normalized/sample_1*

==> RESULTS/Normalized/sample_1_fasta_megablast_family_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Metamycoplasmataceae    2895623 family  100.00  100.00  0.00

==> RESULTS/Normalized/sample_1_fasta_megablast_genus_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Mycoplasmopsis  2767358 genus   99.02   99.02   0.00
Metamycoplasma  2895509 genus   0.98    0.98    0.00

==> RESULTS/Normalized/sample_1_fasta_megablast_order_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Mycoplasmoidales        2790996 order   100.00  100.00  0.00

==> RESULTS/Normalized/sample_1_fasta_megablast_species_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Mycoplasmopsis fermentans       2115    species 81.45   89.78   8.33
Mycoplasmopsis arginini 2094    genus   15.48   8.53    -6.95
Mycoplasmopsis caviae   55603   genus   1.60    0.88    -0.72
Metamycoplasma gateae   35769   genus   0.25    0.14    -0.11
Metamycoplasma canadense        29554   genus   0.25    0.14    -0.11
Mycoplasmopsis hyopharyngis     29558   genus   0.25    0.14    -0.11
Metamycoplasma auris    51363   genus   0.25    0.14    -0.11
head -n 8 $OUTDIR/Normalized/sample_2*

==> RESULTS/Normalized/sample_2_fasta_megablast_class_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Bacilli 91061   class   100.00  100.00  0.00

==> RESULTS/Normalized/sample_2_fasta_megablast_family_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Staphylococcaceae       90964   family  98.54   98.49   -0.05
Bacillaceae     186817  family  1.07    0.93    -0.14
Aerococcaceae   186827  family  0.13    0.23    0.10
Listeriaceae    186820  family  0.13    0.19    0.06
Enterococcaceae 81852   family  0.13    0.16    0.03

==> RESULTS/Normalized/sample_2_fasta_megablast_genus_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Staphylococcus  1279    genus   98.27   98.09   -0.18
Bacillus        1386    genus   0.53    0.47    -0.06
Aerococcus      1375    genus   0.13    0.23    0.10
Salinicoccus    45669   genus   0.13    0.23    0.10
Nosocomiicoccus 489909  genus   0.13    0.23    0.10
Listeria        1637    genus   0.13    0.19    0.06
Oceanobacillus  182709  genus   0.13    0.13    0.00

==> RESULTS/Normalized/sample_2_fasta_megablast_order_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Bacillales      1385    order   99.73   99.34   -0.39
Lactobacillales 186826  order   0.27    0.66    0.39

==> RESULTS/Normalized/sample_2_fasta_megablast_species_Normalized_Microbiome_Composition.tsv <==
###Organism Name        TaxID   Taxo Level      Non-normalized % of sample      Normalized % of sample  Delta
Staphylococcus lloydii  2781774 genus   77.36   77.09   -0.27
Staphylococcus durrellii        2781773 genus   11.45   11.41   -0.04
Staphylococcus kloosii  29384   genus   3.86    3.85    -0.01
Staphylococcus hominis  1290    species 1.60    1.59    -0.01
Staphylococcus succinus 61015   genus   0.80    0.80    0.00
Staphylococcus edaphicus        1955013 genus   0.40    0.40    0.00
Staphylococcus carnosus 1281    species 0.27    0.37    0.10

Examples of distribution boxplots generated by ID16S:

In the plots below, the relative proportions within the samples at each taxonomic rank, i.e. species, genus, family, and so forth, are plotted from the normalized (shown as boxes) and non-normalized (blue dots) rRNA counts.

Sample 1 (species level):

Sample 1 (genus level):

Sample 2 (species level):

Sample 2 (genus level):

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PMID: 2231712 DOI: 10.1016/S0022-2836(05)80360-2.

Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019 Nov 28;20(1):257. PMID: 31779668 PMCID: PMC6883579 DOI: 10.1186/s13059-019-1891-0.

Pearman WS, Freed NE, Silander OK Testing the advantages and disadvantages of short- and long- read eukaryotic metagenomics using simulated reads BMC Bioinformatics. 2020 May 29;21(1):220. PMID: 32471343 PMCID: PMC7257156 DOI: 10.1186/s12859-020-3528-4.

Lavrinienko A, Jernfors T, Koskimäki JJ, Pirttilä AM, Watts PC. Does Intraspecific Variation in rDNA Copy Number Affect Analysis of Microbial Communities? Trends Miccrobiol. 2021 Jan;29(1):19-27. PMID: 32593503 DOI: 10.1016/j.tim.2020.05.019.