Strain-informed gene content inference from shotgun metagenomes
The StrainPGC manuscript is currently under revisions. For now, please cite the BioRxiv preprint.
StrainPGC is a Python package. A Python version >=3.11 is recommended (although other versions may also work) StrainPGC can be installed directly from GitHub using pip:
git clone https://github.com/bsmith89/StrainPGC StrainPGC
cd StrainPGC
pip install -e . # Editable installation for development.
This will also install all Python software dependencies, which currently entail:
- pandas
- xarray
- netcdf4
- scipy
It is also recommended to install StrainPGC in an isolated Python environment (e.g. a conda env
).
The following example demonstrates running StrainPGC on a small input dataset.
From the root of the StrainPGC repository,
cd examples/core_example
Example input data for the core spgc
tool are provided for testing and
to demonstrate the correct file formats.
This includes three files:
pangenome_profile.depth.tsv.bz2
(matrix of sample-by-gene depths.)species_genes.list
(list of species-quantitative genes expected to be at 1x depth)strain_pure_samples.tsv
(mapping from strain-pure-samples to their strain identity)
With correctly formatted input files, like those provided, the StrainPGC method can be run as follows:
spgc run \
pangenome_profile.depth.tsv.bz2 \
species_genes.list \
strain_pure_samples.tsv \
spgc.results.nc
The output file, spgc.results.nc
, is an binary format which includes
a variety of statistics about genes and strains.
This file can be parsed directly using the XArray library in Python.
spgc dump_genes spgc.results.nc spgc.gene.tsv
spgc dump_strains spgc.results.nc spgc.strain.tsv
The two output files are:
spgc.gene.tsv
(estimated gene content)spgc.strain.tsv
(strain statistics)
Additional subcommands and options are described in the help:
spgc --help
spgc run --help
A complete StrainPGC workflow going from raw input reads to estimated gene
content has been implemented using Snakemake (see workflow/Snakefile
) and
incorporates metagenome preprocessing, MIDAS, GT-Pro, and StrainFacts.
The following example demonstrates running this workflow on a small number of example shotgun metagenomes.
NOTE: Running this example workflow requires a substantial amount of disk storage, memory, and CPU time. (see below.)
You'll need to install Snakemake (version >= 8.20).
By default, this workflow implementation leverages Snakemake's Apptainer (formerly Singularity) integration to download and run a pre-built container (https://hub.docker.com/r/bsmith89/strainpgc-wf) integrating StrainPGC and all other dependencies of the workflow.
If Apptainer is not available on your system, the same container can be used with Docker and includes a Snakemake installation. Alternatively you can install the necessary software directly to your system.
From the root of the StrainPGC repository,
cd examples/wf_example
Example input files for the workflow are provided for testing.
These raw metagenomes and the necessary GT-Pro reference data can be downloaded using included scripts as follows:
bash scripts/download_gtpro_refs.sh ref/gtpro
bash scripts/download_example_reads.sh raw
species=102506 # MIDAS/UHGG/GT-Pro species ID for E. coli
num_procs=12 # Number of CPU processes
snakemake --profile profiles/apptainer \
--configfile config.yaml \
--dry-run \
"results/species/$species/spgc.strain.tsv" \
"results/species/$species/spgc.gene.tsv"
species=102506 # MIDAS/UHGG/GT-Pro species ID for E. coli
num_procs=12 # Number of CPU processes
snakemake --profile profiles/apptainer \
--configfile config.yaml \
-j "$num_procs" \
"results/species/$species/spgc.strain.tsv" \
"results/species/$species/spgc.gene.tsv"
For different compute environments, a different
Snakemake profiles may be advisable.
For instance, the alternative profile in examples/wf_example/profiles/noenv
assumes
that all necessary software has been installed directly to your system.
The motivation and methodological details are fully described, benchmarked, and demonstrated in our manuscript.
Pangenome profiling methods harness shotgun metagenomics to identify gene families encoded in the genomes of individual strains. By using a reference database of genes or genomes from a given species, genes with sufficiently high mapping depth in a single sample are inferred to be encoded by that species in that sample.
This approach has proven fruitful, but has three major shortcomings when strain-specific gene content is desired:
- In samples with multiple strains, genes may not be assigned when they are missing from one or more of these genomes. due to lower depth relative to core genes.
- Gene assignment error is elevated in low-abundance species due to a lower signal-to-noise ratio in mapping depth.
- Assignments to individual species are ambiguous when gene families are found in the pangenomes of more than one species.
StrainPGC tackles these challenges by combining depth information across multiple samples. Specifically, StrainPGC:
- Combines depth across multiple samples to achieve a higher signal-to-noise ratio
- Also considers the correlation between gene depth and species depth, reducing the impact of cross-mapping of reads across multiple species
- Considers, one at a time, subsets of samples where a single strain is believed to be present in order to overcome the obscuring effects of strain mixing
The core StrainPGC method takes three inputs for each species:
- Pangenome profiles: a sample-by-gene matrix of mapping depths
- Species core genes: a list of genes believed to be found in single copy in every strain of the species
- Strain-pure samples: a mapping of samples to individual strains
A suggested protocol for obtaining each of these inputs directly from raw metagenomic data as part of the larger StrainPGC workflow is described below and implemented as a Snakemake pipeline.
The key result provided by StrainPGC is a strain-by-gene matrix indicating which genes are estimated to be present and which absent in the genomes of each of the strains.
A complete workflow can be divided into four phases:
- Metagenomic profiling, which includes both:
- SNP profiling for strain tracking
- Pangenome profiling
- Strain tracking / identification of strain-pure sets
- Running the StrainPGC algorithm
- Quality assessment / control
We have implemented a version of this workflow as a Snakemake pipeline. Applied to the example data, this pipeline can be visualized as the following graph of dependencies:
The StrainPGC workflow uses GT-Pro for SNP profiling, which captures metagenotypes across polymorphic positions found in the Unified Human Gut Genome reference database.
The StrainPGC workflow implements pangenome profiling against the MIDAS UHGG reference database gene clusters using using Bowtie2-based read mapping.
While other profiling tools may be used, excessive post-hoc filtering of mapping depth can be detrimental and we find competitive mapping to a reference index that includes multiple species reduces the issue of cross mapping of reads from other species.
This step is by far the most computationally intensive, dwarfing by far the runtime and memory requirements of all other steps.
The StrainPGC workflow estimates strain compositions in each sample based on SNP profiles using StrainFacts.
Sets of samples that are pure (or nearly pure) for a single strain are selected based on these estimates.
The quality of gene content assignments for each strain can be assessed post-hoc based on
- The fraction of species core genes assigned
- The noisyness of the normalized depths of strain genes
Each of these scores is calculated by StrainPGC and can be read off the table
of strain statistics obtained with spgc dump_strains
.
Strains with too few core genes (e.g. a species_gene_frac
< 95%) or too much
noise in gene depths (e.g. a log_selected_gene_depth_ratio_std
> 0.25) should
be removed from downstream analyses.
In addition, for the StrainPGC manuscript, we removed strains with fewer than 100 positions confidently genotyped by GT-Pro after StrainFacts partitioning and sample pooling.
The core StrainPGC tool is single-threaded and not computationally intensive: ~6 seconds of CPU time and ~200 MB of RAM for the quick-start example dataset.
However, the integrated workflow requires a substantial amount of storage, memory, and CPU time, in particular due to the requirements for MIDAS.
We used Snakemake to benchmark the resource utilization on our servers using the example described above.
NOTE: After running the example workflow, benchmark results can be collected with the following BASH code.
(
printf 'step\ts\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\tcpu_time\n'
find benchmark/ -type f | xargs -I% awk -v OFS='\t' -v file=% 'NR>1 {print file,$0}' %
) > benchmark.tsv`
To a first approximation, resource requirements include:
- Disk (Between 160 and 200 GB)
- Reference files (downloaded in Step 2, below) total > 50 GB.
- The portion of the UHGG MIDASDB used here (74 species profiled downloaded as one step in the integrated workflow for the ) is > 20 GB.
- The raw shotgun metagenomes downloaded from the HMP2 FTP server totaled > 6 GB.
- Intermediate files generated by the workflow totaled > 70 GB. Some of
these intermediates can be automatically deleted after they've served
their purpose. (They're marked with
temp(...)
in the Snakefile. However, note that the provided Snakemake profiles all include the--notemp
flag for testing purposes.) - Downloaded apptainer containers total > 5 GB
- RAM (~90 GB at peak)
- Building the MIDAS Bowtie2 index with 48 threads required ~90 GB at peak.
- Running MIDAS pangenome profiling required ~32 GB at peak. However, about half of this memory can be shared across samples thanks to memory mapping.
- GT-Pro required ~55 GB at peak for each sample, although almost all of this can be shared across samples.
- Memory requirements for other steps were generally negligible by comparison.
- CPU Time
- In particular, building the Bowtie2 index takes a long time: about ~60 hours of CPU time partially parallelized over 48 processes for a total walltime of ~130 minutes.
- Running MIDAS pangenome profiling for the largest sample (~13 million reads) took ~3.5 hours of CPU time. This was efficiently split across four processes for a total of ~45 minutes of walltime.
- CPU requirements for other steps were generally negligible by comparison.
Using an editable installation with pip install -e
is the best (and easiest)
way to get started on the StrainPGC codebase.
It should be possible to also use the pre-built Docker container by binding the
root of the StrainPGC repository to /src/StrainPGC
in that container.
For instance, the example in the StrainPGC-wf quick-start can be run as follows:
species=102506 # MIDAS/UHGG/GT-Pro species ID for E. coli
num_procs=12 # Number of CPU processes
snakemake --profile profiles/apptainer \
--apptainer-args "--bind /path/to/StrainPGC:/src/StrainPGC"
-j "$num_procs" \
--configfile config.yaml \
"results/species/$species/spgc.strain.tsv" \
"results/species/$species/spgc.gene.tsv"
Replacing /path/to/StrainPGC
with the path to your StrainPGC
code repository.
In this way, edits to the Python package will be reflected in the Snakemake execution.
docker build -t bsmith89/strainpgc-wf -t bsmith89/strainpgc-wf:latest -f workflow/envs/Dockerfile .
docker run -it --rm -v $(pwd)/../..:$(pwd)/../.. bsmith89/strainpgc-wf:latest
docker push bsmith89/strainpgc-wf