Skip to content

Factorize metagenotypes to infer strains and their abundances

License

Notifications You must be signed in to change notification settings

bsmith89/StrainFacts

Repository files navigation

StrainFacts

StrainFacts is a "strain deconvolution" tool, which infers both strain genotypes and their relative abundance across samples directly from metagenotype data.

A "metagenotype" for a particular species in a particular sample is an multidimensional array counting the number of reads (from shotgun metagenomic sequencing, usually) that contained a particular allele at polymorphic sites in that species's genome. Metagenotypes from multiple (maybe many) samples stacked together form the main input data-type analyzed by StrainFacts.

Citation

For detailed information, check out the paper:

B.J. Smith, X. Li, Z.J. Shi, A. Abate, K.S. Pollard. Scalable microbial strain inference in metagenomic data using StrainFacts. Frontiers in Bioinformatics (2022) doi:10.3389/fbinf.2022.867386

Installation

The simplest possible installation is with pip.

pip install StrainFacts

However, installing as a conda environment (described below) is probably the more robust option.

Installing PyTorch/Pyro with GPU support may be more challenging. Consider using a Docker/Singularity container like this one, which has many of the necessary prerequisites (but excluding StrainFacts itself).

Examples

Short vignettes explaining how to use StrainFacts for data processing, visualization, and evaluation are provided as Jupyter Notebooks:

Usage

(NOTE: All of the files described in those notebooks and the usage examples below can be built automatically with Make e.g. try running: make examples/sim.filt.fit.world.nc.)

Get help

Get CLI help:

sfacts --help  # List subcommands
sfacts fit --help  # Subcommand specific CLI usage

Fit a model to metagenotype data

Fit metagenotype data with 15 strains and default everything:

sfacts fit \
    --verbose \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.mgen.nc sim.filt.fit.world.nc

For the manuscript, StrainFacts was run with the following hyperparameters set explicitly:

sfacts fit -m ssdd3_with_error  \
    --verbose --device cpu \
    --precision 32 \
    --random-seed 0 \
    --strains-per-sample 0.3 \
    --num-positions 5000 \
    --hyperparameters gamma_hyper=1e-10 \
    --hyperparameters pi_hyper=0.3 \
    --hyperparameters rho_hyper=0.5 \
    --hyperparameters alpha_hyper_mean=10.0 \
    --hyperparameters alpha_hyper_scale=1e-06 \
    --anneal-hyperparameters gamma_hyper=1.0 \
    --anneal-hyperparameters rho_hyper=5.0 \
    --anneal-hyperparameters pi_hyper=1.0 \
    --anneal-steps 10000 --anneal-wait 2000 \
    --optimizer-learning-rate 0.05 \
    --min-optimizer-learning-rate 1e-06 \
    --max-iter 1_000_000 --lag1 50 --lag2 100 \
    examples/sim.filt.mgen.nc examples/sim.filt.fit.world.nc

In addition, values for --strains-per-sample/--num-strains, --num-positions, --precision, --random-seed, --device and input/output files were as described in the paper.

Data Formats

Intermediate files generated by StrainFacts use a NetCDF-based file format, and harness the xarray library for reading and writing1. Such files include input metagenotypes (suggested file extension *.mgen.nc), as well as more complex collections of both latent and observed variables aligned over their strain/sample/position axes (*.world.nc). All of these files are designed to be loaded and worked with using the sfacts Python library2.

For users who prefer to interact with StrainFacts via the command-line interface (CLI) and plain-text files, scripts are provided as subcommands to convert metagenotype TSVs to this format (see sfacts load --metagenotype). as well as to export key parameter estimates after fitting (see sfacts dump).

For example, to export tab-delimited relative abundance and genotypes tables from examples/sim.filt.fit.world.nc:

sfacts dump examples/sim.filt.fit.world.nc \
    --genotype examples/sim.filt.fit.geno.tsv \
    --community examples/sim.filt.fit.comm.tsv
head sim.filt.fit.refit.geno.tsv sim.filt.fit.refit.comm.tsv

# OUTPUT:
#
# ==> sim.filt.fit.refit.geno.tsv <==
# strain	position	genotypes
# 0	0	0.046489883
# 0	1	0.038462497
# 0	2	9.877303e-07
# 0	3	7.4534853e-07
# 0	4	0.9227885
# 0	5	0.99933404
# 0	6	0.9999933
# 0	7	0.9999994
# 0	8	2.3873392e-06
# 
# ==> sim.filt.fit.refit.comm.tsv <==
# sample	strain	communities
# 0	0	0.00062094553
# 0	1	0.00034649053
# 0	2	0.00017691542
# 0	3	0.00060282537
# 0	4	0.000553946
# 0	5	0.0006001372
# 0	6	0.00032402115
# 0	7	0.6081786
# 0	8	0.00042524032

See Also

Alternative deconvolution tools:

Strain Finder is the most similar to StrainFacts, and works well for reasonably sized data. However, it can be very slow as the number of strains and samples gets large.

Metagenotypers

GT-Pro is the preferred metagenotyper for StrainFacts. While other metagenotypers will also work (e.g. MIDAS or StrainPhlan assuming their outputs are formatted correctly), only bi-allelic and core genome SNP sites are currently supported.

UHGG

GT-Pro is based on the UHGG v1.0 species reference database. This is a subset of the recently updated MGnify / UHGG v2.0 database. The MGnify website provides a convenient way to browse this reference.

Contributing to StrainFacts

Opening new issues and pull requests is greatly appreciated. Any information about how you have tried using StrainFacts and succeeded/failed will be appreciated.

Editable Installation

First, clone the repository:

git clone https://github.com/bsmith89/StrainFacts.git
cd StrainFacts

If you want to modify the example notebooks, configure the clean/smudge filter:

make .git_init

Then install with either conda or pip:

With conda

Conda is highly recommended. While installing with pip will also work, some optional dependencies (in particular Scikit-Bio) often fail to install correctly.

Build the environment:

conda env create -n sfacts-dev -f envs/sfacts-dev.yaml
# Also see `make .conda` to run this command automatically.
conda activate sfacts-dev

This includes an embedded pip install --editable for StrainFacts from the cloned repo.

Integration Testing

make clean
make test

This runs a complex workflow, including

  • simulation
  • data conversion
  • model fitting
  • genotype refitting
  • evaluation of estimates against the simulated ground-truth

Note that the evaluation function at the very end of this integration test will fail if SciKit-Bio is not installed correctly.

Prototyping in Jupyter

Using the StrainFacts API and visualization functions inside a Jupyter Notebook can be very convenient.

To start a JupyterLab server, run the following Make command:

make start_jupyter

TODO

  • A script for converting GT-Pro output TSVs to StrainFacts input data
  • A script for extracting estimates from StrainFacts files to TSVs
  • Complete installation instructions, as well as a robust artifact for running StrainFacts on various platforms (e.g. a Dockerfile or Conda environment specification)
  • Example data with a tutorial/vignette for running StrainFacts and interpreting the output.
    • Simple example fitting a metagenotype
    • Flesh out the example with an explanation of all of the steps.
    • Example data from SRA with a GT-Pro step, instead of simulated
    • Update README/Makefile/Examples to match new CLI interfaces.
  • The parameters selected for use in this paper (which we have found to be applicable across a range of datasets) will be set as the default.
    • Caveat: All of the hyperparameter annealing parameters should still be set by the user
    • UPDATE: A new model/hyperparameters/fitting approach seems broadly superior. This will be described in an upcoming manuscript.
  • We will document some useful approaches for hyperparameter tuning.
  • Improve the CLI documentation.
    • Explicit blurb about how to run the model as in the paper
    • Implement model description strings.
    • Write model descriptions.
  • Refactor for the best looking code
    • De-pluralize core datatypes (community not communities)
    • Remove unused CLI apps/workflows/plotting code
    • Use Python logging module instead of shoe-string info() function.
    • Drop stale code in sfacts.data, sfacts.evaluation, and sfacts.plot
    • Simplify the model structure decorator.
    • Fix mismatch between 'gamma'/'pi' and 'genotype'/'community'. (Probably just change variable names in the model, yeah?)
    • Move choice of strain numbers for fit and init apps to a set of helper functions.
    • Refactor CLI app architecture to be more intuitive.
  • Memoization of expensive calculations on wrapped Datasets and DataArrays.
  • Better tools for model selection (e.g. more evaluation scores based on only metagenotypes and fits)
  • Split out metagenotype/community/genotype handling to its own package: mgtp and make this a StrainFacts dependency.
  • Release on package indexes
    • PyPI
    • Bioconda
  • Upgrade pdist/linkage calculation
    • Better caching of pdist/linkage calculations.
    • Use fastcluster package for linkage calculations
    • Optimal leaf ordering
  • Filter warnings from e.g. numpy/pytorch/xarray in the calling functions (rather than the application layer).

Footnotes

  1. This has benefits for both computational and programmer efficiency.

  2. It should be fairly straight-forward to load these NetCDF formatted files in other programming environments, but this has not been tested.

About

Factorize metagenotypes to infer strains and their abundances

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages