StrainFacts

StrainFacts is a "strain deconvolution" tool, which infers both strain genotypes and their relative abundance across samples directly from metagenotype data.

A "metagenotype" for a particular species in a particular sample is an multidimensional array counting the number of reads (from shotgun metagenomic sequencing, usually) that contained a particular allele at polymorphic sites in that species's genome. Metagenotypes from multiple (maybe many) samples stacked together form the main input data-type analyzed by StrainFacts.

Citation

For detailed information, check out the paper:

B.J. Smith, X. Li, Z.J. Shi, A. Abate, K.S. Pollard. Scalable microbial strain inference in metagenomic data using StrainFacts. Frontiers in Bioinformatics (2022) doi:10.3389/fbinf.2022.867386

Installation

The simplest possible installation is with pip.

pip install StrainFacts

However, installing as a conda environment (described below) is probably the more robust option.

Installing PyTorch/Pyro with GPU support may be more challenging. Consider using a Docker/Singularity container like this one, which has many of the necessary prerequisites (but excluding StrainFacts itself).

Examples

Short vignettes explaining how to use StrainFacts for data processing, visualization, and evaluation are provided as Jupyter Notebooks:

Usage

(NOTE: All of the files described in those notebooks and the usage examples below can be built automatically with Make e.g. try running: make examples/sim.filt.fit.world.nc.)

Get help

Get CLI help:

sfacts --help  # List subcommands
sfacts fit --help  # Subcommand specific CLI usage

Fit a model to metagenotype data

Fit metagenotype data with 15 strains and default everything:

sfacts fit \
    --verbose \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.mgen.nc sim.filt.fit.world.nc

For the manuscript, StrainFacts was run with the following hyperparameters set explicitly:

sfacts fit -m ssdd3_with_error  \
    --verbose --device cpu \
    --precision 32 \
    --random-seed 0 \
    --strains-per-sample 0.3 \
    --num-positions 5000 \
    --hyperparameters gamma_hyper=1e-10 \
    --hyperparameters pi_hyper=0.3 \
    --hyperparameters rho_hyper=0.5 \
    --hyperparameters alpha_hyper_mean=10.0 \
    --hyperparameters alpha_hyper_scale=1e-06 \
    --anneal-hyperparameters gamma_hyper=1.0 \
    --anneal-hyperparameters rho_hyper=5.0 \
    --anneal-hyperparameters pi_hyper=1.0 \
    --anneal-steps 10000 --anneal-wait 2000 \
    --optimizer-learning-rate 0.05 \
    --min-optimizer-learning-rate 1e-06 \
    --max-iter 1_000_000 --lag1 50 --lag2 100 \
    examples/sim.filt.mgen.nc examples/sim.filt.fit.world.nc

In addition, values for --strains-per-sample/--num-strains, --num-positions, --precision, --random-seed, --device and input/output files were as described in the paper.

Data Formats

Intermediate files generated by StrainFacts use a NetCDF-based file format, and harness the xarray library for reading and writing¹. Such files include input metagenotypes (suggested file extension *.mgen.nc), as well as more complex collections of both latent and observed variables aligned over their strain/sample/position axes (*.world.nc). All of these files are designed to be loaded and worked with using the sfacts Python library².

For users who prefer to interact with StrainFacts via the command-line interface (CLI) and plain-text files, scripts are provided as subcommands to convert metagenotype TSVs to this format (see sfacts load --metagenotype). as well as to export key parameter estimates after fitting (see sfacts dump).

For example, to export tab-delimited relative abundance and genotypes tables from examples/sim.filt.fit.world.nc:

sfacts dump examples/sim.filt.fit.world.nc \
    --genotype examples/sim.filt.fit.geno.tsv \
    --community examples/sim.filt.fit.comm.tsv

head sim.filt.fit.refit.geno.tsv sim.filt.fit.refit.comm.tsv

# OUTPUT:
#
# ==> sim.filt.fit.refit.geno.tsv <==
# strain	position	genotypes
# 0	0	0.046489883
# 0	1	0.038462497
# 0	2	9.877303e-07
# 0	3	7.4534853e-07
# 0	4	0.9227885
# 0	5	0.99933404
# 0	6	0.9999933
# 0	7	0.9999994
# 0	8	2.3873392e-06
# 
# ==> sim.filt.fit.refit.comm.tsv <==
# sample	strain	communities
# 0	0	0.00062094553
# 0	1	0.00034649053
# 0	2	0.00017691542
# 0	3	0.00060282537
# 0	4	0.000553946
# 0	5	0.0006001372
# 0	6	0.00032402115
# 0	7	0.6081786
# 0	8	0.00042524032

Contributing to StrainFacts

Opening new issues and pull requests is greatly appreciated. Any information about how you have tried using StrainFacts and succeeded/failed will be appreciated.

Editable Installation

First, clone the repository:

git clone https://github.com/bsmith89/StrainFacts.git
cd StrainFacts

If you want to modify the example notebooks, configure the clean/smudge filter:

make .git_init

Then install with either conda or pip:

With `conda`

Conda is highly recommended. While installing with pip will also work, some optional dependencies (in particular Scikit-Bio) often fail to install correctly.

Build the environment:

conda env create -n sfacts-dev -f envs/sfacts-dev.yaml
# Also see `make .conda` to run this command automatically.
conda activate sfacts-dev

This includes an embedded pip install --editable for StrainFacts from the cloned repo.

Integration Testing

make clean
make test

This runs a complex workflow, including

simulation
data conversion
model fitting
genotype refitting
evaluation of estimates against the simulated ground-truth

Note that the evaluation function at the very end of this integration test will fail if SciKit-Bio is not installed correctly.

Prototyping in Jupyter

Using the StrainFacts API and visualization functions inside a Jupyter Notebook can be very convenient.

To start a JupyterLab server, run the following Make command:

make start_jupyter

TODO

This has benefits for both computational and programmer efficiency. ↩
It should be fairly straight-forward to load these NetCDF formatted files in other programming environments, but this has not been tested. ↩

Name		Name	Last commit message	Last commit date
Latest commit History 554 Commits
.devcontainer		.devcontainer
.github/workflows		.github/workflows
envs		envs
examples		examples
scripts		scripts
sfacts		sfacts
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
conda_env.yaml		conda_env.yaml
requirements.pip		requirements.pip
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

StrainFacts

Citation

Installation

Examples

Usage

Get help

Fit a model to metagenotype data

Data Formats

See Also

Alternative deconvolution tools:

Metagenotypers

UHGG

Contributing to StrainFacts

Editable Installation

With `conda`

Integration Testing

Prototyping in Jupyter

TODO

About

Releases 5

Packages

Contributors 2

Languages

License

bsmith89/StrainFacts

Folders and files

Latest commit

History

Repository files navigation

StrainFacts

Citation

Installation

Examples

Usage

Get help

Fit a model to metagenotype data

Data Formats

See Also

Alternative deconvolution tools:

Metagenotypers

UHGG

Contributing to StrainFacts

Editable Installation

With conda

Integration Testing

Prototyping in Jupyter

TODO

Footnotes

About

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Contributors 2

Languages

With `conda`

Packages