This repository has been archived by the owner on Nov 7, 2022. It is now read-only.

SchlossLab/Sovacool_OptiFit_mSphere_2022

OptiFit

an improved method for fitting amplicon sequences to existing OTUs

This repository contains the complete analysis workflow used to benchmark the OptiFit algorithm in mothur and produce the accompanying manuscript. Find details on how to use OptiFit and descriptions of the parameter options on the mothur wiki: https://mothur.org/wiki/cluster.fit/.

Citation

Sovacool KL, Westcott SL, Mumphrey MB, Dotson GA, Schloss PD. 2022. OptiFit: An Improved Method for Fitting Amplicon Sequences to Existing OTUs. mSphere. http://dx.doi.org/10.1128/msphere.00916-21

A BibTeX entry for LaTeX users:

@article{sovacool_optifit_2022,
author = {Kelly L. Sovacool and Sarah L. Westcott and M. Brodie Mumphrey and Gabrielle A. Dotson and Patrick D. Schloss},
title = {OptiFit: an Improved Method for Fitting Amplicon Sequences to Existing OTUs},
journal = {mSphere},
year = {2022},
doi = {10.1128/msphere.00916-21},
URL = {https://journals.asm.org/doi/10.1128/msphere.00916-21}
}

The Workflow

The workflow is split into five subworkflows:

  • 0_prep_db — download & preprocess reference databases.
  • 1_prep_samples — download, preprocess, & de novo cluster the sample datasets.
  • 2_fit_reference_db — fit datasets to reference databases.
  • 3_fit_sample_split — split datasets; cluster one fraction de novo and fit the remaining sequences to the de novo OTUs.
  • 4_vsearch — run vsearch clustering for comparison.

The main workflow (Snakefile) creates plots from the results of the subworkflows and renders the paper.
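
As a quick sanity check before running anything, the sketch below confirms that each subworkflow directory contains its own Snakefile. The directory names come from the list above. The `check_subworkflows` function name and the optional repository-root argument are illustrative assumptions, not something this repository provides.

```shell
# Subworkflow names, taken from the list above.
subworkflows="0_prep_db 1_prep_samples 2_fit_reference_db 3_fit_sample_split 4_vsearch"

# Report every subworkflow directory that lacks a Snakefile.
# Usage: check_subworkflows [repo_root]   (repo_root defaults to the current directory)
# Returns 0 if all Snakefiles are present, 1 otherwise.
check_subworkflows() {
    root="${1:-.}"
    missing=0
    for name in $subworkflows; do
        if [ ! -f "$root/subworkflows/$name/Snakefile" ]; then
            echo "missing: subworkflows/$name/Snakefile"
            missing=1
        fi
    done
    return "$missing"
}
```

If the check passes, a Snakemake dry run (`snakemake --cores 4 --dry-run`) previews the jobs that would run without executing them.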

Quickstart

  1. Before cloning, configure git symlinks:

     git config --global core.symlinks true

    Otherwise, git will create text files in place of symlinks.

  2. Clone this repository.

     git clone https://github.com/SchlossLab/Sovacool_OptiFit_mSphere_2022
     cd Sovacool_OptiFit_mSphere_2022
  3. Install the dependencies.

    Almost all dependencies are listed in the conda environment file, which includes everything needed to run the analysis workflow.

    conda env create -f config/env.simple.yaml
    conda activate optifit

    Additionally, I used a custom version of ggraph for the algorithm figure. You can install it with devtools from R:

    devtools::install_github('kelly-sovacool/ggraph', ref = 'iss-297_ggtext')

    If you do not have LaTeX already, you'll need to install a LaTeX distribution before rendering the manuscript as a PDF. You can use tinytex to do so:

    tinytex::install_tinytex()

    I also used latexdiffr to create a PDF with changes tracked prior to submitting revisions to the journal.

    devtools::install_github("hughjonesd/latexdiffr")
  4. Run the entire pipeline.

    Locally:

    snakemake --cores 4
    

    Or on an HPC running slurm:

    sbatch code/slurm/submit_all.sh
    

    (You will first need to edit your email and slurm account info in the submission script and cluster config.)
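
If you skipped step 1, symlinks in the clone may have been checked out as plain text files. Here is a minimal sketch for spotting them, assuming git is on your PATH. Git records symlinks with file mode 120000, so any tracked path with that mode that is not a symlink on disk was likely affected. The `broken_symlinks` function name is hypothetical and not part of this repository.

```shell
# Print tracked paths that git stores as symlinks (mode 120000)
# but that are not symlinks in the working tree.
# Usage: broken_symlinks [repo_root]   (repo_root defaults to the current directory)
broken_symlinks() {
    repo="${1:-.}"
    # Each line of `git ls-files -s` is: <mode> <object> <stage><TAB><path>
    git -C "$repo" ls-files -s | while read -r mode _hash _stage path; do
        if [ "$mode" = "120000" ] && [ ! -L "$repo/$path" ]; then
            echo "$path"
        fi
    done
}
```

Run `broken_symlinks .` from the repository root. No output means every tracked symlink was checked out as a symlink. This sketch does not handle paths with unusual characters, which git quotes in this listing.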

Directory Structure

.
├── OptiFit.Rproj
├── README.md
├── Snakefile
├── code
│   ├── R
│   ├── bash
│   ├── py
│   ├── slurm
│   └── tests
├── config
│   ├── cluster.json
│   ├── config.yaml
│   ├── config_test.yaml
│   ├── env.export.yaml
│   ├── env.simple.yaml
│   └── slurm
│       └── config.yaml
├── docs
│   ├── paper.md
│   ├── paper.pdf
│   └── slides
├── exploratory
│   ├── 2018_fall_rotation
│   ├── 2019_winter_rotation
│   ├── 2020-05_May-Oct
│   ├── 2020-11_Nov-Dec
│   ├── 2021
│   │   ├── figures
│   │   ├── plots.Rmd
│   │   └── plots.md
│   ├── AnalysisRoadmap.md
│   └── DeveloperNotes.md
├── figures
├── log
├── paper
│   ├── figures.yaml
│   ├── head.tex
│   ├── msphere.csl
│   ├── paper.Rmd
│   ├── preamble.tex
│   └── references.bib
├── results
│   ├── aggregated.tsv
│   ├── stats.RData
│   └── summarized.tsv
└── subworkflows
    ├── 0_prep_db
    │   ├── README.md
    │   └── Snakefile
    ├── 1_prep_samples
    │   ├── README.md
    │   ├── Snakefile
    │   ├── data
    │   │   ├── human
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── marine
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── mouse
    │   │   │   └── SRR_Acc_List.txt
    │   │   └── soil
    │   │       └── SRR_Acc_List.txt
    │   └── results
    │       ├── dataset_sizes.tsv
    │       └── opticlust_results.tsv
    ├── 2_fit_reference_db
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── denovo_dbs.tsv
    │       ├── optifit_dbs_results.tsv
    │       └── ref_sizes.tsv
    ├── 3_fit_sample_split
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── optifit_crit_check.tsv
    │       └── optifit_split_results.tsv
    └── 4_vsearch
        ├── README.md
        ├── Snakefile
        └── results
            └── vsearch_results.tsv