What follows is a usage guide for smRecSearch, as documented in the GitHub repository for Vazquez and Lynch (2021).
**Local Computer:** The `RecSearch` and de novo transcriptome assembly steps are the most resource-intensive parts of this pipeline. `RecSearch` was run on a workstation equipped with two 20-core processors @ 2.4 GHz and 128 GB RAM; however, `RecSearch` has a memory-saver option that lowers the memory requirement to 8-16 GB, depending on the genome. The snakemake rule for `RecSearch` can be edited to use `memory_saver_level = 2` in order to run the pipeline on more common hardware; a sketch of that edit follows.
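As a rough illustration (a hypothetical rule body; only the `memory_saver_level` parameter comes from this documentation, so locate the actual `RecSearch` rule in `rules/RecSearch.smk` and set the parameter in its `params` block):

```python
# Hypothetical sketch of the edit in rules/RecSearch.smk -- the real rule
# has its own inputs, outputs, and other parameters; only memory_saver_level
# is named in this documentation.
rule RecSearch:
    params:
        # 2 = most aggressive memory saving; peak RAM drops to roughly
        # 8-16 GB, depending on the genome being searched.
        memory_saver_level=2
```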
**Computing Clusters & Cloud Computing:** Individual parts of this pipeline were tested on the Midway2 computing cluster at the University of Chicago, which uses a SLURM job scheduler. Courtesy of `snakemake`, the entire pipeline should be compatible with various remote computing workflows, with caveats discussed in "Usage."
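For example, snakemake's generic cluster mode can hand each job to `sbatch`; the partition and resource values below are illustrative, not taken from this repo, so adjust them to your cluster:

```bash
# Run up to 50 concurrent jobs, submitting each snakemake rule to SLURM.
# Partition, walltime, and memory here are placeholders; {threads} is
# filled in per-rule by snakemake.
snakemake --use-conda --jobs 50 \
    --cluster "sbatch --partition=broadwl --time=12:00:00 --mem=16G --cpus-per-task={threads}"
```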
**Operating System:** This pipeline was developed on Linux, with additional testing on Mac computers. I would strongly urge Windows users to set up Windows Subsystem for Linux before running this - and really any - computational analysis.
- Install `conda`, `bioconda`, and `snakemake` on your computer.
- Install `gfServer` and `gfClient` from UCSC: http://hgdownload.soe.ucsc.edu/admin/exe/ (see the download example after this list).
- Clone this repo: `git clone https://github.com/docmanny/smRecSearch.git`
- Create and install the conda environment needed for this pipeline: `conda env create --name RecSearch --file envs/conda_env.yaml`
- Activate the environment using `conda activate RecSearch`.
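If you don't already have the UCSC tools, precompiled binaries can be fetched directly; the commands below assume 64-bit Linux and a `~/.local/bin` install location, so pick the subdirectory under `admin/exe/` that matches your platform:

```bash
# Download precompiled gfServer/gfClient for 64-bit Linux and make them
# executable; other platforms have their own subdirectories under admin/exe/.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gfServer
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gfClient
chmod +x gfServer gfClient

# Put them somewhere on your PATH, e.g.:
mv gfServer gfClient ~/.local/bin/
```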
The core program underlying this pipeline, `RecSearch`, is flexible in its use of genome types and search algorithms. To reproduce this paper, you will need the genomes in the following table. Be sure to save them to the `data/genomes` folder.
| Species | Common Name | Highest Quality Genome | Citation/Link |
|---|---|---|---|
| Choloepus hoffmanni | Hoffmann's two-toed sloth | choHof-C_hoffmanni-2.0.1_HiC | DNAZoo |
| Chrysochloris asiatica | Cape golden mole | chrAsi1 | NCBI: chrAsi1 |
| Dasypus novemcinctus | Nine-banded armadillo | dasNov3 | NCBI: dasNov3 |
| Echinops telfairi | Lesser hedgehog tenrec | echTel2 | NCBI: echTel2 |
| Elephantulus edwardii | Cape elephant shrew | eleEdw1 | NCBI: eleEdw1 |
| Elephas maximus | Asian elephant | eleMaxD | Palkopoulou et al. 2018 |
| Loxodonta africana | African savanna elephant | loxAfr4 | ftp://ftp.broadinstitute.org/pub/assemblies/mammals/elephant/loxAfr4 |
| Loxodonta cyclotis | African forest elephant | loxCycF | Palkopoulou et al. 2018 |
| Mammut americanum | American mastodon | mamAmeI | Palkopoulou et al. 2018 |
| Mammuthus columbi | Columbian mammoth | mamColU | Palkopoulou et al. 2018 |
| Mammuthus primigenius | Woolly mammoth | mamPriV | Palkopoulou et al. 2015 |
| Orycteropus afer | Aardvark | oryAfe2 | DNAZoo |
| Palaeoloxodon antiquus | Straight-tusked elephant | palAntN | Palkopoulou et al. 2018 |
| Procavia capensis | Rock hyrax | proCap-Pcap_2.0_HiC | DNAZoo |
| Trichechus manatus latirostris | Manatee | triManLat2 | DNAZoo |
| Homo sapiens | Human | hg38 | NCBI: hg38 |
To use BLAT via `gfServer`/`gfClient`, you must specify a port for each `gfServer` instance. Additionally, various parts of this pipeline require interconverting between species names and genome assemblies. To facilitate this, a file named `portTable.csv` has been included in the `data` folder; various genomes and species are already listed there along with suggested ports. If these ports are used by other processes on your system, they can be changed without issue.
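As an illustration of how the ports are used (the port number below is made up; use the one assigned to the genome in `portTable.csv`), a BLAT server for a genome is started like so:

```bash
# Serve loxAfr4 over BLAT on an example port; gfClient and the pipeline
# then query this persistent server instead of re-loading the genome for
# every search.
gfServer start localhost 17777 data/genomes/loxAfr4.2bit &
```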
You should save all query sequences in `data/input`. Note that the pipeline currently assumes FASTA input sequences with the extension ".fa"; however, this can easily be changed in `rules/RecSearch.smk`. For convenience, the `AvA.fa` file containing the master list of sequences is included in `data/input`.
Our pipeline will generate de novo transcriptomes for target genomes using SRA identifiers and the HISAT2-StringTie pipeline. A table of SRA accessions for Loxodonta africana, Trichechus manatus, and Dasypus novemcinctus is included in `data/SraRunTable/SraRunTable.csv`.
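The pipeline runs these steps automatically, but for orientation, the HISAT2-StringTie route for a single paired-end accession looks roughly like this (the accession and file names are placeholders; the real accessions live in `data/SraRunTable/SraRunTable.csv`):

```bash
# Fetch reads for one (placeholder) SRA accession with sra-tools.
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX -O fastq/

# Index the genome and align with HISAT2; --dta tailors alignments
# for downstream transcript assembly with StringTie.
hisat2-build loxAfr4.fa loxAfr4_idx
hisat2 --dta -p 8 -x loxAfr4_idx \
    -1 fastq/SRRXXXXXXX_1.fastq -2 fastq/SRRXXXXXXX_2.fastq \
  | samtools sort -o SRRXXXXXXX.sorted.bam
samtools index SRRXXXXXXX.sorted.bam

# Assemble transcripts from the sorted alignments.
stringtie SRRXXXXXXX.sorted.bam -p 8 -o SRRXXXXXXX.gtf
```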
In addition to performing RBH searches, this pipeline can intersect the hits with other lines of evidence to validate the results and return a list of "evidenced" hits. To do so, simply download the other evidence as either a BED or a GFF file into `data/BED` or `data/GFF`, respectively.
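Conceptually, the "evidenced" filter keeps hits that overlap at least one evidence feature. The pipeline handles this internally, but outside of it you could approximate the step with bedtools (assuming bedtools is installed; the hit file name below is a placeholder):

```bash
# Keep only RBH hits (-a) that overlap at least one evidence feature (-b);
# -u reports each qualifying hit once, regardless of how many features it hits.
bedtools intersect -u -a hits.bed -b data/BED/loxAfr4-finalGuide.bed > evidenced_hits.bed
```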
I highly suggest both reading the `snakemake` tutorial and looking through the different snakefiles in the `rules/` folder to familiarize yourself with the rules.
While it is possible to run the entire pipeline from start to finish using `snakemake --use-conda publication/manuscript.pdf`, I would strongly recommend the following execution order:
- Generate the modified `hg38` for reciprocal best hit searches using your query file of interest:
  `snakemake --use-conda output/recBlastDBPrep/hg38_maskRep_noVarChr_fragWithGenes.2bit`
- Next, confirm that `RecSearch` runs correctly for one genome:
  `snakemake --use-conda -npr output/loxAfr4/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/RBB/loxAfr4_RecBlastOutput.bed.rbb`
- Generate the transcriptomic evidence for functional duplicates:
  `snakemake --use-conda -npr data/BED/loxAfr4-finalGuide.bed`
Once these steps have been debugged, you can either repeat the RecSearch and evidence steps individually for each genome, or continue to the next step and allow `snakemake` to generate them automatically.
- Generate the table of genes per genome with copy number and ECNC scores (plus required files):
  `snakemake --use-conda -npr output/geneCopyTable/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/atlantogenata-GeneCopyTable_{RBB,evidenced}_filtered_long.csv`
- Generate the maximum-likelihood tree for ancestral gene copy numbers:
  `snakemake --use-conda -npr output/iqtree/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/maxLikelihood_model_MK+FQ+I+G4-dataType_MORPH-asrMin_0.8/atlantogenata_RBB_filtered_dyn.iqtree`
- Generate the final manuscript:
  `snakemake --use-conda -npr publication/manuscript.pdf`
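Note that `-npr` performs a dry run (`-n`), prints the shell commands (`-p`), and gives the reason each rule fires (`-r`). Once a dry run looks correct, drop the `-n` to actually execute, for example:

```bash
# Execute the full pipeline with 16 local cores; the core count is
# illustrative -- set it to match your machine.
snakemake --use-conda --cores 16 publication/manuscript.pdf
```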
If you run into any issues with this pipeline, or see that there are incomplete rules, please submit an Issue with a reproducible example; including all error codes, log files, and any leads or investigative work you have done will go a long way toward making this manuscript and workflow even better. While I can't guarantee that this will work on any computer, I can at least attest that it works well on Linux and Mac systems, as well as on SLURM clusters.