Table of Contents
This is a simple and crudely made Snakemake workflow for compiling phylogenetic datasets from extracted coding sequences of nuclear (for now) protein-coding genes. It can work reasonably well with sequences retrieved by running BUSCO as input. There is very little flexibility as to the tools used for the different steps and the parameters as this mainly for my personal use at the moment. I plan to add new functionality (alternative tools for the same rules, or more options in the rules themselves) later.
-
You need to have
snakemake
andGNU Coreutils
installed, both of which can be installed in aconda
environment. I will add anenvironment.yaml
for this workflow that will make things easier later. -
Clone this repository recursively and cd into it
git clone --recursive https://github.com/etkayapar/pcc
cd pcc
- You need to have all your input genes as fasta files in
input/genewise_fastas
(relative to the project root directory), and their names in a single-column (for now)gene_table.tsv
file in the project root directory that looks like below.
(First line is a header)
gene
geneName1
geneName2
assuming that you have the FASTA files in paths like:
input/genewise_fastas/geneName1.fa
input/genewise_fastas/geneName2.fa
Gene names should not have any underscores or hyphens or any other breaking character (for now) they can only include alphanumeric characters (upper or lowercase letters are fine). FASTA files themselves are exptected to have the extension .fa
and not any other common versions such has .fna
or .fasta
It is only possible to run it in three distinct steps with the target rules are named as first_pass
, second_pass
, and conclude
.
snakemake --sdm conda --cores <NUM_THREADS> first_pass
Would do:
- Align the translated gene sequences by
mafft
with theL-INS-i
algorithm - Backtranslate the aligned amino-acid alignments into nucleotide alignments
- Remove fasta entries from genes that consist of gaps entirely (Genes that had a header but no sequence in the input nucleotide fasta files)
- Infer gene trees using IQ-TREE
- Collect all gene trees into a single
.treefile
- Detect potential outlier sequences from gene trees using TreeShrink
- Remove sequences from gene alignment (or discard entire gene alignments) according to the output of the previous step.
snakemake --sdm conda --cores <NUM_THREADS> second_pass
- Unalign the the processed gene alignment sequences that made the filtering step, translate into amino-acid sequences.
- Realign the amino-acid sequences
- Run
trimAl
with--automated1
heuristic to mark candidate columns to retain after getting rid of gap-rich columns and backtranslate the amino-acid alignment to a nucleotide alignment while only keeping the columns deemed ok bytrimAl
. - Repeat steps 5-7 for the trimal-processed alignments.
snakemake --sdm conda --cores <NUM_THREADS> conclude
- Repeat steps 8,9 for the trimal-processed alignments.
- Backtranslate the resulting amino-acid alignments.
- Collect the final nucleotide alignments into a directory
- run
concat-aln
to concatenate the genewise alignments into a supermatrix inPHYLIP
format (output/supermatrix.phy
) and also generate a partition table in theNEXUS
format (output/supermatrix.nex
)
If you are sure that you don't want to supervise how to workflow executes in between the three steps explained above, you may try to run the entire workflow by using:
snakemake --sdm conda --cores <NUM_THREADS> first_pass && \
snakemake --sdm conda --cores <NUM_THREADS> second_pass && \
snakemake --sdm conda --cores <NUM_THREADS> conclude
This sequence of snakemake calls should run the three consecutive steps while running a given step only if the previous step was successfully executed.