PCC (Phylogenetic dataset Compiler Collection)

Table of Contents

PCC (Phylogenetic dataset Compiler Collection)
- What do you need to use it
- Explanation of the workflow

PCC (Phylogenetic dataset Compiler Collection)

This is a simple and crudely made Snakemake workflow for compiling phylogenetic datasets from extracted coding sequences of nuclear (for now) protein-coding genes. It can work reasonably well with sequences retrieved by running BUSCO as input. There is very little flexibility as to the tools used for the different steps and the parameters as this mainly for my personal use at the moment. I plan to add new functionality (alternative tools for the same rules, or more options in the rules themselves) later.

What do you need to use it

You need to have snakemake and GNU Coreutils installed, both of which can be installed in a conda environment. I will add an environment.yaml for this workflow that will make things easier later.
Clone this repository recursively and cd into it

git clone --recursive https://github.com/etkayapar/pcc
cd pcc

You need to have all your input genes as fasta files in input/genewise_fastas (relative to the project root directory), and their names in a single-column (for now) gene_table.tsv file in the project root directory that looks like below.

(First line is a header)

gene
geneName1
geneName2

assuming that you have the FASTA files in paths like:

input/genewise_fastas/geneName1.fa
input/genewise_fastas/geneName2.fa

Gene names should not have any underscores or hyphens or any other breaking character (for now) they can only include alphanumeric characters (upper or lowercase letters are fine). FASTA files themselves are exptected to have the extension .fa and not any other common versions such has .fna or .fasta

Explanation of the workflow

It is only possible to run it in three distinct steps with the target rules are named as first_pass, second_pass, and conclude.

snakemake --sdm conda --cores <NUM_THREADS> first_pass

Would do:

Align the translated gene sequences by mafft with the L-INS-i algorithm
Backtranslate the aligned amino-acid alignments into nucleotide alignments
Remove fasta entries from genes that consist of gaps entirely (Genes that had a header but no sequence in the input nucleotide fasta files)
Infer gene trees using IQ-TREE
Collect all gene trees into a single .treefile
Detect potential outlier sequences from gene trees using TreeShrink
Remove sequences from gene alignment (or discard entire gene alignments) according to the output of the previous step.

snakemake --sdm conda --cores <NUM_THREADS> second_pass

Unalign the the processed gene alignment sequences that made the filtering step, translate into amino-acid sequences.
Realign the amino-acid sequences
Run trimAl with --automated1 heuristic to mark candidate columns to retain after getting rid of gap-rich columns and backtranslate the amino-acid alignment to a nucleotide alignment while only keeping the columns deemed ok by trimAl.
Repeat steps 5-7 for the trimal-processed alignments.

snakemake --sdm conda --cores <NUM_THREADS> conclude

Repeat steps 8,9 for the trimal-processed alignments.
Backtranslate the resulting amino-acid alignments.
Collect the final nucleotide alignments into a directory
run concat-aln to concatenate the genewise alignments into a supermatrix in PHYLIP format (output/supermatrix.phy) and also generate a partition table in the NEXUS format (output/supermatrix.nex)

If you are sure that you don't want to supervise how to workflow executes in between the three steps explained above, you may try to run the entire workflow by using:

snakemake --sdm conda --cores <NUM_THREADS> first_pass && \
snakemake --sdm conda --cores <NUM_THREADS> second_pass && \
snakemake --sdm conda --cores <NUM_THREADS> conclude

This sequence of snakemake calls should run the three consecutive steps while running a given step only if the previous step was successfully executed.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
envs		envs
figs		figs
rules		rules
utils		utils
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PCC (Phylogenetic dataset Compiler Collection)

What do you need to use it

Explanation of the workflow

About

Releases

Packages

Languages

License

etkayapar/pcc

Folders and files

Latest commit

History

Repository files navigation

PCC (Phylogenetic dataset Compiler Collection)

What do you need to use it

Explanation of the workflow

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages