IMPORTANT

Please don't use this initial version - use this one instead

MAGpy

MAGpy (pronounced mag-pie) is a Snakemake pipeline for annotating metagenome-assembled genomes (MAGs).

NOTE: MAGpy is in "alpha" status. We like to release early and often. MAGpy works on our system and was used in https://doi.org/10.1101/162578, but we cannot guarantee it will work on your system. We may not have enough resources at present to help you get it working on your system either, sorry, but we will try.

MAGpy takes as input a directory of FASTA files (with the extension .fa). Each FASTA file should contain contigs that make up a Metagenome-Assembled Genome, or MAG. MAGpy then runs a range of software tools that help annotate and characterise those genomes. Specifically:

CheckM is run to characterise genome completeness and contamination
Sourmash is used to compare the genomes to RefSeq and GenBank genomes
Prodigal is used to predict protein sequences
Diamond is used to map the protein sequences against UniProt
Scripts are run to generate reports from the diamond and sourmash outputs
Ete3 scripts are used to update taxonomic annotations with full(er) lineages
PhyloPhlAn is used to generate a tree from the genomes
GraPhlAn is used to draw the tree using annotations previously generated [not yet implemented]
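
Before starting a run it can help to confirm that the input directory really contains .fa files that look like FASTA. The following is a minimal sketch, not part of MAGpy; the function name check_mag_dir and the directory you pass to it are assumptions made for illustration.

```shell
#!/bin/bash
# Count the .fa files in a directory and flag any that do not look like FASTA.
# "check_mag_dir" is an illustrative name, not something MAGpy provides.
check_mag_dir() {
    local dir=$1 n=0 bad=0 f
    for f in "$dir"/*.fa; do
        [ -e "$f" ] || continue          # the glob matched nothing
        n=$((n + 1))
        # a FASTA file should begin with a ">" header line
        if [ "$(head -c 1 "$f")" != ">" ]; then
            echo "not FASTA? $f" >&2
            bad=$((bad + 1))
        fi
    done
    echo "$n MAG(s) found, $bad suspect"
    # succeed only if there is at least one file and none are suspect
    [ "$n" -gt 0 ] && [ "$bad" -eq 0 ]
}
```

For example, check_mag_dir /path/to/mags prints a one-line summary and returns a non-zero status if the directory is empty or any file does not start with ">".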

How to run

MAGpy can be run in basic mode with:

snakemake -s /path/to/MAGpy

Outputs will be placed into the current working directory, so make sure you have write access.

To test which commands snakemake will run, you can try:

snakemake -np -s /path/to/MAGpy

On any serious number of MAGs, this basic operation will take a very long time, as each job is run in serial (i.e. one after the other). Fortunately, Snakemake can submit jobs to most HPC clusters. There are some instructions here.

Here at Edinburgh, we run an SGE cluster, and this is how we run MAGpy on it:

snakemake --cluster-config MAGpy.json --cluster "qsub -cwd -pe sharedmem {cluster.core} -l h_rt={cluster.time} -l h_vmem={cluster.vmem} -P {cluster.proj}" --jobs 1000

This mode looks into the MAGpy.json file for cluster configurations relating to each type of job; the jobs are "rules" within the MAGpy snakefile.
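
To make the shape of such a file concrete, here is a hedged sketch that writes an example cluster config to a scratch file. The keys core, time, vmem and proj are taken from the placeholders in the qsub command above, and __default__ is Snakemake's standard entry for settings shared by all rules. The rule name "diamond" and every value shown are assumptions; check the MAGpy.json shipped with MAGpy for the real entries.

```shell
#!/bin/bash
# Write an illustrative cluster config to a scratch file (not MAGpy.json itself).
# The rule name "diamond" and all values below are assumptions.
example=$(mktemp)
cat > "$example" <<'EOF'
{
    "__default__": {
        "core": 1,
        "time": "24:00:00",
        "vmem": "4G",
        "proj": "my_project"
    },
    "diamond": {
        "core": 16,
        "time": "48:00:00",
        "vmem": "8G",
        "proj": "my_project"
    }
}
EOF
echo "wrote $example"
```

Rules without their own entry fall back to __default__, so only the heavier jobs need specific settings.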

The integration of PhyloPhlAn

OK, this is a bit complex. Essentially, PhyloPhlAn has a few foibles, which are:

Input to PhyloPhlAn has to be placed in the input/ directory contained within the PhyloPhlAn install directory
Output from PhyloPhlAn is written to the output/ directory within the PhyloPhlAn install directory
The PhyloPhlAn process has to be run from the root of the PhyloPhlAn install directory

Therefore, whichever user runs the MAGpy process, whether on a cluster or a single machine, must have read and write access to the input/ and output/ directories in the PhyloPhlAn install directory.

Here is what MAGpy attempts to do:

It attempts to create a symbolic link from the input/ directory to the newly created proteins directory using ln -s
It then attempts to cd into the PhyloPhlAn install directory
From there, it runs PhyloPhlAn
When finished, it attempts to mv the results folder back to the original directory (into a folder named tree)
MAGpy then changes back to the original working directory
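
The sequence above can be sketched in shell roughly as follows. This is an illustration of the link / cd / run / mv pattern under stated assumptions, not the code MAGpy actually uses: the function name, its arguments and the project name are hypothetical, and the PhyloPhlAn command itself is left to the caller because this README does not spell it out.

```shell
#!/bin/bash
# Sketch of the link / cd / run / mv sequence described above.
# All names here (function, arguments) are illustrative, not MAGpy's own.
run_phylophlan_sketch() {
    local install_dir=$1   # PhyloPhlAn install directory
    local proteins_dir=$2  # directory of predicted proteins
    local project=$3       # name used under input/ and output/
    local dest=$4          # where to move the results, e.g. ./tree
    shift 4                # remaining arguments: the command to run

    # resolve to an absolute path so the link works from anywhere
    local abs_proteins
    abs_proteins=$(cd "$proteins_dir" && pwd) || return 1

    # 1. link the proteins directory into input/
    ln -s "$abs_proteins" "$install_dir/input/$project" || return 1

    # 2-3. run the command from the install root; using a subshell means
    #      the caller's working directory is left unchanged afterwards
    ( cd "$install_dir" && "$@" ) || return 1

    # 4. move the results back to the requested destination
    mv "$install_dir/output/$project" "$dest"
}
```

Running the command inside a subshell is a design choice for the sketch: it removes the need for an explicit cd back, so a failure part-way through cannot leave the caller in the wrong directory.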

Now obviously this is a bit, erm, hacky but as long as permissions are set on the PhyloPhlAn directory correctly, it should work.

Abstraction of executables

Here in Edinburgh, access to software on the cluster nodes is controlled by the module command. So, if you want to run Samtools, it's:

module load samtools

samtools view my.bam

This isn't easy to integrate into Snakemake for every command, so we have created a shell script for each software tool we need to use. The sole purpose of that shell script is to set the correct environment for the tool to run, and then to run the tool with the arguments passed to the shell script.

So our shell script for Samtools would be:

#!/bin/bash

module load samtools

samtools "$@"

(we don't use Samtools, this is just an example!)

So basically, what you need to do is edit the shell scripts so that they set the correct environment for the tool in question and then run the tool on your specific setup. Then you need to put these in your PATH; MAGpy runs these shell scripts rather than the executables themselves. This allows a lot of flexibility for running MAGpy on a range of different setups.
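
Once the wrapper scripts are in place, a quick check that they can all be found on your PATH may save a failed run. A minimal sketch, assuming nothing about the wrappers' real names: check_wrappers is a hypothetical helper, and you pass it whatever script names your copy of MAGpy expects.

```shell
#!/bin/bash
# Report which wrapper scripts cannot be found on PATH.
# "check_wrappers" is an illustrative helper, not part of MAGpy.
check_wrappers() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # succeed only if every name was found
    [ "$missing" -eq 0 ]
}
```

Call it with the wrapper names as arguments; the exit status is non-zero if any of them is missing.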

Databases and data sources

MAGpy needs to know the location of various Diamond databases, Pfam, etc. You will need to edit MAGpy to point to these resources.

Dependencies

Snakemake (tested with 4.1.0)
Python (tested with 2.7.5)
Perl (tested with 5.16.3)
BioPerl (tested with 1.6.924)
Ete3 (tested with 3.0.0b36)
CheckM (tested with 1.0.5)
HMMER (tested with 3.1b2)
pplacer (tested with 1.1)
FastTree (tested with 2.1.10)
Prodigal (tested with 2.6.3)
Diamond (tested with 0.8.22)
Sourmash (tested with 2.0.0a1)
PhyloPhlAn (tested with 0.99)
pfam_scan.pl (tested with 1.6)
Color::Mix
BioPython
GraPhlAn

Requirements

UniProt TrEMBL and/or Swiss-Prot, formatted for searching by Diamond
Sourmash indices for both RefSeq and GenBank genomes
Pfam - needs to be pressed with hmmpress, see pfam_scan.pl help
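
As a rough guide to the formatting steps, the sketch below prints the Diamond and HMMER commands typically used for this (diamond makedb and hmmpress). The file names are placeholders and the helper functions are illustrative; by default nothing is executed, and setting DRY_RUN=0 runs the commands for real. Sourmash indices are not covered here.

```shell
#!/bin/bash
# Print (or, with DRY_RUN=0, execute) the database formatting commands.
# File names are placeholders; "run" and "format_databases" are illustrative.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

format_databases() {
    local uniprot_fasta=$1 diamond_db=$2 pfam_hmm=$3
    # build a Diamond database from a UniProt FASTA file
    run diamond makedb --in "$uniprot_fasta" -d "$diamond_db"
    # press the Pfam HMM library so HMMER-based tools can use it
    run hmmpress "$pfam_hmm"
}

format_databases uniprot_trembl.fasta uniprot_trembl Pfam-A.hmm
```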
