Dram1.4rc #207

Merged
merged 72 commits into from
Sep 28, 2022
cc601f4
dbcan: ++ best hits, subfam ec numbers columns change cutoff to 1e-18
rmFlynn Mar 1, 2022
7497f5b
fix up downloads, for dbcan and others
rmFlynn Mar 3, 2022
8639135
Finish up download, and setup
rmFlynn Mar 4, 2022
d33c91f
Merge branch 'master' into dbcan_fix
Mar 7, 2022
1c4d894
Check that ECs go through the full pipeline.
Apr 6, 2022
cc05d81
Add the first logging attempt
Apr 25, 2022
35f24cc
Stuff for logging, move code in cmd to facilitate
Apr 25, 2022
f589ee9
change log file
Apr 25, 2022
ef0b520
changed print statements to logging.info and removed start_time on cm…
ileleiwi May 2, 2022
e5e7f19
fixed setup_logger call
ileleiwi May 2, 2022
72e144d
added an output directory for log and replaced all print statements i…
ileleiwi May 2, 2022
544b95c
added logging to prepare_databases
ileleiwi May 2, 2022
ffd709f
added logging and logging.info messages where files are written out i…
ileleiwi May 2, 2022
70bc7bd
Add one doc string
May 2, 2022
83e07e8
Fix ec numbers, fully add camper, updating
May 5, 2022
3d45fa2
Merge branch 'dram1.4_leleiwi' into dbcan_fix
May 24, 2022
f27835c
add new config
May 24, 2022
1e47309
BIO: Add in dbcan_fix
May 24, 2022
05161f8
Add camper, and genie and also add logging
May 31, 2022
d82989b
small bug fixes, missing arguments, ext
Jun 3, 2022
0f04f41
need to add back in the locs
Jun 3, 2022
cc45734
Finished most of the setup bugs
Jun 7, 2022
5882e1a
less bugs
Jun 10, 2022
d8d7511
Less bugs
Jun 10, 2022
18cf7b8
Merge branch 'dbcan_fix' into finalize_setup
Jun 10, 2022
a217289
Chrush bugs, logging for distill, sulphur placeholder
Jun 20, 2022
be5dfae
Start the process of fixing tests, validating
Jun 24, 2022
d323b09
Now sulphur place holder is inplace
Jun 24, 2022
dbc61d4
Sulpher stuff and bugs
Jun 24, 2022
57961ab
add log back
Jun 27, 2022
01b33e8
Fix update descriptions
Jun 28, 2022
028c38d
working updates
Jul 5, 2022
36d2a7a
Debuging tweeks
Jul 5, 2022
850412f
so many bugs patched
Jul 8, 2022
f29d683
Fix all tests, add adjectives to strainer
Jul 27, 2022
86f9eaf
Enable all databases
Jul 27, 2022
53f9664
add sulphur
Jul 27, 2022
0217263
Remove campers but now I need to add it back to the distillate
Jul 28, 2022
2d7f8c6
Fix some test
Aug 4, 2022
8303b35
Merge branch 'scikit_fix' into dbcan_fix
Aug 10, 2022
d36f381
Read config location from environment variable
merrygoat Aug 15, 2022
3222d72
Remove customs and fix tests
Aug 31, 2022
22d2e53
Merge pull request #202 from merrygoat/custom_config_location
rmFlynn Aug 31, 2022
2b260ef
Git the package fix in circleci
Aug 31, 2022
fb60589
fix one test
Aug 31, 2022
6647934
Change all FTP URLs to HTTP(s) to avoid firewall/ACL issues
mrobbert Sep 1, 2022
ed49ee4
Fix lib call and add config path test
Sep 2, 2022
325263f
Merge branch 'master' into dram1.4rc
rmFlynn Sep 2, 2022
99a1b9e
Merge branch 'add_http_option' into master
rmFlynn Sep 9, 2022
e1196cf
Merge pull request #206 from mrobbert/master
rmFlynn Sep 9, 2022
5e599a9
Literally deleted one trailing space
Sep 9, 2022
fcbb6d5
Add the option for https or any alt link in download
Sep 12, 2022
4501b6d
clean the code a touch for release
Sep 14, 2022
1cf6297
Merge branch 'add_http_option' into dram1.4rc
Sep 14, 2022
2534714
I am once again asking for circleci to run
Sep 14, 2022
5a49160
tweek config
rmFlynn Sep 14, 2022
c002e2f
Updated config.yml
rmFlynn Sep 15, 2022
af02253
Updated config.yml
rmFlynn Sep 15, 2022
f53383f
Updated config.yml
rmFlynn Sep 15, 2022
45ee5af
Updated config.yml
rmFlynn Sep 15, 2022
641b3f4
Updated config.yml
rmFlynn Sep 15, 2022
a3bb318
Updated config.yml
rmFlynn Sep 15, 2022
166001f
Updated config.yml
rmFlynn Sep 15, 2022
cb4f1fb
Updated config.yml
rmFlynn Sep 15, 2022
ed7d464
Bump dbcan to 11
Sep 14, 2022
7e1f58b
Bring public DRAM up to speck, with in house
Sep 22, 2022
a1b0a41
Remove Sulphur legacy from distilate
Sep 22, 2022
5cfb041
Fix one test for one less log
Sep 22, 2022
db73227
Tweak version
Sep 22, 2022
4fd79b6
Last minute edits
Sep 26, 2022
115c6be
Remove some todos and add citation
Sep 28, 2022
6a5f306
Update README
Sep 28, 2022
31 changes: 18 additions & 13 deletions .circleci/config.yml
@@ -8,14 +8,12 @@ version: 2.1
jobs:
build-and-test:
docker:
- image: ubuntu:focal
- image: cimg/base:2022.09
steps:
- checkout
- run:
name: Setup Miniconda
command: |
apt update
apt install -y wget
cd $HOME
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
export MINICONDA_PREFIX="$HOME/miniconda"
@@ -27,22 +25,29 @@ jobs:
conda config --add channels conda-forge
conda info -a
- run:
name: Run tests in enviroment
name: More conda stuff
# This assumes pytest is installed via the install-package step above
command: |
export PATH="$HOME/miniconda/bin:$PATH"
conda update -y conda
conda create -n DRAM python=3.9
source activate DRAM
conda install pandas pytest pandas pytest scikit-bio scipy<=1.8.1 prodigal mmseqs2!=10.6d92c hmmer!=3.3.1 trnascan-se >=2 sqlalchemy barrnap altair >=4 openpyxl networkx ruby parallel dram
pytest tests/test_annotate_bins.py
pytest tests/test_annotate_vgfs.py
pytest tests/test_database_handler.py
pytest tests/test_database_processing.py
pytest tests/test_database_setup.py
pytest tests/test_summarize_genomes.py
pytest tests/test_summarize_vgfs.py
pytest tests/test_utils.py
conda install pandas pytest scikit-bio "scipy==1.8.1" prodigal "mmseqs2!=10.6d92c" "hmmer!=3.3.1" "trnascan-se>=2" sqlalchemy barrnap "altair>=4" openpyxl networkx ruby parallel pip
pip3 install ./
- run:
name: Run tests in environment
# This assumes pytest is installed via the install-package step above
command: |
source $HOME/miniconda/bin/activate DRAM
pytest
# pytest tests/test_annotate_bins.py
# pytest tests/test_annotate_vgfs.py
# pytest tests/test_database_handler.py
# pytest tests/test_database_processing.py
# pytest tests/test_database_setup.py
# pytest tests/test_summarize_genomes.py
# pytest tests/test_summarize_vgfs.py
# pytest tests/test_utils.py
# Invoke jobs via workflows
workflows:
all-tests:
96 changes: 70 additions & 26 deletions README.md
@@ -1,39 +1,81 @@
# DRAM
[![CircleCI](https://circleci.com/gh/WrightonLabCSU/DRAM/tree/master.svg?style=svg)](https://circleci.com/gh/WrightonLabCSU/DRAM/tree/master)

DRAM (Distilled and Refined Annotation of Metabolism) is a tool for annotating metagenomic assembled genomes and [VirSorter](https://github.com/simroux/VirSorter) identified viral contigs. DRAM annotates MAGs and viral contigs using [KEGG](https://www.kegg.jp/) (if provided by the user), [UniRef90](https://www.uniprot.org/), [PFAM](https://pfam.xfam.org/), [dbCAN](http://bcb.unl.edu/dbCAN2/), [RefSeq viral](https://www.ncbi.nlm.nih.gov/genome/viruses/), [VOGDB](http://vogdb.org/) and the [MEROPS](https://www.ebi.ac.uk/merops/) peptidase database as well as custom user databases. DRAM is run in two stages. First an annotation step to assign database identifiers to gene and then a distill step to curate these annotations into useful functional categories. Additionally viral contigs are further analyzed during to identify potential AMGs. This is done via assigning an auxiliary score and flags representing the confidence that a gene is both metabolic and viral.
DRAM (Distilled and Refined Annotation of Metabolism) is a tool for annotating metagenome-assembled genomes (MAGs) and [VirSorter](https://github.com/simroux/VirSorter) identified viral contigs. DRAM annotates MAGs and viral contigs using [KEGG](https://www.kegg.jp/) (if provided by the user), [UniRef90](https://www.uniprot.org/), [PFAM](https://pfam.xfam.org/), [dbCAN](http://bcb.unl.edu/dbCAN2/), [RefSeq viral](https://www.ncbi.nlm.nih.gov/genome/viruses/), [VOGDB](http://vogdb.org/) and the [MEROPS](https://www.ebi.ac.uk/merops/) peptidase database as well as custom user databases. DRAM is run in two stages: first an annotation step to assign database identifiers to genes, and then a distill step to curate these annotations into useful functional categories. Additionally, viral contigs are further analyzed to identify potential AMGs. This is done by assigning an auxiliary score and flags representing the confidence that a gene is both metabolic and viral.

For more detail on DRAM and how DRAM works please see our [paper](https://academic.oup.com/nar/article/48/16/8883/5884738) as well as the [wiki](https://github.com/shafferm/DRAM/wiki).
For information on how DRAM is changing, please read the [release note](https://github.com/WrightonLabCSU/DRAM/releases/latest)

## Installation
To install DRAM some dependencies need to be installed first then DRAM can be installed from this repository. In the future DRAM will be available from conda. Dependencies can be installed via conda or manually.

## Getting Started Part 1: Installation

**NOTE** If you already have an old release of DRAM installed and just want to upgrade, then please read the set-up step before you remove your old environment.

To install DRAM you must also install some dependencies. The easiest way to install both DRAM and its dependencies is to use [conda](https://docs.conda.io/en/latest/miniconda.html), but you can also follow the manual instructions or, if you are an adventurer, install a release candidate from this repository.

_Conda Installation_

Install DRAM into a new [conda](https://docs.conda.io/en/latest/) environment using the provided
Install DRAM into a new [conda](https://docs.conda.io/en/latest/) environment using the provided
environment.yaml file.
```bash
wget https://raw.githubusercontent.com/shafferm/DRAM/master/environment.yaml
conda env create -f environment.yaml -n DRAM
```
If this installation method is used then all further steps should be run inside the newly created DRAM environment. This environment can be activated using this command:
If this installation method is used, then all further steps should be run inside the newly created DRAM environment, or with the full path to the executable. To find these paths, run `which` with the environment active, e.g. `which DRAM.py`. This environment can be activated using this command:
```bash
conda activate DRAM
```

You have now installed DRAM, and are ready to set up the databases.

_Manual Installation_

If you do not install via a conda environment, then the dependencies [pandas](https://pandas.pydata.org/), [networkx](https://networkx.github.io/), [scikit-bio](http://scikit-bio.org/), [prodigal](https://github.com/hyattpd/Prodigal), [mmseqs2](https://github.com/soedinglab/mmseqs2), [hmmer](http://hmmer.org/) and [tRNAscan-SE](http://lowelab.ucsc.edu/tRNAscan-SE/) need to be installed manually. Then you can install DRAM using pip:
```bash
pip install DRAM-bio
```
Alternatively if you would like to install a development version of DRAM then you can install DRAM by cloning this repository and install using pip and the local repository.

You have now installed DRAM.
You have now installed DRAM, and are ready to set up the databases.

_Release Candidate Installation_

The latest version of DRAM is often a release candidate, and these are not pushed to PyPI or Bioconda, so they can't be installed with the methods above. You can tell if there is currently a release candidate by reading the [release note](https://github.com/WrightonLabCSU/DRAM/releases/latest).

To install a potentially unstable release candidate, follow the instructions below. Note the comments within the code block, as each command must be run in the context they describe.

```bash
# Clone the git repository and move into it
git clone https://github.com/WrightonLabCSU/DRAM.git
cd DRAM
# Install dependencies, this will also install a stable version of DRAM that will then be replaced.
conda env create --name my_dram_env -f environment.yaml
conda activate my_dram_env
# Install pip
conda install pip
pip3 install ./
```

You have now installed DRAM, and are ready to set up the databases.


## Setup
## Getting Started Part 2: Setup Databases

To run DRAM you need to set up the required databases in order to get annotations. All databases except for KEGG can be downloaded and set up for use with DRAM for you automatically. In order to get KEGG gene annotations, you must have access to the KEGG database. KEGG is a paid subscription service to download the protein files used by this annotator. If you do not have access to KEGG then DRAM will automatically use the [KOfam](https://www.genome.jp/tools/kofamkoala/) HMM database to get KEGG Orthology identifiers.
_I Want to Use Already Set Up Databases_

If you already installed and set up a previous version of DRAM and want to use your old databases, then you can do so in two steps.

Activate your old DRAM environment, and save your old config:

```bash
conda activate my_old_env
DRAM-setup.py export_config > my_old_config.txt
```

Activate your new DRAM environment, and import your old databases:

```bash
conda activate my_new_env
DRAM-setup.py import_config --config_loc my_old_config.txt
```

_I have access to KEGG_

@@ -56,26 +98,26 @@ DRAM-setup.py prepare_databases --output_dir DRAM_data
Similar to above you can still provide locations of databases you have already downloaded so you don't have to do it
again.

To test that your set up worked use the command `DRAM-setup.py print_config` and the location of all databases provided
To test that your set up worked use the command `DRAM-setup.py print_config` and the location of all databases provided
will be shown as well as the presence of additional annotation information.

*NOTE:* Setting up DRAM can take a long time (up to 5 hours) and uses a large about of memory (512 gb) by default. To
*NOTE:* Setting up DRAM can take a long time (up to 5 hours) and uses a large amount of memory (512 gb) by default. To
use less memory you can use the `--skip_uniref` flag which will reduce memory usage to ~64 gb if you do not provide KEGG
Genes and 128 gb if you do. Depending on the number of processors which you tell it to use (using the `--threads`
Genes and 128 gb if you do. Depending on the number of processors which you tell it to use (using the `--threads`
argument) and the speed of your internet connection. On a less than 5 year old server with 10 processors it takes about
2 hours to process the data when databases do not need to be downloaded.

## Usage
## Getting Started Part 3: Usage

Once DRAM is set up you are ready to annotate some MAGs. The following command will generate your full annotation:
Once DRAM is set up you are ready to annotate some MAGs. The following command will generate your full annotation:

```bash
DRAM.py annotate -i 'my_bins/*.fa' -o annotation
```

`my_bins` should be replaced with the path to a directory which contains all of your bins you would like to annotated and `.fa` should be replaced with the file extension used for your bins (i.e. `.fasta`, `.fna`, etc). If you only need to annotated a single genome (or an entire assembly) a direct path to a nucleotide fasta should be provided. Using 20 processors DRAM.py takes about 17 hours to annotate ~80 MAGs of medium quality or higher from a mouse gut metagenome.
`my_bins` should be replaced with the path to a directory which contains all of the bins you would like to annotate, and `.fa` should be replaced with the file extension used for your bins (e.g. `.fasta`, `.fna`, etc). If you only need to annotate a single genome (or an entire assembly), a direct path to a nucleotide fasta should be provided. Using 20 processors, DRAM.py takes about 17 hours to annotate ~80 MAGs of medium quality or higher from a mouse gut metagenome.

In the output `annotation` folder there will be various files. `genes.faa` and `genes.fna` are fasta files with all genes called by prodigal with additional header information gained from the annotation as nucleotide and amino acid records respectively. `genes.gff` is a GFF3 with the same annotation information as well as gene locations. `scaffolds.fna` is a collection of all scaffolds/contigs given as input to `DRAM.py annotate` with added bin information in the headers. `annotations.tsv` is the most important output of the annotation. This includes all annotation information about every gene from all MAGs. Each line is a different gene and each column contains annotation information. `trnas.tsv` contains a summary of the tRNAs found in each MAG.
In the output `annotation` folder, there will be various files. `genes.fna` and `genes.faa` are fasta files with all genes called by prodigal, with additional header information gained from the annotation, as nucleotide and amino acid records respectively. `genes.gff` is a GFF3 with the same annotation information as well as gene locations. `scaffolds.fna` is a collection of all scaffolds/contigs given as input to `DRAM.py annotate` with added bin information in the headers. `annotations.tsv` is the most important output of the annotation. This includes all annotation information about every gene from all MAGs. Each line is a different gene and each column contains annotation information. `trnas.tsv` contains a summary of the tRNAs found in each MAG.
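
As a quick sanity check after annotation, the file list above can be turned into a small shell helper. This is only a sketch: `check_annotation_dir` is a hypothetical function, not part of DRAM, and it assumes every file named above is written on a successful run, which may not hold for every input (for example, if no tRNAs are found).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of DRAM): report which of the output files
# described above are absent from an annotation directory.
check_annotation_dir() {
  local dir=$1 missing=0 f
  for f in genes.faa genes.fna genes.gff scaffolds.fna annotations.tsv trnas.tsv; do
    if [[ ! -e "$dir/$f" ]]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Exit status 0 when every expected file is present, 1 otherwise.
  return "$missing"
}
```

Calling `check_annotation_dir annotation` before running the distill step gives an early warning if the annotation run did not finish cleanly.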

Then after your annotation is finished you can summarize these annotations with the following command:

@@ -85,15 +127,17 @@ DRAM.py distill -i annotation/annotations.tsv -o genome_summaries --trna_path an
This will generate the distillate and liquor files.

## System Requirements
DRAM has a large memory burden and is designed to be run on high performance computers. DRAM annotates against a large
variety of databases which must be processed and stored. Setting up DRAM with KEGG Genes and UniRef90 will take up ~500
GB of storage after processing and require ~512 GB of RAM while using KOfam and skipping UniRef90 will mean all
processed databases will take up ~30 GB on disk and will only use ~128 GB of RAM while processing. DRAM annotation
memory usage depends on the databases used. When annotating with UniRef90 around 220 GB of RAM is required. If the KEGG
gene database has been provided and UniRef90 is not used then memory usage is around 100 GB of RAM. If KOfam is used to
annotate KEGG and UniRef90 is not used then less than 50 GB of RAM is required. DRAM can be run with any number of
DRAM has a large memory burden and is designed to be run on high performance computers. DRAM annotates against a large
variety of databases which must be processed and stored. Setting up DRAM with KEGG Genes and UniRef90 will take up ~500
GB of storage after processing and require ~512 GB of RAM while using KOfam and skipping UniRef90 will mean all
processed databases will take up ~30 GB of disk and will only use ~128 GB of RAM while processing. DRAM annotation
memory usage depends on the databases used. When annotating with UniRef90, around 220 GB of RAM is required. If the KEGG
gene database has been provided and UniRef90 is not used, then memory usage is around 100 GB of RAM. If KOfam is used to
annotate KEGG and UniRef90 is not used, then less than 50 GB of RAM is required. DRAM can be run with any number of
processors on a single node.
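
To make the annotation memory figures above easier to apply, here is a minimal sketch that maps a machine's RAM to the heaviest database combination those figures suggest it can handle. The function name `annotation_profile` and its labels are hypothetical, and the thresholds are the approximate values quoted above rather than hard limits enforced by DRAM.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of DRAM): pick the heaviest annotation setup
# that fits in the given amount of RAM (in GB), using the approximate
# figures quoted in the paragraph above.
annotation_profile() {
  local ram_gb=$1
  if (( ram_gb >= 220 )); then
    echo "uniref90"              # annotating with UniRef90: ~220 GB
  elif (( ram_gb >= 100 )); then
    echo "kegg-genes-no-uniref"  # KEGG gene database, no UniRef90: ~100 GB
  elif (( ram_gb >= 50 )); then
    echo "kofam-no-uniref"       # KOfam, no UniRef90: under 50 GB
  else
    echo "insufficient"
  fi
}

# On Linux, read total memory from /proc/meminfo (reported in kB).
if [[ -r /proc/meminfo ]]; then
  total_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  echo "Total RAM: ${total_gb} GB -> $(annotation_profile "$total_gb")"
fi
```

The check uses total rather than free memory, so treat the result as an upper bound on a shared node.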

## Citing DRAM
The DRAM was published in Nucleic Acids Research in 2020 and is availabe [here](https://academic.oup.com/nar/article/48/16/8883/5884738). If
DRAM helps you out in your research please cite it.
DRAM was published in Nucleic Acids Research in 2020 and is available [here](https://academic.oup.com/nar/article/48/16/8883/5884738). If
DRAM helps you out in your research, please cite it.

