Discover and annotate the virome.
Works on your laptop or HPC (compatible with MacOS and Linux)
Cenote-Taker 3 is a virus bioinformatics tool that scales from individual genome sequences to massive metagenome assemblies to:
- Identify sequences containing genes specific to viruses (virus hallmark genes)
- Annotate virus sequences, including:
  - a) adaptive ORF calling
  - b) a large catalog of HMMs from virus gene families for functional annotation
  - c) hierarchical taxonomy assignment based on hallmark genes
  - d) mmseqs2-based CDD database search
  - e) tabular (.tsv) and interactive genome map (.gbf) outputs
Cenote-Taker 3 is also very fast: many times faster than Cenote-Taker 2 on large datasets, and faster than comparable annotation with pharokka while assigning functions to more virus genes (in my hands).
Image of example genome map:
Use cases:
- Discovering virus contigs in metagenomic data
- Annotating virus sequences without a highly similar, well-annotated reference
- Finding prophages (or proviruses) in microbial genomes
- Not for read-level classification of known viruses (see Marker-MAGu or EsViritu for this task)
- Not ideal for annotating virus genomes that are highly similar to known references (e.g. phage lambda with a few mutations)
Most recent versions
Cenote-Taker 3 scripts: v3.3.2
Cenote-Taker 3 Databases: v3.1.1
This should work on MacOS and Linux
Versions used in test installations
mamba 1.5.8
conda 24.7.1
`mamba` is better and faster than `conda` for almost all solving and installation tasks.
- Use `mamba` to install the bioconda package.
macOS (specify the `osx-64` platform regardless of which chip you have):
mamba create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
Linux:
mamba create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
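The platform choice above can be scripted. The following is a minimal sketch that prints the matching `mamba create` command for the current operating system; `ct3_create_cmd` is a hypothetical helper name, not part of Cenote-Taker 3, and the two commands are copied from the instructions above.

```shell
# Print the environment-creation command for this OS (nothing is executed).
# An OS name can be passed explicitly; otherwise `uname -s` is used.
ct3_create_cmd() {
    os=${1:-$(uname -s)}
    case "$os" in
        Darwin)
            # macOS: osx-64 is requested regardless of the chip
            echo "mamba create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2"
            ;;
        Linux)
            echo "mamba create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2"
            ;;
        *)
            echo "unsupported OS: $os" >&2
            return 1
            ;;
    esac
}
```

Running `ct3_create_cmd` with no argument prints the command to review before pasting it into the terminal.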
Using conda instead
macOS (specify the `osx-64` platform regardless of which chip you have):
conda create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
Linux:
conda create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.3.2
- Activate the conda environment.
conda activate ct3_env
You should now be able to type `cenotetaker3` or `get_ct3_dbs` in the terminal to bring up the help menu.
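To confirm that both entry points are actually on `PATH` after activation, a small check like the one below can help. This is only a sketch; `check_cli` is a hypothetical helper name and relies on nothing beyond the shell built-in `command -v`.

```shell
# Report which of the given commands are on PATH.
# Returns 0 when all are found, 1 if any is missing.
check_cli() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Inside the activated environment, `check_cli cenotetaker3 get_ct3_dbs` should report both commands as found.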
- Change to a directory where you'd like to install the databases and run the database script, specifying the DB directory with `-o`.
Total DB size is 3.0 GB after decompression.
cd ..
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
With optional hhsuite databases:
Warning: due to inconsistent server speed, these downloads may take over 2 hours.
You may download one or more of the hhsuite DBs.
Their data footprints are:
Database | Size |
---|---|
CDD | 6.1 GB |
pfam | 4.6 GB |
pdb70 | 56 GB |
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T --hhCDD T --hhPFAM T --hhPDB T
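Because the full set is large (3.0 GB for the standard databases plus 6.1, 4.6 and 56 GB for the hhsuite ones), it may be worth checking free disk space before starting the download. The sketch below uses hypothetical helper names (`sum_gb`, `has_free_gb`) and assumes the sizes above are final on-disk footprints; any extra temporary space needed during download and decompression is not accounted for.

```shell
# Add up sizes given in GB, one per argument.
sum_gb() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.1f\n", s }'
}

# Succeed when the filesystem holding DIR has at least NEED_GB available.
# df -Pk prints available space in 1024-byte blocks in column 4.
has_free_gb() {
    dir=$1
    need_gb=$2
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    awk -v a="$avail_kb" -v n="$need_gb" 'BEGIN { exit !(a / 1024 / 1024 >= n) }'
}
```

For example, `has_free_gb . "$(sum_gb 3.0 6.1 4.6 56)" || echo "not enough space here"` checks the current directory against the total for all databases.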
- Set the database directory as a conda environment variable, then reactivate the environment so that the variable takes effect.
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
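Variables set with `conda env config vars set` are only picked up when the environment is next activated, so a quick check that `CENOTE_DBS` is visible and points to a real directory can save a failed run. A minimal sketch, with `check_dbs_var` as a hypothetical helper name:

```shell
# Verify that CENOTE_DBS is set and points to an existing directory.
# Returns 1 if unset or empty, 2 if the directory is missing.
check_dbs_var() {
    if [ -z "${CENOTE_DBS:-}" ]; then
        echo "CENOTE_DBS is not set; try reactivating ct3_env" >&2
        return 1
    fi
    if [ ! -d "$CENOTE_DBS" ]; then
        echo "CENOTE_DBS points to a missing directory: $CENOTE_DBS" >&2
        return 2
    fi
    echo "CENOTE_DBS=$CENOTE_DBS"
}
```

`conda env config vars list` shows what is stored for the active environment if the check reports the variable as unset.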
- Clone this GitHub repo.
- Using `mamba` (package manager within `conda`) and the provided yaml file, make the environment:
mamba env create -f Cenote-Taker3/environment/ct3_env.yaml
- Activate the conda environment.
conda activate ct3_env
- Change to the repo and `pip` install the command line tool.
cd Cenote-Taker3
pip install .
You should now be able to type `cenotetaker3` or `get_ct3_dbs` in the terminal to bring up the help menu.
- Change to a directory where you'd like to install the databases and run the database script, specifying the DB directory with `-o`.
Total DB size is 3.0 GB after decompression.
cd ..
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
With optional hhsuite databases:
Warning: due to inconsistent server speed, these downloads may take over 2 hours.
You may download one or more of the hhsuite DBs.
Their data footprints are:
Database | Size |
---|---|
CDD | 6.1 GB |
pfam | 4.6 GB |
pdb70 | 56 GB |
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T --hhCDD T --hhPFAM T --hhPDB T
- Set the database directory as a conda environment variable, then reactivate the environment so that the variable takes effect.
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
Make sure the conda environment is activated before running any of the commands below.
cenotetaker3 -h
cenotetaker3 -c Cenote-Taker3/test_data/testcontigs_DNA_ct2.fasta -r test_ct3 -p T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T --lin_minimum_hallmark_genes 2
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3pr -p T --caller prodigal
cenotetaker3 -c my_virus_contigs.fna -r my_virs_ct3 -p F -am T
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T -db virion rdrp dnarep
cenotetaker3 -c my_metagenome_contigs.fna -r my_meta_ct3 -p T --reads my_reads/*fastq
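When several assemblies need the same settings, the commands above can be generated in a loop. The sketch below only prints commands (a dry run) using the `-c`, `-r` and `-p T` arguments shown above. The helper name `ct3_batch_cmds` and the run-title scheme (file name without extension, restricted to letters, digits and underscores, plus `_ct3`) are assumptions made for illustration, and input paths are assumed to contain no spaces.

```shell
# Print one cenotetaker3 command per contig file; nothing is executed.
ct3_batch_cmds() {
    for fasta in "$@"; do
        base=$(basename "$fasta")
        # Hypothetical naming scheme: strip the extension, replace any
        # character outside [A-Za-z0-9_] with "_", then append "_ct3".
        title=$(printf '%s' "${base%.*}" | tr -c 'A-Za-z0-9_' '_')_ct3
        printf 'cenotetaker3 -c %s -r %s -p T\n' "$fasta" "$title"
    done
}
```

Review the output of `ct3_batch_cmds assemblies/*.fna` first, then pipe it to `sh` to run the jobs one after another.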
{run_title}/
│   {run_title}_virus_summary.tsv <- main summary file for each virus
│   {run_title}_virus_sequences.fna <- all virus genome seqs
│   {run_title}_virus_AA.faa <- all virus AA seqs
│   {run_title}_prune_summary.tsv <- summary of pruning of each sequence
│   final_genes_to_contigs_annotation_summary.tsv <- annotation info, all genes
│   run_arguments.txt <- arguments used in this run
│   {run_title}_cenotetaker.log <- main log file
│
└───sequin_and_genome_maps/
│   │   {run_title}*gbf <- genome maps
│   │   {run_title}*fsa <- genome sequence
│   │   {run_title}*gtf <- feature table gtf format
│   │   {run_title}*tbl <- feature table sequin format
│   │   {run_title}*sqn <- non-human-readable sequin file for GenBank sub
│   │   {run_title}*cmt <- sequin comment file
│
└───ct_processing/
│   --- many intermediate files ---
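A finished run can be sanity-checked from the two main output files in the layout above. This sketch counts data rows in the summary table and records in the virus FASTA file. It assumes the summary `.tsv` begins with a single header line, which is not confirmed here, and `ct3_summarize` is a hypothetical helper name.

```shell
# Report row and record counts for a finished run directory.
# The run title is taken from the directory name, as in the layout above.
ct3_summarize() {
    run=${1%/}
    title=$(basename "$run")
    tsv="$run/${title}_virus_summary.tsv"
    fna="$run/${title}_virus_sequences.fna"
    if [ ! -f "$tsv" ] || [ ! -f "$fna" ]; then
        echo "expected outputs not found in $run" >&2
        return 1
    fi
    # Non-empty lines after the (assumed) header line
    rows=$(awk 'NR > 1 && NF' "$tsv" | wc -l | tr -d ' ')
    # FASTA records start with ">"
    seqs=$(grep -c '^>' "$fna" || true)
    printf 'summary_rows=%s fasta_records=%s\n' "$rows" "$seqs"
}
```

For the test run above, this would be called as `ct3_summarize test_ct3`.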
CheckV for virus genome completeness estimation.
BACPHLIP for phage lifestyle prediction (only use complete/near-complete phage genomes).
VContact3 for genome clustering and taxonomy.
iPHoP for prokaryotic virus host prediction.
Cenote-Taker 3 is under active development, so please open an issue if anything seems unusual or any errors occur. It's likely that I've not tested every parameter combination, and most bugs should be a simple fix.
- instructions for manual curation -> GenBank deposit of Cenote-Taker 3 output