Releases: jolespin/veba
VEBA_v2.3.0
- [2024.9.21] - Added KEGG Pathway Profiler to VEBA-database_env and VEBA-annotate_env, which replaces MicrobeAnnotator-KEGG for module completion ratios. The ${VEBA_DATABASE}/Annotate/MicrobeAnnotator-KEGG database files are replaced with ${VEBA_DATABASE}/Annotate/KEGG-Pathway-Profiler/. Note: The new module completion ratio output does not have class labels for KEGG modules.
- [2024.8.30] - Added ${N_JOBS} to download scripts with default set to maximum threads available
VEBA_v2.2.1
- [2024.8.29] - Added VERSION file created in download_databases.sh
- [2024.7.11] - Alignment fraction threshold for genome clustering was only applied to the reference but should also apply to the query. Added --af_mode with either relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af or strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af) to edgelist_to_clusters.py, global_clustering.py, local_clustering.py, and cluster.py.
- [2024.7.3] - Added pigz to VEBA-annotate_env, which isn't a problem with most conda installations but is needed for docker containers.
- [2024.6.21] - Changed choose_fastest_mirror.py to determine_fastest_mirror.py
- [2024.6.20] - Added -m/--include_mrna to compile_metaeuk_identifiers.py for Issue #110
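The two --af_mode rules from the [2024.7.11] entry can be sketched as follows. This is a minimal illustration of the formulas stated above, not the code in edgelist_to_clusters.py; the function name passes_af_filter and its argument names are hypothetical.

```python
def passes_af_filter(af_ref, af_query, minimum_af, mode="relaxed"):
    """Return True if an edge passes the alignment fraction filter.

    Hypothetical helper mirroring the formulas in the release note:
      relaxed: max(af_ref, af_query) > minimum_af
      strict:  af_ref > minimum_af and af_query > minimum_af
    """
    if mode == "relaxed":
        return max(af_ref, af_query) > minimum_af
    if mode == "strict":
        return (af_ref > minimum_af) and (af_query > minimum_af)
    raise ValueError(f"Unknown mode: {mode!r}")


# One genome aligns well (45%), the other poorly (20%), threshold of 30%
print(passes_af_filter(45.0, 20.0, 30.0, mode="relaxed"))  # True
print(passes_af_filter(45.0, 20.0, 30.0, mode="strict"))   # False
```

The strict mode only keeps an edge when both genomes exceed the threshold, so it can only produce the same or fewer edges than relaxed mode.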
VEBA_v2.2.0
Disclaimer:
I made some large updates in this version and I believe everything has been adequately tested, but in case anything has slipped through the cracks you can use v2.1.0, which has been thoroughly tested in accordance with the NAR Espinoza 2024 paper. Benefits of using this version include much faster and more robust prokaryotic classifications and fast, scalable HMM-based annotation modeling.
Large performance updates for this version include:
- Updated GTDB-Tk 2.3.0 -> 2.4.0, which means the GTDB needed to be updated from r214.1 -> r220
- VEBA-classify_env was split up into VEBA-classify-eukaryotic_env, VEBA-classify-prokaryotic_env, and VEBA-prokaryotic_env
- annotate.py, classify-eukaryotic.py, and phylogeny.py (and their utility scripts) were rewritten to use PyHMMER (pyhmmsearch and pykofamsearch), which is faster than HMMSearch when multithreaded.
- KOFAM was changed to KOfam
NOTE: Please don't use the tar.gz as it contains the 2.1.0 version for some reason:
VERSION="2.2.0"
# wget https://github.com/jolespin/veba/archive/refs/tags/v${VERSION}.tar.gz # The .tar.gz is out of date in this release
# tar -xvf v${VERSION}.tar.gz && mv veba-${VERSION} veba
# Alternative download
wget https://github.com/jolespin/veba/releases/download/v${VERSION}/v${VERSION}.zip
unzip -d veba v${VERSION}.zip
VEBA_v2.1.0-zen
This is the exact same version as VEBA_v2.1.0. New VEBA releases will now automatically be synced to Zenodo.
VEBA_v2.1.0
Official release of VEBA v2.1.0 with updates to address peer reviewers. Mostly documentation but also including the following:
- [2024.4.30] - Added concatenate_files.py which can concatenate files (and mixed compressed/decompressed files) using either arguments, a list file, or a glob. The reason for this is that Unix has a limit on the number of arguments that can be used (e.g., cat *.fasta > output.fasta will crash when *.fasta expands to 50k files).
- [2024.4.29] - Added /volumes/workspace/ directory to Docker containers for situations when your input and output directories are the same.
- [2024.4.29] - featureCounts can only handle 64 threads at a time so added min(64, opts.n_jobs) for all the modules/scripts that use featureCounts commands.
- [2024.4.23] - Added uniprot_to_enzymes.py which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A*
- [2024.4.18] - Developed a faster CLI implementation of KofamScan called PyKofamSearch which leverages PyHmmer. This will be used in future versions of VEBA.
- [2024.4.18] - Developed a faster CLI implementation of HMMSearch called PyHMMSearch which leverages PyHmmer. This will be used in future versions of VEBA.
- [2024.3.26] - Added --metaeuk_split_memory_limit to metaeuk_wrapper.py.
- [2024.3.26] - Added -d/--genome_identifier_directory_index to scaffolds_to_bins.py for directories that are structured path/to/genomes/bin_a/reference.fasta where you would use -d -2.
- [2024.3.26] - Added --minimum_af to edgelist_to_clusters.py with an option to accept 4 column inputs [id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]. global_clustering.py, local_clustering.py, and cluster.py now use this by default with --af_threshold 30.0. If you want to retain previous behavior, just use --af_threshold 0.0.
- [2024.3.18] - edgelist_to_clusters.py only includes edges where both nodes are in the identifiers set. If --identifiers are provided, then only those identifiers are used. If not, then it includes all nodes.
- [2024.3.18] - Added --export_representatives argument for edgelist_to_clusters.py to output a table with [id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]. Also includes this information in nx.Graph objects.
- [2024.3.18] - Changed singleton weight to np.nan instead of np.inf for edgelist_to_clusters.py to allow for representative calculations.
- YouTube channel (https://www.youtube.com/@VEBA-Multiomics)
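The motivation for concatenate_files.py in the [2024.4.30] entry can be illustrated with a short sketch. This is not the script's implementation; the function name concatenate and the choice to detect gzip files by their .gz suffix are assumptions made for the example. Because the glob is expanded inside Python, no long argument list is ever passed through the shell.

```python
import glob
import gzip
import shutil


def concatenate(pattern, output_path):
    """Concatenate all files matching a glob pattern into one uncompressed file.

    Hypothetical sketch: files ending in .gz are decompressed on the fly and
    everything else is copied as-is. Returns the number of files concatenated.
    """
    # Expand the pattern before the output file is created
    paths = sorted(glob.glob(pattern))
    with open(output_path, "wb") as f_out:
        for path in paths:
            # Assumption: compression is inferred from the file extension
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rb") as f_in:
                shutil.copyfileobj(f_in, f_out)
    return len(paths)
```

A call such as concatenate("genomes/*.fasta*", "merged/output.fasta") would cover both plain and gzipped inputs. Writing the output outside the globbed directory avoids picking up a stale output file on a rerun.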
VEBA_v2.1.0b (pre-release)
Beta release of VEBA v2.1.0b with updates to address peer reviewers. Mostly documentation but also including the following:
- [2024.4.30] - Added concatenate_files.py which can concatenate files (and mixed compressed/decompressed files) using either arguments, a list file, or a glob. The reason for this is that Unix has a limit on the number of arguments that can be used (e.g., cat *.fasta > output.fasta will crash when *.fasta expands to 50k files).
- [2024.4.29] - Added /volumes/workspace/ directory to Docker containers for situations when your input and output directories are the same.
- [2024.4.29] - featureCounts can only handle 64 threads at a time so added min(64, opts.n_jobs) for all the modules/scripts that use featureCounts commands.
- [2024.4.23] - Added uniprot_to_enzymes.py which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A*
- [2024.4.18] - Developed a faster implementation of KofamScan called PyKofamSearch which leverages PyHmmer. This will be used in future versions of VEBA.
- [2024.3.26] - Added --metaeuk_split_memory_limit to metaeuk_wrapper.py.
- [2024.3.26] - Added -d/--genome_identifier_directory_index to scaffolds_to_bins.py for directories that are structured path/to/genomes/bin_a/reference.fasta where you would use -d -2.
- [2024.3.26] - Added --minimum_af to edgelist_to_clusters.py with an option to accept 4 column inputs [id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]. global_clustering.py, local_clustering.py, and cluster.py now use this by default with --af_threshold 30.0. If you want to retain previous behavior, just use --af_threshold 0.0.
- [2024.3.18] - edgelist_to_clusters.py only includes edges where both nodes are in the identifiers set. If --identifiers are provided, then only those identifiers are used. If not, then it includes all nodes.
- [2024.3.18] - Added --export_representatives argument for edgelist_to_clusters.py to output a table with [id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]. Also includes this information in nx.Graph objects.
- [2024.3.18] - Changed singleton weight to np.nan instead of np.inf for edgelist_to_clusters.py to allow for representative calculations.
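The -d/--genome_identifier_directory_index option in the [2024.3.26] entry selects which path component names the genome. A minimal sketch of that indexing, assuming the index is applied to the path split into its components (the helper name genome_id_from_path is hypothetical, and this is not the code in scaffolds_to_bins.py):

```python
from pathlib import PurePosixPath


def genome_id_from_path(path, directory_index):
    """Return the path component at directory_index (negative counts from the end)."""
    return PurePosixPath(path).parts[directory_index]


# With -d -2, the genome identifier is the parent directory of the fasta file
print(genome_id_from_path("path/to/genomes/bin_a/reference.fasta", -2))  # bin_a
```

An index of -1 would return the file name itself (reference.fasta), which is why layouts with one directory per genome call for -2.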
VEBA_v2.0.0
- Changed default assembly algorithm to metaflye instead of flye in assembly-long.py
- Added number_of_genomes, number_of_genome-clusters, number_of_proteins, and number_of_protein-clusters to feature_compression_ratios.tsv.gz from cluster.py
- Added -A/--from_antismash in biosynthetic.py to use preexisting antiSMASH results. Also changed -i/--input to -i/--from_genomes.
- Changed antismash_genbanks_to_table.py to biosynthetic_genbanks_to_table.py for future support of DeepBGC and GECCO
- Added busco_version parameter to merge_busco_json.py with default set to 5.4.x and additional support for 5.6.x.
- Added CONDA_ENVS_PATH to update_environment_scripts.sh, update_environment_variables.sh, and check_installation.sh
- Added CONDA_ENVS_PATH to veba to allow for custom environment locations
- Changed install.sh to support a custom CONDA_ENVS_PATH argument: bash install.sh path/to/log path/to/envs/
- Added merge_counts_with_taxonomy.py
VEBA_v1.5.0
Warning:
For this release, use the https://github.com/jolespin/veba/releases/download/v1.5.0/v1.5.0.zip asset, not the "Source code" assets, as those are out of date.
Release v1.5.0 Highlights:
- Added VeryFastTree to phylogeny.py
- Added --blacklist to compile_eukaryotic_classifications.py
- Added compatibility for antismash_genbanks_to_table.py to operate on antiSMASH v7 genbanks
- Added compile_phylogenomic_functional_categories.py script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
- Fixed error in annotations.protein_clusters.tsv formatting from annotate.py
- Fixed situation where unbinned.fasta was not added in binning-prokaryotic.py and bad symlinks were created for GFF, rRNA, and tRNA when no genomes were detected.
- Fixed critical error where classify_eukaryotic.py was trying to access a deprecated database file from MicroEuk_v2.
Release v1.5.0 Details
- Cleaned up installation files
- Changed veba/src/ to veba/bin/
- Changed SCRIPT_VERSIONS to VEBA_SCRIPT_VERSIONS which are now in bin/ of conda environment
- Fixed header being offset in annotations.protein_clusters.tsv where it could not be read with Pandas.
- Fixed the creation of non-existent symlinks in binning-prokaryotic.py where '*.gff', '*.rRNA', and '*.tRNA' were created.
- Fixed .strip method on Pandas series in antismash_genbanks_to_table.py for compatibility with antiSMASH 6 and 7
- Fixed situation where unbinned.fasta is empty in binning-prokaryotic.py when there are no bins that pass QC.
- Fixed minor error in coverage.py where samtools sort --reference was getting reads_table.tsv and not reference.fasta
- Changed default behavior from deterministic to not deterministic for increase in speed in assembly-long.py (i.e., --no_deterministic --> --deterministic).
- Added VeryFastTree as an option to phylogeny.py with FastTree remaining as the default.
- Changed default --leniency parameter on classify_eukaryotic.py and consensus_genome_classification_ranked.py to 1.0 and added --leniecy_genome_classification as a separate option.
- Added --blacklist option to compile_eukaryotic_classifications.py with a default value of species:uncultured eukaryote in classify_eukaryotic.py
- Fixed critical error where classify_eukaryotic.py was trying to access a deprecated database file from MicroEuk_v2.
- Fixed minor error with eukaryotic_gene_modeling_wrapper.py not allowing for Tiara to run in backend.
- Added compile_phylogenomic_functional_categories.py script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
VEBA_v1.4.2
- [2023.12.21] - GTDB-Tk changed the name of the archaea summary file so VEBA was not adding this to the final classification. Fixed this in classify-prokaryotic.py.
- [2023.12.20] - Fixed files not being closed in compile_custom_humann_database_from_annotations.py and added options to use different annotation file formats (i.e., multilevel, header, and no header).
VEBA_v1.4.1
Release v1.4.1 Highlights:
- VEBA Modules:
  - Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database for taxonomic abundance.
  - Added long read support for fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
  - Redesigned binning-eukaryotic module to handle custom MetaEuk databases
  - Added new usage syntax veba --module preprocess --params "${PARAMS}" where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
  - Added skani which is the new default for genome-level clustering based on ANI.
  - Added Diamond DeepClust as an alternative to MMSEQS2 for protein clustering.
- VEBA Database (VDB_v6):
  - Completely rebuilt VEBA's Microeukaryotic Protein Database to produce a clustered database MicroEuk100/90/50 similar to UniRef100/90/50. Available on doi:10.5281/zenodo.10139450.
  - Number of sequences:
    - MicroEuk100 = 79,920,431 (19 GB)
    - MicroEuk90 = 51,767,730 (13 GB)
    - MicroEuk50 = 29,898,853 (6.5 GB)
  - Number of source organisms per dataset:
    - MycoCosm = 2503
    - PhycoCosm = 174
    - EnsemblProtists = 233
    - MMETSP = 759
    - TARA_SAGv1 = 8
    - EukProt = 366
    - EukZoo = 27
    - TARA_SMAGv1 = 389
    - NR_Protists-Fungi = 48217
-