bugphyzz is a database that harmonizes physiological and other microbial trait annotations from different sources using a controlled vocabulary and ontology terms. Furthermore, these annotations are propagated to uncharacterized microbes through Ancestral State Reconstruction (ASR).
You can learn more about this project here.
This repository contains the code for resolving conflicting annotations
and running the ASR step. It also contains the development (devel) version of the
annotations (before release on Zenodo), distributed across several text files.
The *.csv files contain the data in tabular format and are imported through
the bugphyzz::importBugphyzz
function in R.
The *.gmt files contain lists of microbial signatures in GMT format
created with the bugphyzz::makeSignatures
function.
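As a rough illustration of the GMT layout (one signature per line, tab-separated: a name, a description, then the members), the sketch below reads such a file into a named list using base R only. The helper read_gmt and the example contents are hypothetical and are not part of bugphyzz; the exact names and descriptions in the exported files may differ.

```r
# Minimal GMT reader: a sketch, not a bugphyzz function.
# Assumes each line is: name <TAB> description <TAB> member1 <TAB> member2 ...
read_gmt <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]                      # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  signatures <- lapply(fields, function(x) x[-(1:2)]) # keep members only
  names(signatures) <- vapply(fields, function(x) x[1], character(1))
  signatures
}

# Write a tiny, made-up GMT file to demonstrate the layout.
tmp <- tempfile(fileext = ".gmt")
writeLines(
  c("signature_A\tdescription A\t562\t1280",
    "signature_B\tdescription B\t1313"),
  tmp
)

sigs <- read_gmt(tmp)
lengths(sigs)  # number of members per signature
```

To inspect one of the exported files, pass its path to read_gmt in place of tmp.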
The data schema is described here.
The devel files are generated weekly.
If desired, anyone can generate the *.csv and *.gmt files.
The first step is downloading the repo:
git clone https://github.com/waldronlab/bugphyzzExports.git
cd bugphyzzExports
The packages listed in the dependencies vector below need to be installed in the R environment.
This can be done, for example, with:
## Inside an R session
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
dependencies <- c(
    "waldronlab/bugphyzz",
    "bugsigdbr",
    "castor",
    "dplyr",
    "logr",
    "phytools",
    "purrr",
    "rlang",
    "sessioninfo",
    "stringr",
    "waldronlab/taxPPro",
    "tibble",
    "tidyr"
)
BiocManager::install(dependencies)
Alternatively, run devtools::install_deps(dependencies = TRUE)
in an R session started in
the main directory of the repository.
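Before running the export script, it can help to confirm that every dependency actually loads. The sketch below is one way to do that with base R. The helper missing_packages is hypothetical (it is not provided by this repository), and it assumes that the package name is the last component of a GitHub-style "owner/repo" entry.

```r
# Report which packages from a dependency vector cannot be loaded.
# A sketch: missing_packages() is a made-up helper, not part of this repo.
missing_packages <- function(dependencies) {
  # "waldronlab/bugphyzz" -> "bugphyzz"; plain names are left unchanged
  pkg_names <- basename(dependencies)
  ok <- vapply(pkg_names, requireNamespace, logical(1), quietly = TRUE)
  pkg_names[!ok]
}

dependencies <- c(
  "waldronlab/bugphyzz", "bugsigdbr", "castor", "dplyr", "logr",
  "phytools", "purrr", "rlang", "sessioninfo", "stringr",
  "waldronlab/taxPPro", "tibble", "tidyr"
)

not_installed <- missing_packages(dependencies)
if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All dependencies are installed.")
}
```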
Run the script. It writes the output files to the directory from which it is run, so it is best run from the main directory of the project.
On a Linux-like terminal:
Rscript inst/scripts/export_bugphyzz.R
On supermicro (for internal use):
/usr/bin/Rscript --vanilla inst/scripts/export_bugphyzz.R
The files are available under Creative Commons Attribution 4.0 International.
Find this dataset on Zenodo (latest release version): https://zenodo.org/doi/10.5281/zenodo.10980653
Some recommendations about versioning for releases:
Format: x.y.z
Example: 1.0.2
The third digit (z) should be used to fix typos or any other minor trouble with the annotations. Essentially these are the same annotations, but with minor adjustments.
The second digit (y) should be used for major adjustments such as fixing the way conflicting annotations are handled or adjusting ASR methods/parameters, say choosing a different phylogenetic tree or using a different package for running ASR.
The first digit (x) should be reserved for major changes, such as adding new datasets or using a completely different approach for propagating annotations, etc.
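The x.y.z scheme above can be sketched as a small helper that increments one digit of a version string. This is only an illustration: bump_version is a hypothetical function, and resetting the lower digits to zero after bumping a higher one is a common convention assumed here, not something the recommendations above state.

```r
# Increment one component of an "x.y.z" version string.
# A sketch: bump_version() is hypothetical. Resetting the lower digits
# to 0 is an assumed convention, not a rule stated in this README.
bump_version <- function(version, level = c("z", "y", "x")) {
  level <- match.arg(level)
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  names(parts) <- c("x", "y", "z")
  parts[level] <- parts[level] + 1L
  if (level == "x") parts[c("y", "z")] <- 0L
  if (level == "y") parts["z"] <- 0L
  paste(parts, collapse = ".")
}

bump_version("1.0.2", "z")  # typo fix in the annotations: "1.0.3"
bump_version("1.0.2", "y")  # e.g. different ASR parameters: "1.1.0"
bump_version("1.0.2", "x")  # e.g. new datasets added: "2.0.0"
```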
A 10-fold cross-validation approach was used to estimate how well our ASR method performed for each attribute/physiology in the dataset. These validation results are not part of the annotations themselves, so they are not provided here. You can find them at https://github.com/waldronlab/taxPProValidation/ and use them to select the attributes with the best results. The validation values are also attached when importing the files with the bugphyzz package in R.