This repo hosts the complete PHASTER dataset from the 59 strains’ genomes analyzed on PHASTER and several scripts developed for its processing. Analysis of this processed data is provided in an on preparation manuscript. The primary packages used here are in the Tidyverse set of libraries, the rmarkdown and knitr packages were used to generate this report (Xie 2021; Allaire et al. 2021; Wickham et al. 2019).
The first step in this analysis is importing the summarized output that
was created using the populate_script()
after selecting each prophage
prediction using de datapasta
package (McBain et al. 2021).
Predicted prophages retrieved on 2021-02-19
df_summaries <- read_tsv("Data/Raw/2021-02-19_PHASTER-summaries.tsv")
…
Data from the the Biosynthetic Gene Clusters (BGCs) were retrieved for each species from different servers including antiSMASH v5.1.2 and a manual curation.
Prophage completeness code.
by_completeness_raw <- df_summaries %>%
group_by(species,genome) %>%
summarise(
Incomplete = sum(str_count(completeness, "incomplete")),
Questionable = sum(str_count(completeness, "questionable")),
Intact = sum(str_count(completeness, "intact"))
)
write_tsv(x = by_completeness_raw, "Data/Raw/completeness-01.tsv")
…
RM system count import.
rm_raw <- read_csv("Data/Raw/RMsystems-01.csv")
…
Finally all data was processed by adding extra information necessary for applying as an extra layer of phylogenetic information:
�[01;34mData/Processed�[00m
├── BGCs-iTOL-04.txt
├── Completeness-iTOL-05.txt
└── RM-iTOL-01.txt
0 directories, 3 files
For instance the BGCs data layer has the following information:
DATASET_BINARY
SEPARATOR TAB
DATASET_LABEL Metabolites-04
COLOR #ff0000
FIELD_SHAPES 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
STRIP_WIDTH 145
LABEL_SIZE_FACTOR 1
FIELD_LABELS Fengycins Iturins Surfactins Bacillibactin Bacilysin Anabaenopeptin Rhizocticin A Macrolactin H Bacillaene Difficidin Zwittermicin A Subtilin Mersacidin Sublancin 168 Lichenicidin Subtilosin A Plantazolicin SKF Paeninodin Subtilomycin Thiociclin I Petrobactin
COLOR_MIN #a4ffa4
COLOR_MID #61b368
COLOR_MAX #007500
For the purpose of wrap viral species into families we created this table assigning the current virus on the data set to its associated family:
Prophage families population.
phage_family <- tribble(
~most_common_phage, ~phage_family,
'Brevibacillus Jimmer1' , 'Myoviridae',
'Brevibacillus Jimmer2' , 'Myoviridae',
'Brevibacillus Abouo', 'Myoviridae',
'Brevibacillus Jenst' , 'Siphoviridae',
'Brevibacillus Osiris', 'Myoviridae',
'Paenibacillus HB10c2' , 'Siphoviridae',
'Paenibacillus Harrison' , 'Siphoviridae',
'Paenibacillus Tripp' , 'Siphoviridae',
'Deep s_D6E' , 'Myoviridae',
'Listeria 2389' , 'Siphoviridae',
'Listeria A006' , 'Siphoviridae',
'Listeria vB_LmoS_188', 'Siphoviridae',
'Listeria B054', 'Myoviridae',
'Thermus phi OH2' , 'Myoviridae',
'Geobacillus E3', 'Siphoviridae',
'Geobacillus GBSV1' , 'Myoviridae',
'Staphylococcus vB_SepS_SEP9' , 'Siphoviridae',
'Staphylococcus SPbeta_like' , 'Siphoviridae',
'Clostridium phiCT9441A' , 'Myoviridae',
'Clostridium phiCD27' , 'Myoviridae',
'Clostridium phiMMP02' , 'Myoviridae',
'Clostridium phiCD505', 'Myoviridae',
'Clostridium phiCT453A', 'Myoviridae',
'Clostridium phiSM101' , 'Siphoviridae',
'Clostridium c_st', 'Siphoviridae',
'Clostridium phiCD111', 'Siphoviridae',
'Clostridium phiCD211', 'Siphoviridae',
'Bacillus SPP1' , 'Siphoviridae',
'Bacillus phi105' , 'Siphoviridae',
'Bacillus virus 1' , 'Myoviridae',
'Bacillus Bam35c' , 'Tectiviridae',
'Bacillus phBC6A51' , 'Siphoviridae',
'Bacillus phBC6A52' , 'Siphoviridae',
'Bacillus PM1' , 'Siphoviridae',
'Bacillus WBeta' , 'Siphoviridae',
'Bacillus phIS3501' , 'Siphoviridae',
'Bacillus BtCS33' , 'Siphoviridae',
'Bacillus SPbeta' , 'Siphoviridae',
'Bacillus phi4J1' , 'Siphoviridae',
'Bacillus BalMu1' , 'Myoviridae',
'Bacillus AR9', 'Myoviridae',
'Bacillus Bam35c', 'Tectiviridae',
'Bacillus Bobb' , 'Herelleveridae',
'Bacillus BtCS33', 'Siphoviridae',
'Bacillus Eyuki' , 'Herelleveridae',
'Bacillus G', 'Myoviridae',
'Bacillus IEBH', 'Siphoviridae',
'Bacillus JBP901', 'Herelleveridae',
'Bacillus JL', 'Herelleveridae',
'Bacillus Palmer', 'Podoviridae',
'Bacillus PBS1', 'Myoviridae',
'Bacillus PfEFR_5', 'Siphoviridae',
'Bacillus phiCM3', 'Siphoviridae',
'Bacillus phiNIT1', 'Herelleveridae',
'Bacillus Pony', 'Herelleveridae',
'Bacillus SP_10', 'Herelleveridae',
'Bacillus vB_BhaS_171', 'Siphoviridae',
'Bacillus Waukesha92' , 'Siphoviridae',
'Bacillus Finn', 'Siphoviridae',
'Bacillus Fah', 'Siphoviridae',
'Enterobacteria phi92', 'Myoviridae',
'Enterococcus phiEf11', 'Siphoviridae',
'Enterobacteria vB_KleM_RaK2', 'Myoviridae',
'Lactobacillus JCL1032', 'Siphoviridae',
'Lactobacillus phiAT3', 'Siphoviridae',
'Planktothrix PaV_LD', 'Siphoviridae',
'Sphingomonas PAU', 'Myoviridae',
'Cellulophaga phiSM', 'Myoviridae',
'Staphylococcus SpaA1' , 'Siphoviridae',
'Synechococcus S_SKS1' , 'Myoviridae',
'Escherichia RCS47' , 'Myoviridae'
)
…
most\_common\_phage | phage\_family |
---|---|
Brevibacillus Jimmer1 | Myoviridae |
Brevibacillus Jimmer2 | Myoviridae |
Brevibacillus Abouo | Myoviridae |
Brevibacillus Jenst | Siphoviridae |
Brevibacillus Osiris | Myoviridae |
Paenibacillus HB10c2 | Siphoviridae |
Paenibacillus Harrison | Siphoviridae |
Paenibacillus Tripp | Siphoviridae |
Deep s\_D6E | Myoviridae |
Listeria 2389 | Siphoviridae |
The previous table is then joined to the main dataset and complements it. To find the prophage candidates from PHASTER data we evaluated the summary information. In this approach the most common phage is directly detected from PHASTER and its criteria is based on adding unidentified prophage proteins to the PHAGE with most proteins already identified using sequence similarity of the non identified proteins.
Intact prophages per bacteria
intact_phages_summary_cleaned <- df_summaries %>%
separate(most_common_phage, c("most_common_phage", "phage_proteins"), sep = "\\(") %>%
relocate(species) %>%
arrange(species) %>%
mutate(
phage_proteins = str_replace(phage_proteins, "\\)", ""),
most_common_phage = str_remove(most_common_phage, "PHAGE_"),
species = str_replace(species, "Bacillus", "B."),
species = as.factor(species),
phage_proteins = as.numeric(phage_proteins),
most_common_phage = str_replace(most_common_phage, "_", " "),
most_common_phage = str_replace(most_common_phage, "Bacill", "Bacillus"),
most_common_phage = str_replace(most_common_phage, "Brevib", "Brevibacillus"),
most_common_phage = str_replace(most_common_phage, "Clostr", "Clostridium"),
most_common_phage = str_replace(most_common_phage, "Thermu", "Thermus phi"),
most_common_phage = str_replace(most_common_phage, "Paenib", "Paenibacillus"),
most_common_phage = str_replace(most_common_phage, "Bacillus 1", "Bacillus virus 1"),
most_common_phage = str_replace(most_common_phage, "Lister", "Listeria"),
most_common_phage = str_replace(most_common_phage, "Geobac", "Geobacillus"),
most_common_phage = str_replace(most_common_phage, "Staphy", "Staphylococcus"),
most_common_phage = str_replace(most_common_phage, "Entero phi92", "Enterobacteria phi92"),
most_common_phage = str_replace(most_common_phage, "Entero phiEf11", "Enterococcus phiEf11"),
most_common_phage = str_replace(most_common_phage, "Entero vB_KleM_RaK2", "Enterobacteria vB_KleM_RaK2"),
most_common_phage = str_replace(most_common_phage, "Lactob", "Lactobacillus"),
most_common_phage = str_replace(most_common_phage, "Sphing", "Sphingomonas"),
most_common_phage = str_replace(most_common_phage, "Cellul", "Cellulophaga"),
most_common_phage = str_replace(most_common_phage, "Plankt", "Planktothrix"),
most_common_phage = str_replace(most_common_phage, "Synech", "Synechococcus"),
most_common_phage = str_replace(most_common_phage, "Escher", "Escherichia")
) %>%
separate(most_common_phage, c("most_common_phage", "phage_nc_id"), sep = "_NC_") %>%
inner_join(phage_family)
…
Using the organized data from intact prophages we then plot the best prediction per bacteria and their associated CDSs.
Bubbleplot of intact prophages per bacteria
bubbleplot_plot_summary <- intact_phages_summary_cleaned %>%
filter(completeness == "intact") %>%
select(species, most_common_phage, phage_proteins, phage_family) %>%
mutate(epithet = as_factor(word(species,2))) %>%
pivot_wider(names_from = "most_common_phage", values_from = "phage_proteins", values_fill = 0, values_fn = sum) %>%
pivot_longer(cols = -c(species, epithet, phage_family), names_to = "most_common_phage", values_to = "phage_proteins") %>%
filter(phage_proteins > 0) %>%
ggplot(aes(most_common_phage, species, size = phage_proteins, color = phage_family)) +
geom_point() +
facet_grid(rows = vars(epithet), scales = "free", space = "free") +
theme_bw() +
scale_size_area(max_size = 8) +
scale_color_brewer(palette = "Dark2") +
facet_grid(rows = vars(epithet), cols = vars(phage_family), scales = "free", space = "free") +
guides(color = guide_legend(override.aes = list(size=9))) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 13, face = "italic"),
axis.text.y = element_text(size = 13, face = "italic"),
axis.title.x = element_text(size = 16, face = "bold"),
axis.title.y = element_text(size = 16, face = "bold"),
strip.background = element_blank(),
strip.text.y = element_blank(),
strip.text.x = element_blank(),
legend.position = "right"
) +
labs(
x = "Prophage candidates",
y = "Bacterial species",
size = "CDS hits",
color = "Phage family"
)
# bubbleplot_plot_summary
…
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. Rmarkdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.
Aust, Frederik. 2019. Citr: RStudio Add-in to Insert Markdown Citations. https://github.com/crsh/citr.
Bache, Stefan Milton, and Hadley Wickham. 2020. Magrittr: A Forward-Pipe Operator for r. https://CRAN.R-project.org/package=magrittr.
McBain, Miles, Jonathan Carroll, Sharla Gelfand, Suthira Owlarn, and Garrick Aden-Buie. 2021. Datapasta: R Tools for Data Copy-Pasta. https://github.com/milesmcbain/datapasta.
Müller, Kirill, and Hadley Wickham. 2021. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Ushey, Kevin. 2021. Renv: Project Environments. https://rstudio.github.io/renv/.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
———. 2021a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2021b. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
———. 2021c. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Jim Hester. 2021. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.