Skip to content

Using R to analyze your SQM results

Natalia García García edited this page Nov 12, 2021 · 29 revisions

You have just run SqueezeMeta and now you are ready to analyze the results. But... there are too many useful tables! How can you face them? Try the R SQMtools package to explore with ease your results with R!

Basic workflow of SQMtools package.

All the functions and options are fully explained in the manual of this package, but for simplicity we provide some examples here.

We have used 16 samples from Hadza hunter-gatherer (8) and italians (8) gut microbiomes (Rampelli, S., Schnorr, S. L., Consolandi, C., Turroni, S., Severgnini, M., Peano, C., ... & Candela, M. (2015). Metagenome sequencing of the Hadza hunter-gatherer gut microbiota. Current Biology, 25(13), 1682-1693).

NOTE:
Don't forget to change the paths depending on your own directory structure!


1. Loading your data

First of all, load the SQMtools R package and your data:

library('SQMtools')
Hadza = loadSQM('/media/disk6/natalia/Hadza16')
# Now, there is a faster option to load your data if your project is large:
Hadza = loadSQM('/media/disk6/natalia/Hadza16', engine = 'data.table')

This function creates a SQM object that gathers all the most valuable information of the SQM project. The structure of this object is shown down below. To access to all this data you would have need to open and parse at least three different tables!

SQM_object.png

SQM Object structure

Given that this object is a list, use the next syntax to access to any vector, dataframe or matrix:
data_analyzed = $lvl1$lvl2$lvl3

Some examples to help you:

# Matrix genera vs. abundances
genus_tax=Hadza$taxa$genus$abund
genus_tax[1:5, 1:5]
                                        H1 H10 H11 H12 H13
Unclassified Candidatus Aenigmarchaeota 46  13   5   1   0
Unclassified Candidatus Altiarchaeota    0   0   0   0   2
Unclassified Candidatus Bathyarchaeota  10 238  74  34  17
Unclassified Candidatus Korarchaeota     8   0   0   0   0
Unclassified Candidatus Lokiarchaeota    4  30   8   5  24
# Matrix with COGs vs. abundances
COG_table=Hadza$functions$COG$abund
COG_table[1:5, 1:5]
         H1   H10  H11  H12  H13
COG0001 2560  9376 2613 1382 2306
COG0002 7475 14024 3281 2022 6014
COG0003   48    95   22    3    4
COG0004 1446  4754 1111  652 1670
COG0005 3724 10696 2625 1592 5021
# Raw table from SQM: 13.<SQM project name>.orftable
orf_table=Hadza$orfs$table
orf_table[1:5, 1:5]
                  Contig ID Molecule   Method Length NT Length AA
megahit_1_116-250 megahit_1      CDS Prodigal       136        46
megahit_2_3-284   megahit_2      CDS Prodigal       283        95
megahit_3_2-268   megahit_3      CDS Prodigal       268        90
megahit_4_1-261   megahit_4      CDS Prodigal       262        88
megahit_5_3-620   megahit_5      CDS Prodigal       619       207

Then, you can always make a summary of the project to discard occasional problems.

sum_Hadza = summary(Hadza)
sum_Hadza$project_name
[1] "Hadza16"
sum_Hadza$samples
[1] "H1"   "H10"  "H11"  "H12"  "H13"  "H15"  "H14"  "H16"  "IT1"  "IT11"
[11] "IT13" "IT14" "IT2"  "IT3"  "IT4"  "IT5"
sum_Hadza$reads_in_orfs
      H1      H10      H11      H12      H13      H15      H14      H16      IT1     IT11     IT13     IT14      IT2      IT3      IT4      IT5 
23847271 52692441 13718056  7833832 23807939 16412877 61547031 25641150  3191223  7090403  3415890  2472831 15259976 12563001  4362936 29855475

NOTE:
If you feel comfortable using R, please feel free to load your project and explore it on your way.
If not, we provide you with some functions that could facilitate the analysis:

Now, we are ready to explore our SQM project!


2. Exploring and visualizing your SQM project

2.1: Who is there?

Plot taxonomy at the phylum and family levels:

plotTaxonomy(Hadza, rank='phylum', count='percent')

Fig1_Hadza16_phylum.svg

Taxonomy barplot at the phylum level. The abundance of each taxonomic group in the whole metagenome is shown in percentages.

plotTaxonomy(Hadza, rank='family', count='percent')

Fig2_Hadza16_family.svg

Taxonomy barplot at the family level.

Note: save your plots using R functions, please check the documentation!

# Option A
#Save a svg/png file
svg('myfigure.svg') # To save a png, it would be png('myfigure.png')
plotTaxonomy(Hadza, rank='family', count='percent') #Plot
dev.off()

# Option B
#Use ggplot2
library(ggplot2)
image = plotTaxonomy(Hadza, rank='family', count='percent')
ggsave(file='myfigure.svg', plot = image, width=10, height=8) #svg
ggsave(file='myfigure.png', plot = image, width=10, height=8) #png

But don't settle for this! We provide you with some extra features to customize the colours, scale percentages, choose the most abundant, or your favourite, taxa, plot or not the 'Unclassified' (contigs or genes that have not been taxonomically annotated) or the 'Other' taxon (an artificial taxon that groups the less abundant taxa).

# Picking up taxa without rescaling and without 'Other'
plotTaxonomy(Hadza, rank='phylum', count='percent', tax = c('Proteobacteria','Actinobacteria','Spirochaetes'), rescale = F, color = c('yellowgreen','tan3','cornflowerblue'), others = F)

Fig4_Hadza16_tax.svg

Taxonomy barplot of the chosen taxa without rescaling percentages and without the 'Other' taxon.

# Choosing taxa rescaling and without 'Other'
plotTaxonomy(Hadza, rank='phylum', count='percent', tax=c('Proteobacteria','Actinobacteria','Spirochaetes'), rescale = T, color = c('yellowgreen','tan3','cornflowerblue'), others = F)

Fig5_Hadza16_taxrescale.svg

Taxonomy barplot of the chosen taxa rescaling percentages but without the 'Other' taxon.


2.2. What are they doing?

Now, we know something about the microbial diversity in the samples, but what are they doing? We should check the functional profile(KEGG, COG and PFAM annotations) of the samples:

plotFunctions(Hadza, fun_level = 'KEGG', count = 'tpm', N = 15)

Fig6_Hadza16_KEGG_TPM.svg

Most abundant KEGGs. The abundance of each function (KEGG) is counted in TPM (‘transcripts per million’), genes from each function per million genes in the whole metagenome.

You may find something useful analyzing some KEGGs with more detail. We observe that KEGG 'K07133', which is an 'Uncharacterised protein', has a higher abundance in Hadza samples. Which is the protein sequence annotated with this KEGG?

# Extract the orf annotated with that KEGG
orf = Hadza$orfs$table[Hadza$orfs$table$`KEGG ID` == 'K07133', 1:15] # see COGs, PFAM...
# Extract the sequence
sequence=Hadza$orfs$seqs[rownames(orf)]
megahit_1163508_371-1111 
"MVEKSNIDMLNRKIYSYLRDFFETEKKALLVSGARQVGKTFAIRKVGAECFADVVEFNFLNNPKYREAFKSPSDAKEILLRLSALAEKKLIPGTTLVFFDEVQECPEMVTAIKFLVEEGSYRYVMSGSLLGVELKDIRSVPVGYMAERDENYKVEGYEKGMQRFYAYLSLFTMSMLGLVVATNIFQMYLFWELVGVCSYLLIGFYYPKHAAVHASKKAFKKLELDPPSHLATRCAMGTSHSYRHHR*" 

So you can run a BLAST or try to find out more about its domains, make a multiple alignment with other proteins ... It's up to you!


2.3. Subset data

At this point, is important to mention that it is possible to subset all the results related to a specific taxon, function or bin. The result is again a SQM object with the same structure described previously, but only with the information related to the selected feature. The best of it: most of the SQMtools functions will perfectly work on it and with the information that you really want!

2.3.1. Explore a specific taxon

We decided to pick up all contigs from 'Proteobacteria' and plot their taxonomy at the genus level and their most abundant functions:

proteo=subsetTax(Hadza, 'phylum',tax='Proteobacteria', rescale_copy_number = F)
plotTaxonomy(proteo, 'genus','percent', N=10, rescale = T, others = T)

Fig12_Hadza16_proteo_Tax.svg

Taxonomy barplot at the genus level of the 'Proteobacteria' subset. The abundance percentage of Proteobacteria is relative to the whole metagenome

plotFunctions(proteo, fun_level = 'KEGG',count = 'copy_number')

Fig9_Hadza16_proteo_Fun.svg KEGG functional profile using copy number of the 'Proteobacteria' subset. The abundance of each function (KEGG) is counted in copy numbers (the average copies of each function per genome in the whole metagenome).

2.3.2. Study specific functions

In this case, we subset all the genes whose functional annotations contain the word 'antibiotic':

antibiotic = subsetFun(Hadza, fun = 'antibiotic', rescale_copy_number = F) 
plotTaxonomy(antibiotic, 'genus','percent', N = 10, rescale = T, others = T)

Fig11_Hadza16_antiobiotic_Tax.svg

Taxonomy barplot based on the 'antibiotic' subset. The barplot shows the taxonomic distribution and percentage in the different samples of the ORFs containing functions whose annotations include the word 'antibiotic'.

plotFunctions(antibiotic, fun_level = 'KEGG',count = 'tpm')

Fig15_Hadza16_antibiotic_Fun.svg

KEGG functional profile using TPM of the 'antibiotic' subset.

Also, you can subset all the genes related to a particular KEGG pathway. Ex:

aromatic_aa = subsetFun(Hadza, fun = 'Phenylalanine, tyrosine and tryptophan biosynthesis', rescale_tpm = F,rescale_copy_number = F)

This is useful to analyze which taxa are involved in a specific KEGG pathway:

# Plot taxonomy
plotTaxonomy(aromatic_aa, 'genus', 'percent', N = 10, rescale = F, others = T)

Taxonomy barplot at the genus level of the ‘aromatic_aa’ subset. The barplot shows the taxonomic distribution and percentage in the different samples of the ORFs containing functions from the ‘Phenylalanine,tyrosine and tryptophan biosynthesis’ KEGG pathway.

plotFunctions(aromatic_aa, fun_level = 'KEGG', count = 'tpm')

aromatic_aa.kegg.tpm.svg

Most abundant KEGGs in the ‘aromatic_aa’ subset. The abundance of each function (KEGG) is counted in TPM (‘transcripts per million’), genes from each function per million genes in the whole metagenome.

The function subsetFun also accepts regular expressions if the parameter fixed = F:

# Select genes annotated with K06001 OR K01696
trp_synthesis = subsetFun(Hadza, fun = 'K06001|K01696', fixed = F)
plotTaxonomy(trp_synthesis, rank = 'genus')

trp_synthase.genus.png

Taxonomy barplot at the genus rank of the genes annotated with the K06001 or the K01696 KEGGs.


2.3.3. Explore a specific bin

And if you are interested in bins, you can subset them from the SQM object:

maxbin005 = subsetBins(Hadza, 'maxbin.005.fasta.contigs' )
# Plot functions
plotFunctions(maxbin005, fun_level = 'KEGG', count = 'tpm')

maxbin005.kegg.tpm.svg

Most abundant KEGGs in the ‘maxbin005’ subset. The abundance of each KEGG is counted in TPM.


2.3.4. Combine different subsets!

Eg: Subset genes from carbohydrates metabolism of Bacteroidetes and from Amino acid metabolism of Proteobacteria.

bacteroidetes = subsetTax(Hadza, 'phylum','Bacteroidetes')
bacteroidetes_carbohydrates = subsetFun(bacteroidetes, 'Carbohydrate metabolism')
proteobacterias = subsetTax(Hadza, 'phylum','Proteobacteria')
proteobacterias_aminoacids = subsetFun(proteobacterias, 'Amino acid metabolism')
bact_carbo_proteo_amino = combineSQM(bacteroidetes_carbohydrates, proteobacterias_aminoacids, rescale_copy_number = F, tax_source = "contigs")
bact_carbo_proteo_amino = combineSQM(bacteroidetes_carbohydrates, proteobacterias_aminoacids, rescale_copy_number = F)
plotTaxonomy(bact_carbo_proteo_amino, 'phylum','percent', N = 10, rescale = T, others = T)

Fig16_Hadza16_combine_Tax_ProteoBacteroidetes.svg

Taxonomy barplot at the phylum level of the 'Carbohydrates metabolism of Bacteroidetes and Amino acid metabolism of Proteobacteria' subset.

plotFunctions(bact_carbo_proteo_amino, fun_level = 'KEGG', count = 'tpm')

TPM of most abundant KEGGs of the 'Carbohydrates metabolism of Bacteroidetes and Amino acid metabolism of Proteobacteria' subset.


2.4. Export results

2.4.1. Export table

If you don't feel comfortable using R, you have the option to export your tables in tsv format, which can be opened by different software such as Excel or LibreOffice Calc:

exportTable(bact_carbo_proteo_amino$functions$COG$tpm, 'Hadza_COG.tsv')  

2.4.2. Export Krona

If you prefer exploring your results through Krona charts:

    exportKrona(Hadza)

Krona.png Snapshot of a Krona chart.

2.4.3 Export Pathways

Other convenient function is the option of analyzing KEGG pathways. Clear up any doubt by checking if a KEGG pathway is complete or not in one or all the samples. Eg: KEGG PATHWAY: ko00400 'Phenylalanine, tyrosine and tryptophan biosynthesis' .

# To know if a KEGG is present or not in one sample,
# you can provide a vector of colors.
# Colors must be in the same order that samples
# To distinguish between Hadza and Italians:
# Repeat one color 8 times because the first 8 samples are from Hadza.
# The next 8 samples are from Italians, repeat their color.
colors = c(rep('#006682', 8), rep('#c26e00', 8))
exportPathway(Hadza, '00400', output_suffix = 'aromatic_aa', sample_colors = colors, max_scale_value = 3, count= 'copy_number')

ko00400.aromatic_aa.multi.png

Phenylalanine, tyrosine and tryptophan biosynthesis. Each square is an enzyme and it is divided in so many parts as samples are in the metagenome. Each enzyme is linked to one or more KEGGs. If the enzyme is present in one sample, its division is colored with the color of that sample. Hadza samples are shown in blue and Italian ones in orange. The hue of the color depends on the value: the darkest colors imply a copy number ≥ 3. White is for not present enzymes.

Also, it is possible to visualize the log2 fold-change of the KEGG abundances among two categories. This will help you to know if an enzyme is more abundant in one category (Hadza) or in the other (Italy):

# To calculate the log-fold change of a KEGG, you can provide a list of vectors.
# One vector with the samples names by condition:
#Hadza samples
H.samples = Hadza$misc$samples[grep('H', Hadza$misc$samples)]
IT.samples = Hadza$misc$samples[grep('IT', Hadza$misc$samples)]
condition = list(H.samples, IT.samples)
# Choose one color per condition
colors = c('#006682', '#c26e00')
# Plot log2 fold changes using copy number abundances in the selected KEGG pathway:
exportPathway(Hadza, '00400', output_suffix = 'aromatic_aa.log2FC', fold_change_colors = colors, fold_change_groups = condition, count ='copy_number', max_scale_value = 1.5)

ko00400.aromatic_aa.multi.png

Phenylalanine, tyrosine and tryptophan biosynthesis. Each square is an enzyme. The hue of the color depends on the value of the log-fold-change calculated using copy numbers: the darkest orange implies a log-fold-change value ≥ 1.5 (more abundant in Italian samples) and the darkest blue a log-fold-change ≤ -1.5 (more abundant in Hadza samples). White is for not present enzymes.


2.5. Use other R packages

Are you wondering how to use the SQM Object with other packages? Keep on reading!

2.5.1. Differencial abundances with DESeq2

Which functions are more abundant in the project? We use the DESeq2 package to analyze which functions are significantly more abundant in Hadza or Italians samples. All you have to do is select the matrix with your data from the SQM object and make a metadata dataframe:

# Tutorial based on http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
# Load the package
library('DESeq2')
# Required: matrix with the raw abundances, colData (metadata file)
metadata=as.data.frame(c(rep('H',8), rep('I',8)))
rownames(metadata)=colnames(Hadza$functions$KEGG$abund)
colnames(metadata)='condition'
# Verify sample order
all(rownames(metadata)==colnames(Hadza$functions$KEGG$abund))
# Convert your data to the format required by DESeq2
dds = DESeqDataSetFromMatrix(countData = Hadza$functions$KEGG$abund, colData = metadata, design = ~ condition)
# Remove low abundant KEGGs:
keep = rowSums(counts(dds)) >= 10
dds = dds[keep,]
# Choose factor levels
dds$condition=factor(dds$condition, levels=c('H','I'))
# Run DESeq2
dds2=DESeq(dds)
results=results(dds2, name='condition_I_vs_H')
plotMA(results, ylim=c(-2,2))

Fig7_Hadza16_deseq.svg

Gene differential abundance using DESeq2. Red coloured points are those with an adjusted p value lower than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

2.5.2. Use the vegan package with the SQM Object

Sometimes, it is useful to make an ordination plot or a cluster of your samples according to their similarity. If you use the 'vegan' package, you only have to take into account that you need to transpose the data.

library('vegan')
metadata = as.data.frame(c(rep('Hadza', 8), rep('Italy', 8)))
rownames(metadata) = colnames(Hadza$functions$KEGG$tpm)
colnames(metadata) = 'condition'
# Tranpose the matrix to have samples in rows.
kegg_tpm = t(Hadza$functions$KEGG$tpm)
MDS = metaMDS(kegg_tpm)
colvec = c('#006682','#c26e00') # colors
plot(MDS, display = 'sites')
with(metadata, points(MDS, display = 'sites', col = colvec[condition], pch = 19))
with(metadata, legend('topright', legend = levels(condition), col = colvec, pch = 21, pt.bg = colvec))

NMDS.svg

Non-metric multidimensional scaling (NMDS) ordination using KEGGs.

Additionally, you can run different statistical tests (eg: PERMANOVA test) to know if some variable (like the location) is affecting the clustering of samples in the ordination:

library('vegan')
metadata = as.data.frame(c(rep('Hadza', 8), rep('Italy', 8)))
rownames(metadata) = colnames(Hadza$functions$KEGG$tpm)
colnames(metadata) = 'condition'
# Tranpose the matrix to have samples in rows.
kegg_tpm = t(Hadza$functions$KEGG$tpm)
country = c(rep('Hadza', 8), rep('Italy', 8))
adonis(kegg_tpm~country)
Call :
    adonis (formula = kegg_tpm ~ country)
    Permutation : free
    Number of permutation: 999
    Terms added sequentially (first to last)
              Df SumsOfSqs  MeanSqs F.Model  R2      Pr(>F)
country       1  0.034942   0.034942 7.9894 0.36333 0.001 ***
Residuals     14 0.061231   0.004374        0.63667
Total         15 0.096173                   1.00000
---
Signif. codes:
0***0.001**0.01*0.05.0.1 ‘ ’ 1

According to this, samples from the same location are more similar in their functional annotations (KEGGs).

The end

This tutorial provides you with some examples. Now you have a tool to explore your SQM project even though you don't know so much about R. Good luck!!

Note: Save your R workspace

save.image('mySession.RData') #Save  R workspace
load('mySession.RData') #Load a previous  R workspace

Find more information about

The SqueezeMeta paper at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.03349/full
The SQMtools paper at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03703-2