LipidomicAnalysis.Rmd

---
title: "The Hitchhiker’s Guide to untargeted lipidomics analysis: Practical guidelines"
author: "D. Smirnov, P. Mazin, M. Osetrova, E. Stekolshchikova, E. Khrameeva"
date: "8/18/2021"
output: 
  github_document:
    pandoc_args: "--webtex"

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction
Here we present step-by-step guide to bioinformatic analysis of untargeted LS-MS lipidomic data. This manual reproduces the key steps described in the manuscript:

* Data importing

* Lipid signal extraction (peak picking, peak alignment, peak grouping)

* Filtering and normalization

* Visualization

**Note:** normalization and statistical methods as well as preprocessing parameters cannot be universally applicable and should be chosen in each individual case based on experimental conditions, instrument characteristics and study purposes!


## Package import
Loading the packages required for analysis
```{r, results='hide', message=FALSE, warning=FALSE}
library(xcms)
library(ggplot2)
library(DT)
library(IPO)
library(mixOmics)
library(dplyr)
library(missForest)
library(reshape2)
library(gridExtra)
library(SummarizedExperiment)
```

## Data import
We will demonstrate the key concepts of LC-MS untargeted lipidomic analysis using a test dataset. Raw MS files (~1.87 GB) from this dataset converted into the .mzXML format can be downloaded into the current directory using the following code:
```{r}
#url <- "https://makarich.fbb.msu.ru/khrameeva/brainmap/sampledata.tar.gz"
#download.file(url, destfile = 'sampledata.tar.gz', method = "curl")
#untar('sampledata.tar.gz')
```

Raw MS files located in the `sampledata/` folder are organized into three subfolders according to the sample groups (2 files per group + blank measurement). The code below will create a table with sample metadata
```{r}
mzfiles <- list.files('sampledata/', recursive = TRUE, full.names = TRUE, pattern = '.mzXML')
group <- unlist(lapply(strsplit(mzfiles,"/"), function (x) x[[3]]))
pd <- data.frame(sample_name = sub(basename(mzfiles), pattern = ".mzXML", replacement = "", fixed = TRUE), 
                 sample_group = group, 
                 stringsAsFactors = FALSE)

knitr::kable(pd)
```

Now .mzXML files can be imported into `MSnExp` object via `readMSData` function
```{r}
raw_data <- readMSData(files = mzfiles, 
                       pdata = new("NAnnotatedDataFrame", pd), 
                       mode = "onDisk", 
                       msLevel = 1, 
                       verbose = T, 
                       centroided = T)
```


## Peak picking
Feature detection `CentWave` algorithm based on continuous wavelet transformation allows to distinguish lipid peaks from background noise. To perform peak calling we need to set up CentWave parameters first
```{r}
cwp <- CentWaveParam(peakwidth = c(9.5, 36),
                     ppm = 11.5,
                     noise = 0, 
                     snthresh = 10, 
                     mzdiff = -0.001, 
                     prefilter = c(3, 100), 
                     mzCenterFun = "wMean", 
                     integrate = 1, 
                     fitgauss = FALSE)
```
It's highly recommended to specify `peakwidth` (minimum and maximum MS peak width in RT dimension) and `ppm`(width of region of interest in the m/z dimension) parameters based on ion chromatograms for internal standards. 

Once the parameters are set one can proceed to chromatographic peak detection using `findChromPeaks` function.

```{r, message=FALSE}
xset <- findChromPeaks(raw_data, param = cwp)
```

## Peak alignment
Peak alignment procedure aims to eliminate retention times shifts between identified chromatographic peaks from samples. We will use OBI-warp algorithm implemented within `xcms` package to match peaks between MS runs.
```{r}
arp <- ObiwarpParam(distFun = "cor_opt", 
                    binSize = 1, 
                    response = 1, 
                    gapInit = 0.32, 
                    gapExtend = 2.688, 
                    factorDiag = 2, 
                    factorGap = 1,
                    localAlignment = FALSE)

xset <- adjustRtime(xset, param = arp)
```
`ObiwarpParam` function creates an object with parameters for the alignment and `adjustRtime` performs the peak matching.


An example of peaks before and after RT shift correction is shown below
```{r, fig.width = 7, fig.height= 3.5}
group_colors <- c("black", "red", "forestgreen")
names(group_colors) <- unique(xset$sample_group)

par(mfrow=c(1,2),las=1)
chr <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = F)
chr.adj <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = T)
plot(chr, peakType = "none", col=group_colors[xset$sample_group], main = "Before alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
plot(chr.adj, peakType = "none", col=group_colors[xset$sample_group], main = "After alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
```

## Peak grouping
Once retention time correction is done one can proceed to the correspondence analysis of aligned peaks. The general idea of peak grouping is to identify peaks from the same ion across samples and group them together to form a lipid feature. In order to do this analysis, we will use Peak density method that accessible in `xcms` via `groupChromPeaks` function. 
`PeakDensityParam` object contains grouping settings and is used as input for `groupChromPeaks`.

```{r}
pdp <- PeakDensityParam(sampleGroups = xset$sample_group, 
                        bw = 0.879999999999999, 
                        binSize = 0.02412, 
                        minFraction = 0.000001, 
                        minSamples = 1, 
                        maxFeatures = 50)

xset <- groupChromPeaks(xset, param = pdp)
```


## Selection of parameters for peak picking, alignment, and grouping
For simplicity and saving time, in the code sections above we provided parameters optimized for untargeted lipidome LC-MS measurements on a Reversed-Phase Bridged Ethyl Hybrid (BEH) C8 column reverse coupled to a Vanguard precolumn, using a Waters Acquity UPLC system and a heated electrospray ionization source in combination with a Bruker Impact II QTOF (quadrupole-Time-of-Flight) mass spectrometer. To customize the parameters for a particular LS-MS experiment we recommend optimizing them via `IPO` package. 

Please note that optimizing peak calling parameters with `optimizeXcmsSet` function is a quite computationally intensive operation. It may take hours or even days (depending on the number of samples) before the optimization process ends!

To perform the optimization just uncomment the code below. It will return the R script with optimized processing parameters.

Set up default parameters for peak picking optimization procedure:
```{r}
#peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
#peakpickingParameters$min_peakwidth = c(0,10)
#peakpickingParameters$max_peakwidth = c(10,30)
#peakpickingParameters$ppm = c(0,10)
```

Optimize peak picking parameters:
```{r}
#resultPeakpicking <- optimizeXcmsSet(files = mzfiles, 
#                                     params = peakpickingParameters, 
#                                     nSlaves = 0, 
#                                     subdir = NULL)

#optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset
```

Optimize retention time correction and grouping parameters:
```{r}
#retcorGroupParameters <- getDefaultRetGroupStartingParams()
#resultRetcorGroup <- optimizeRetGroup(xset = optimizedXcmsSetObject, 
#                                      params = retcorGroupParameters,
#                                      nSlaves = 0,
#                                      subdir = NULL)

#writeRScript(resultPeakpicking$best_settings$parameters, 
#             resultRetcorGroup$best_settings, 1)
```


## Imputation of missing values
Unfortunately, peak peaking algorithm may produces a sufficient numbers of NAs for those samples in which it wasn't able to identify MS peaks. We will try to impute missing chromatographic peaks within samples using `fillChromPeaks` function.
```{r}
xset <- fillChromPeaks(xset)
```
Please note that `fillChromPeaks` may not impute all the gaps in MS data. The remaining missing values will be further removed/imputed in the section "Filtering of peaks" below.


## Data export
Now we can extract feature matrix from `xset` object
```{r, message=FALSE}
pks <- chromPeaks(xset)
grs <- featureDefinitions(xset)
mtx <- featureValues(xset, method="maxint", value="into", filled=T) 

knitr::kable(head(mtx))
```

Change the column names of mtx matrix
```{r, message=FALSE}
colnames(mtx) <- unlist(strsplit(colnames(mtx), split = '.mzXML'))
```

To correctly normalize abundance matrix `mtx` further we will need the information about internal standard used in the experiment. The code below will extract peaks corresponding to TG(15:0-18:1-d7-15:0).
```{r, message=FALSE}
source("src/rt-mz.annotator.R") # load the annotator function
```

Generate TAG that corresponds to the standard:
```{r}
std = generateTGL(15+18+15,1)
std$FORMULA = 'C51H89D7O6' #change 7 H to 7 D
std$EXACT_MASS = calcExactMass(std$FORMULA) # recalculate the exact mass using updated formula
knitr::kable(std)
```
Using `annotateByMass` function we will find all the peaks in the data that matched internal standard by exact mass value and filter them based on the elution time. In this function adducts can be defined by `ions` parameter, by default the function uses H, Na, NH4, K and aNH4. We will specify ppm threshold for acceptable m/z deviation in `annotateByMass` to 20.

```{r}
std.ann = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), std, ppm=20)
std.ann = std.ann[std.ann$rt > 600, ] # Although we don't know the exact RT value we know that TAGs elute after 10 minutes
knitr::kable(std.ann)
```
H-adduct looks suspicious (high ppm, different rt), so we can exclude it from analysis:
```{r}
std.ann = std.ann[std.ann$ppm < 5, ]
```

Finally, compute the median intensity of peaks found within each sample. Those will come in handy for normalization procedure.
```{r}
div <- apply(mtx[std.ann$id, c(2:5)], 2, function (x) median(x, na.rm = T))
```

## Annotation
Annotation is arguably the most tricky part of untargeted LS-MS analysis. Here we present a method to create annotation for lipid features obtained in the previous steps.

To obtain the full annotation set we will again utilize `annotateByMass` function, but this time m/z values of peaks will be matched to exact masses of existing lipids from LIPID MAPS database.
```{r, message=FALSE}
ann <- annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), db = LMDB, ppm = 20)
knitr::kable(head(ann))
```

```{r, message=FALSE}
table(ann$ion)
```

Here `ppmd` is calculated as 

![ppmd = (1 - \frac{MZ}{EM - Adduct}) \cdot 1e^6
](https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+ppmd+%3D+%281+-+%5Cfrac%7BMZ%7D%7BEM+-+Adduct%7D%29+%5Ccdot+1e%5E6%0A)

where $MZ$ is a peak m/z value, $EM$ is an exact mass of known lipid from LIPID MAPS database and $Adduct$ represents the adduct mass.

Next we will explore the distribution of `ppmd` values from the annotation table:  
```{r, message=FALSE}
h = hist(ann$ppmd, 1000, main = 'ppmd distribution', xlab = 'ppmd')
mode = which.max(h$counts)
mode = h$breaks[mode]/2 + h$breaks[mode+1]/2
abline(v = mode, col = 'red')
```
Most likely there is a m/z shift around -4 ppm (shown by red vertical line), it also corresponds well with `ppmd` of the internal standard.

Display the value of m/z shift
```{r}
mode
```

Take only annotation within mode +- 10
```{r}
ann = ann[ann$ppmd > (mode - 10) & ann$ppmd < (mode + 10), ]
```

Check annotation uniqueness
```{r}
annotation.freq <- table(table(ann$id))
print(paste(annotation.freq[1], "peaks have unique annotation"))
print(paste(annotation.freq[2], "peaks are annotated with two lipids"))
```

Add two columns with lipid categories and classes to `ann` table
```{r}
ann$category = substr(ann$LM_ID,3,4)
ann$class = substr(ann$LM_ID,3,6)
```

Display the number of lipids annotated uniquely on category level
```{r}
annc = unique(ann[,c('id','rt','mz','category')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on category level"))
```

We will retain features with unique annotation to build rt-mz plot: 
```{r}
t = table(annc$id) # count the annotation variants for all features
annc = annc[annc$id %in% names(t)[t==1],] # filter out all features with multiply annotation names
cats = unique(annc$category) # get the list of lipid categories
cols = setNames(RColorBrewer::brewer.pal(length(cats),'Set1'),cats) # set color palette
annc <- annc[annc$rt > 120 & annc$rt < 1200, ] # specify the range of RT 
plot(annc$rt,annc$mz,pch=16,col=cols[annc$category],xlab='RT', ylab='m/z', xlim = c(120, 1330), bty='n')
legend('topright',pch=16,col=cols,legend=cats) 
```

Print the number of lipids annotated uniquely on class level
```{r}
annc <- unique(ann[,c('id','rt','mz','class')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on class level"))
```

Keep annotated features only
```{r}
mtx <- mtx[unique(ann$id),]
```

Also one can try to perform annotation for specific lipid class only. We will illustrate such approach using a custom TAG generator that creates a table of triacylglycerols (TAG(10:0)	- TAG(70:8)) with corresponding lipid formula and exact masses.
```{r}
tags = generateTGL(10:70,0:8) 
electron.mass = 0.00054858
knitr::kable(head(tags))
```

We will use the same `annotateByMass` function as a before, but with previously generated TAG table instead of full LIPIDMAPS database.
```{r}
ann2 = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed),
                      tags,
                      ions = c(NH4=calcExactMass('NH4') - electron.mass),
                      ppm=20)

knitr::kable(head(ann2))
```

`ann2` annotation table contains a number of peaks that have non-unique annotation. We will plot the rt-m/z scatter plot and highlight the area of correct TAGs by red rectangle
```{r}
plot(ann2$rt,ann2$mz, xlab = 'RT', ylab = 'm/z')
rect(820, 700, 1050, 1000, border = 'red', col=NA)
```

Since the plot above showed that there are additional peaks in obtained TAG annotation table, we will try to exclude them using an approach based on grid-like pattern in rt-mz coordinates (please see the manuscript for the method description).
```{r}
nets <- lookForNets(ann2,rt.win=c(1,30)) # by default rt.win is in mins, we will change it to secs
net.table <- sort(table(nets$start)) # count the number of features per net 
start <- names(net.table)[length(names(net.table))] # get id of the biggest net
tags = nets[!is.na(nets$start) & nets$start==as.numeric(start),] # select the biggest net
plotNet(tags)
```
The net looks quite noisy, so we will plot it again, but with circle size based on value of `ppmd` deviation of peaks
```{r}
plot(tags$rt, tags$mz, cex = abs(tags$ppmd-mode)/6, xlab = 'RT', ylab = 'm/z')
```

Filter out the peaks with high `ppmd` deviation
```{r}
net <- plotNet(tags[abs(tags$ppmd-mode) < 5, ])
```

The net looks better, but there are still some peaks that pretends to be same lipid, so the filtering procedure needs manual curation.

## Filtering of peaks
We will perform feature filtering based on blank samples to retain all features for which the median concentration ratio between biological samples and blank samples is greater than 2.
```{r, message=FALSE}
med.MS <- apply(mtx, 1, function(x) log10(median(x[grep("MS",colnames(mtx),perl=T)], na.rm=T)))
med.blank <- apply(mtx, 1, function(x) log10(median(x[grep("Blank",colnames(mtx))], na.rm=T)))
filter.blank <- (med.MS - med.blank) > log10(2)
filter.blank[filter.blank==T] <- NA
filter.blank[is.na(filter.blank)] <- T
mtx <- mtx[filter.blank, c(2:5)]
```

The proportion of filtered features can be visualized using a mean-difference plot:
```{r, message=FALSE}
plot((med.MS+med.blank)/2, 
     med.MS-med.blank, 
     pch=21, las=1, bg="gray",  
     col="dimgray", cex=1.2, lwd=0.4, xlab = "Sample Intensity", ylab = "Sample - Blank")

points((med.MS[filter.blank]+med.blank[filter.blank])/2,
       med.MS[filter.blank]-med.blank[filter.blank],
       pch=21, bg="#B20F25", col="dimgray", cex=1.2, lwd=0.4)

abline(h=log10(2),col="#B20F25")
```

The code below removes all features that possess more than 30% of NA across samples.
```{r, message=FALSE}
th <- 0.3
peaks.nas <- apply(mtx, 1, function (x) sum(is.na(x)))
mtx <- mtx[(peaks.nas/ncol(mtx)) < th, ]
```

To impute missing values not filled by `fillChromPeaks` and not removed by filtering we will use an implementation of random forest algorithm from `MissForest` package.
```{r, message=FALSE}
mtx.imp <- missForest(mtx)
mtx <- mtx.imp$ximp
```

## Normalization
In order to make samples comparable to each other we will utilize sample-specific normalization by wet weight and internal standard.

Load the matrix of wet weights for our samples
```{r, warning=FALSE}
wetw <- as.matrix(read.csv("POS.WETWEIGHT.csv", header=F, row.names=1))
knitr::kable(wetw)
```

Perform the normalization
```{r}
wetw <- wetw[colnames(mtx), ]
mtx.normalized <- t(apply(mtx, 1, function (x) x*mean(wetw)*mean(div)/(wetw*div)))
mtx.log <- log10(mtx.normalized)
```

## Downstream analysis
Processed matrix with quantified lipid abundances can be used for downstream analysis. We will apply two classical multivariate approaches (PCA and PLS-DA) as well as univariate statistical methods to analyze the intergroup differences between lipid profiles. 

### Principal Component Analysis (PCA)
PCA is a multivariate technique that extremely useful for classification purpose. The key idea of the method is to project original matrix of lipid abundances into low dimensional space. To perform dimensionality reduction PCA computes the reduced set of uncorrelated variables named `Principal Components`. For a given matrix $X_{n \times m}$, where features are rows and samples are columns, principal component vectors can be defined by finding eigenvectors of the following sample covariance matrix $S$:

![\begin{align*}
S = \frac{1}{n-1} X^{T} C_n X
\end{align*}
](https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+%5Cbegin%7Balign%2A%7D%0AS+%3D+%5Cfrac%7B1%7D%7Bn-1%7D+X%5E%7BT%7D+C_n+X%0A%5Cend%7Balign%2A%7D%0A)

where $C_n = I_n - \frac{1}{n}1_{n} 1^{T}_{n}$ is a centering matrix, $I_n$ represents an identity matrix of size $n$. 

To calculate principal components we will use base R function `prcomp`
```{r, message=FALSE}
pca <- prcomp(t(mtx.log), center = TRUE, scale. = TRUE)
```

To visualize relationships between samples in a new low dimensional space we will plot PC1 and PC2 against each other.

```{r, message=FALSE}
Y <- pd$sample_group[2:5] 
pca.data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], class = Y)
ggplot(data = pca.data, aes_string(x = "PC1", y = "PC2", color = "class", shape = "class")) + 
    geom_point(size = 5) +
    theme_light()
```

### Partial Least-Squares Discriminant Analysis (PLS-DA)
While both PCA and PLS-DA achieve dimensionality reduction computing the principal components, PLS-DA can be used rather for classification and feature selection than for clustering purpose. Mathematically, PLS-DA principal components can be obtained in a similar to PCA manner, as eigenvectors of a matrix of covariances between $X$ and $Y$: 

![\begin{align*}
S = S_{xy} S_{yx} = \frac{1}{n-1} X^{T} C_n Y \times \frac{1}{n-1} Y^{T} C_n X = \frac{1}{(n-1)^2} X^{T} C_n Y Y^{T} C_n X
\end{align*}
](https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+%5Cbegin%7Balign%2A%7D%0AS+%3D+S_%7Bxy%7D+S_%7Byx%7D+%3D+%5Cfrac%7B1%7D%7Bn-1%7D+X%5E%7BT%7D+C_n+Y+%5Ctimes+%5Cfrac%7B1%7D%7Bn-1%7D+Y%5E%7BT%7D+C_n+X+%3D+%5Cfrac%7B1%7D%7B%28n-1%29%5E2%7D+X%5E%7BT%7D+C_n+Y+Y%5E%7BT%7D+C_n+X%0A%5Cend%7Balign%2A%7D%0A)

where $C_n$ and $n$ represent a centering matrix and a total number of samples, respectively.

We will use the sparse version of the algorithm - sPLS-DA. First of all, matrix of predictors and vector of responses should be defined
```{r, message=FALSE}
X <- t(mtx.normalized)
Y <- as.factor(Y)
```

Tune sPLS-DA parameters using Leave-One-Out cross validation
```{r, message=FALSE}
list.keepX <- seq(1, 100, 2)
tune.splsda <- tune.splsda(X, Y, ncomp = 2, validation = 'loo', folds = 4, 
                           progressBar = FALSE, dist = 'max.dist',
                           test.keepX = list.keepX, nrepeat = 1)
```

Run sPLS-DA with optimized parameters
```{r, message=FALSE, warning = FALSE}
splsda.model <- splsda(X, Y, ncomp = 2, keepX = tune.splsda$choice.keepX)
plotIndiv(splsda.model, ind.names = FALSE, legend=TRUE, ellipse = TRUE)
```

A vector of feature contributions can be retrieved from the model in the following way
```{r, message=FALSE}
plsda.contributions <- selectVar(splsda.model, comp = 1)$value
```

## Statistical analysis
Univariate methods (e.g. t-test, Wilcoxon rank sum test, ANOVA) are especially useful for detecting differences in concentration between samples on the level of single molecules. 

As an illustrative example, we will apply Wilcoxon rank sum test (WRST) to the task of comparing Group1 and Group1 lipid abundances. In contrast to t-test, this test doesn't assume that the data has a normal distribution. 
 
```{r}
stats <- apply(mtx.normalized, 1, function (x) wilcox.test(x[1:2], x[3:4])$p.value)
```

Given the small sample size of groups studied (two samples per group) the minimal raw p-value we can obtain with WRST is 0.33.
```{r}
min(stats)
```

The one more important step when dealing with statistical testing of hundreds or thousands features is a multiple testing correction procedure. Setting a significance level to 0.05 one can found a 5% of features to be significant even they are not. So, if the 5000 lipids are tested, 250 lipids could be significant by chance. To avoid having large number of false positive results, raw p-value can be adjust using multiply testing correction methods (e.g. Bonferroni, Holm or FDR corrections). 

```{r}
p.val <- p.adjust(stats, method = 'fdr') 
```

The statistical testing often accompanied by `Fold Change` (FC) calculation. FC shows the magnitude and direction of change in lipid concentration between group studied. 
```{r}
lfc <- log2(rowMeans(mtx.normalized[,c(1,2)])) - log2(rowMeans(mtx.normalized[,c(3,4)]))
```

The code below creates a table with columns describing the mean concentrations across samples and log2(Fold Change) values obtained above. We will also specify the 2-fold FC cutoff for up- and downregulated features.
```{r}
mean.abundance <- (log2(rowMeans(mtx.normalized[,c(1,2)])) + log2(rowMeans(mtx.normalized[,c(3,4)])))/2 
ma.data <- as.data.frame(mean.abundance)
ma.data$lfc <- lfc
colnames(ma.data) <- c('MeanAbundance', 'Lfc')
ma.data$Sig <- 'NS'
ma.data$Sig[ma.data$Lfc > 1] <- 'Up'
ma.data$Sig[ma.data$Lfc < -1] <- 'Down'
```

Explore the number of up- and downregulated features:
```{r}
table(ma.data$Sig)
```

Now the differences in lipid abundance between Group1 and Group2 can be visualize using MA plot, where x-axis represents average abundance level and y-axis represents log2(Fold Change) values.
```{r}
p <- ggplot(ma.data, aes(x = MeanAbundance, y = Lfc, color = Sig)) + 
  geom_point() + 
  geom_hline(yintercept = c(-1, 1), linetype = "dashed") +
  scale_colour_manual(values=c("#1465AC", "darkgray", "#B31B21"))+
  xlab('log2(Mean normalized abundance)') + ylab('log2(Fold Change)') +
  theme_light()

p
```

## Software used
```{r}
sessionInfo()
```