-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathLipidomicAnalysis.Rmd
532 lines (428 loc) · 23.3 KB
/
LipidomicAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
---
title: "The Hitchhiker’s Guide to untargeted lipidomics analysis: Practical guidelines"
author: "D. Smirnov, P. Mazin, M. Osetrova, E. Stekolshchikova, E. Khrameeva"
date: "8/18/2021"
output:
github_document:
pandoc_args: "--webtex"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Here we present step-by-step guide to bioinformatic analysis of untargeted LS-MS lipidomic data. This manual reproduces the key steps described in the manuscript:
* Data importing
* Lipid signal extraction (peak picking, peak alignment, peak grouping)
* Filtering and normalization
* Visualization
**Note:** normalization and statistical methods as well as preprocessing parameters cannot be universally applicable and should be chosen in each individual case based on experimental conditions, instrument characteristics and study purposes!
## Package import
Loading the packages required for analysis
```{r, results='hide', message=FALSE, warning=FALSE}
library(xcms)
library(ggplot2)
library(DT)
library(IPO)
library(mixOmics)
library(dplyr)
library(missForest)
library(reshape2)
library(gridExtra)
library(SummarizedExperiment)
```
## Data import
We will demonstrate the key concepts of LC-MS untargeted lipidomic analysis using a test dataset. Raw MS files (~1.87 GB) from this dataset converted into the .mzXML format can be downloaded into the current directory using the following code:
```{r}
#url <- "https://makarich.fbb.msu.ru/khrameeva/brainmap/sampledata.tar.gz"
#download.file(url, destfile = 'sampledata.tar.gz', method = "curl")
#untar('sampledata.tar.gz')
```
Raw MS files located in the `sampledata/` folder are organized into three subfolders according to the sample groups (2 files per group + blank measurement). The code below will create a table with sample metadata
```{r}
mzfiles <- list.files('sampledata/', recursive = TRUE, full.names = TRUE, pattern = '.mzXML')
group <- unlist(lapply(strsplit(mzfiles,"/"), function (x) x[[3]]))
pd <- data.frame(sample_name = sub(basename(mzfiles), pattern = ".mzXML", replacement = "", fixed = TRUE),
sample_group = group,
stringsAsFactors = FALSE)
knitr::kable(pd)
```
Now .mzXML files can be imported into `MSnExp` object via `readMSData` function
```{r}
raw_data <- readMSData(files = mzfiles,
pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk",
msLevel = 1,
verbose = T,
centroided = T)
```
## Peak picking
Feature detection `CentWave` algorithm based on continuous wavelet transformation allows to distinguish lipid peaks from background noise. To perform peak calling we need to set up CentWave parameters first
```{r}
cwp <- CentWaveParam(peakwidth = c(9.5, 36),
ppm = 11.5,
noise = 0,
snthresh = 10,
mzdiff = -0.001,
prefilter = c(3, 100),
mzCenterFun = "wMean",
integrate = 1,
fitgauss = FALSE)
```
It's highly recommended to specify `peakwidth` (minimum and maximum MS peak width in RT dimension) and `ppm`(width of region of interest in the m/z dimension) parameters based on ion chromatograms for internal standards.
Once the parameters are set one can proceed to chromatographic peak detection using `findChromPeaks` function.
```{r, message=FALSE}
xset <- findChromPeaks(raw_data, param = cwp)
```
## Peak alignment
Peak alignment procedure aims to eliminate retention times shifts between identified chromatographic peaks from samples. We will use OBI-warp algorithm implemented within `xcms` package to match peaks between MS runs.
```{r}
arp <- ObiwarpParam(distFun = "cor_opt",
binSize = 1,
response = 1,
gapInit = 0.32,
gapExtend = 2.688,
factorDiag = 2,
factorGap = 1,
localAlignment = FALSE)
xset <- adjustRtime(xset, param = arp)
```
`ObiwarpParam` function creates an object with parameters for the alignment and `adjustRtime` performs the peak matching.
An example of peaks before and after RT shift correction is shown below
```{r, fig.width = 7, fig.height= 3.5}
group_colors <- c("black", "red", "forestgreen")
names(group_colors) <- unique(xset$sample_group)
par(mfrow=c(1,2),las=1)
chr <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = F)
chr.adj <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = T)
plot(chr, peakType = "none", col=group_colors[xset$sample_group], main = "Before alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
plot(chr.adj, peakType = "none", col=group_colors[xset$sample_group], main = "After alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
```
## Peak grouping
Once retention time correction is done one can proceed to the correspondence analysis of aligned peaks. The general idea of peak grouping is to identify peaks from the same ion across samples and group them together to form a lipid feature. In order to do this analysis, we will use Peak density method that accessible in `xcms` via `groupChromPeaks` function.
`PeakDensityParam` object contains grouping settings and is used as input for `groupChromPeaks`.
```{r}
pdp <- PeakDensityParam(sampleGroups = xset$sample_group,
bw = 0.879999999999999,
binSize = 0.02412,
minFraction = 0.000001,
minSamples = 1,
maxFeatures = 50)
xset <- groupChromPeaks(xset, param = pdp)
```
## Selection of parameters for peak picking, alignment, and grouping
For simplicity and saving time, in the code sections above we provided parameters optimized for untargeted lipidome LC-MS measurements on a Reversed-Phase Bridged Ethyl Hybrid (BEH) C8 column reverse coupled to a Vanguard precolumn, using a Waters Acquity UPLC system and a heated electrospray ionization source in combination with a Bruker Impact II QTOF (quadrupole-Time-of-Flight) mass spectrometer. To customize the parameters for a particular LS-MS experiment we recommend optimizing them via `IPO` package.
Please note that optimizing peak calling parameters with `optimizeXcmsSet` function is a quite computationally intensive operation. It may take hours or even days (depending on the number of samples) before the optimization process ends!
To perform the optimization just uncomment the code below. It will return the R script with optimized processing parameters.
Set up default parameters for peak picking optimization procedure:
```{r}
#peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
#peakpickingParameters$min_peakwidth = c(0,10)
#peakpickingParameters$max_peakwidth = c(10,30)
#peakpickingParameters$ppm = c(0,10)
```
Optimize peak picking parameters:
```{r}
#resultPeakpicking <- optimizeXcmsSet(files = mzfiles,
# params = peakpickingParameters,
# nSlaves = 0,
# subdir = NULL)
#optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset
```
Optimize retention time correction and grouping parameters:
```{r}
#retcorGroupParameters <- getDefaultRetGroupStartingParams()
#resultRetcorGroup <- optimizeRetGroup(xset = optimizedXcmsSetObject,
# params = retcorGroupParameters,
# nSlaves = 0,
# subdir = NULL)
#writeRScript(resultPeakpicking$best_settings$parameters,
# resultRetcorGroup$best_settings, 1)
```
## Imputation of missing values
Unfortunately, peak peaking algorithm may produces a sufficient numbers of NAs for those samples in which it wasn't able to identify MS peaks. We will try to impute missing chromatographic peaks within samples using `fillChromPeaks` function.
```{r}
xset <- fillChromPeaks(xset)
```
Please note that `fillChromPeaks` may not impute all the gaps in MS data. The remaining missing values will be further removed/imputed in the section "Filtering of peaks" below.
## Data export
Now we can extract feature matrix from `xset` object
```{r, message=FALSE}
pks <- chromPeaks(xset)
grs <- featureDefinitions(xset)
mtx <- featureValues(xset, method="maxint", value="into", filled=T)
knitr::kable(head(mtx))
```
Change the column names of mtx matrix
```{r, message=FALSE}
colnames(mtx) <- unlist(strsplit(colnames(mtx), split = '.mzXML'))
```
To correctly normalize abundance matrix `mtx` further we will need the information about internal standard used in the experiment. The code below will extract peaks corresponding to TG(15:0-18:1-d7-15:0).
```{r, message=FALSE}
source("src/rt-mz.annotator.R") # load the annotator function
```
Generate TAG that corresponds to the standard:
```{r}
std = generateTGL(15+18+15,1)
std$FORMULA = 'C51H89D7O6' #change 7 H to 7 D
std$EXACT_MASS = calcExactMass(std$FORMULA) # recalculate the exact mass using updated formula
knitr::kable(std)
```
Using `annotateByMass` function we will find all the peaks in the data that matched internal standard by exact mass value and filter them based on the elution time. In this function adducts can be defined by `ions` parameter, by default the function uses H, Na, NH4, K and aNH4. We will specify ppm threshold for acceptable m/z deviation in `annotateByMass` to 20.
```{r}
std.ann = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), std, ppm=20)
std.ann = std.ann[std.ann$rt > 600, ] # Although we don't know the exact RT value we know that TAGs elute after 10 minutes
knitr::kable(std.ann)
```
H-adduct looks suspicious (high ppm, different rt), so we can exclude it from analysis:
```{r}
std.ann = std.ann[std.ann$ppm < 5, ]
```
Finally, compute the median intensity of peaks found within each sample. Those will come in handy for normalization procedure.
```{r}
div <- apply(mtx[std.ann$id, c(2:5)], 2, function (x) median(x, na.rm = T))
```
## Annotation
Annotation is arguably the most tricky part of untargeted LS-MS analysis. Here we present a method to create annotation for lipid features obtained in the previous steps.
To obtain the full annotation set we will again utilize `annotateByMass` function, but this time m/z values of peaks will be matched to exact masses of existing lipids from LIPID MAPS database.
```{r, message=FALSE}
ann <- annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), db = LMDB, ppm = 20)
knitr::kable(head(ann))
```
```{r, message=FALSE}
table(ann$ion)
```
Here `ppmd` is calculated as
![ppmd = (1 - \frac{MZ}{EM - Adduct}) \cdot 1e^6
](https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+ppmd+%3D+%281+-+%5Cfrac%7BMZ%7D%7BEM+-+Adduct%7D%29+%5Ccdot+1e%5E6%0A)
where $MZ$ is a peak m/z value, $EM$ is an exact mass of known lipid from LIPID MAPS database and $Adduct$ represents the adduct mass.
Next we will explore the distribution of `ppmd` values from the annotation table:
```{r, message=FALSE}
h = hist(ann$ppmd, 1000, main = 'ppmd distribution', xlab = 'ppmd')
mode = which.max(h$counts)
mode = h$breaks[mode]/2 + h$breaks[mode+1]/2
abline(v = mode, col = 'red')
```
Most likely there is a m/z shift around -4 ppm (shown by red vertical line), it also corresponds well with `ppmd` of the internal standard.
Display the value of m/z shift
```{r}
mode
```
Take only annotation within mode +- 10
```{r}
ann = ann[ann$ppmd > (mode - 10) & ann$ppmd < (mode + 10), ]
```
Check annotation uniqueness
```{r}
annotation.freq <- table(table(ann$id))
print(paste(annotation.freq[1], "peaks have unique annotation"))
print(paste(annotation.freq[2], "peaks are annotated with two lipids"))
```
Add two columns with lipid categories and classes to `ann` table
```{r}
ann$category = substr(ann$LM_ID,3,4)
ann$class = substr(ann$LM_ID,3,6)
```
Display the number of lipids annotated uniquely on category level
```{r}
annc = unique(ann[,c('id','rt','mz','category')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on category level"))
```
We will retain features with unique annotation to build rt-mz plot:
```{r}
t = table(annc$id) # count the annotation variants for all features
annc = annc[annc$id %in% names(t)[t==1],] # filter out all features with multiply annotation names
cats = unique(annc$category) # get the list of lipid categories
cols = setNames(RColorBrewer::brewer.pal(length(cats),'Set1'),cats) # set color palette
annc <- annc[annc$rt > 120 & annc$rt < 1200, ] # specify the range of RT
plot(annc$rt,annc$mz,pch=16,col=cols[annc$category],xlab='RT', ylab='m/z', xlim = c(120, 1330), bty='n')
legend('topright',pch=16,col=cols,legend=cats)
```
Print the number of lipids annotated uniquely on class level
```{r}
annc <- unique(ann[,c('id','rt','mz','class')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on class level"))
```
Keep annotated features only
```{r}
mtx <- mtx[unique(ann$id),]
```
Also one can try to perform annotation for specific lipid class only. We will illustrate such approach using a custom TAG generator that creates a table of triacylglycerols (TAG(10:0) - TAG(70:8)) with corresponding lipid formula and exact masses.
```{r}
tags = generateTGL(10:70,0:8)
electron.mass = 0.00054858
knitr::kable(head(tags))
```
We will use the same `annotateByMass` function as a before, but with previously generated TAG table instead of full LIPIDMAPS database.
```{r}
ann2 = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed),
tags,
ions = c(NH4=calcExactMass('NH4') - electron.mass),
ppm=20)
knitr::kable(head(ann2))
```
`ann2` annotation table contains a number of peaks that have non-unique annotation. We will plot the rt-m/z scatter plot and highlight the area of correct TAGs by red rectangle
```{r}
plot(ann2$rt,ann2$mz, xlab = 'RT', ylab = 'm/z')
rect(820, 700, 1050, 1000, border = 'red', col=NA)
```
Since the plot above showed that there are additional peaks in obtained TAG annotation table, we will try to exclude them using an approach based on grid-like pattern in rt-mz coordinates (please see the manuscript for the method description).
```{r}
nets <- lookForNets(ann2,rt.win=c(1,30)) # by default rt.win is in mins, we will change it to secs
net.table <- sort(table(nets$start)) # count the number of features per net
start <- names(net.table)[length(names(net.table))] # get id of the biggest net
tags = nets[!is.na(nets$start) & nets$start==as.numeric(start),] # select the biggest net
plotNet(tags)
```
The net looks quite noisy, so we will plot it again, but with circle size based on value of `ppmd` deviation of peaks
```{r}
plot(tags$rt, tags$mz, cex = abs(tags$ppmd-mode)/6, xlab = 'RT', ylab = 'm/z')
```
Filter out the peaks with high `ppmd` deviation
```{r}
net <- plotNet(tags[abs(tags$ppmd-mode) < 5, ])
```
The net looks better, but there are still some peaks that pretends to be same lipid, so the filtering procedure needs manual curation.
## Filtering of peaks
We will perform feature filtering based on blank samples to retain all features for which the median concentration ratio between biological samples and blank samples is greater than 2.
```{r, message=FALSE}
med.MS <- apply(mtx, 1, function(x) log10(median(x[grep("MS",colnames(mtx),perl=T)], na.rm=T)))
med.blank <- apply(mtx, 1, function(x) log10(median(x[grep("Blank",colnames(mtx))], na.rm=T)))
filter.blank <- (med.MS - med.blank) > log10(2)
filter.blank[filter.blank==T] <- NA
filter.blank[is.na(filter.blank)] <- T
mtx <- mtx[filter.blank, c(2:5)]
```
The proportion of filtered features can be visualized using a mean-difference plot:
```{r, message=FALSE}
plot((med.MS+med.blank)/2,
med.MS-med.blank,
pch=21, las=1, bg="gray",
col="dimgray", cex=1.2, lwd=0.4, xlab = "Sample Intensity", ylab = "Sample - Blank")
points((med.MS[filter.blank]+med.blank[filter.blank])/2,
med.MS[filter.blank]-med.blank[filter.blank],
pch=21, bg="#B20F25", col="dimgray", cex=1.2, lwd=0.4)
abline(h=log10(2),col="#B20F25")
```
The code below removes all features that possess more than 30% of NA across samples.
```{r, message=FALSE}
th <- 0.3
peaks.nas <- apply(mtx, 1, function (x) sum(is.na(x)))
mtx <- mtx[(peaks.nas/ncol(mtx)) < th, ]
```
To impute missing values not filled by `fillChromPeaks` and not removed by filtering we will use an implementation of random forest algorithm from `MissForest` package.
```{r, message=FALSE}
mtx.imp <- missForest(mtx)
mtx <- mtx.imp$ximp
```
## Normalization
In order to make samples comparable to each other we will utilize sample-specific normalization by wet weight and internal standard.
Load the matrix of wet weights for our samples
```{r, warning=FALSE}
wetw <- as.matrix(read.csv("POS.WETWEIGHT.csv", header=F, row.names=1))
knitr::kable(wetw)
```
Perform the normalization
```{r}
wetw <- wetw[colnames(mtx), ]
mtx.normalized <- t(apply(mtx, 1, function (x) x*mean(wetw)*mean(div)/(wetw*div)))
mtx.log <- log10(mtx.normalized)
```
## Downstream analysis
Processed matrix with quantified lipid abundances can be used for downstream analysis. We will apply two classical multivariate approaches (PCA and PLS-DA) as well as univariate statistical methods to analyze the intergroup differences between lipid profiles.
### Principal Component Analysis (PCA)
PCA is a multivariate technique that extremely useful for classification purpose. The key idea of the method is to project original matrix of lipid abundances into low dimensional space. To perform dimensionality reduction PCA computes the reduced set of uncorrelated variables named `Principal Components`. For a given matrix $X_{n \times m}$, where features are rows and samples are columns, principal component vectors can be defined by finding eigenvectors of the following sample covariance matrix $S$:
![\begin{align*}
S = \frac{1}{n-1} X^{T} C_n X
\end{align*}
](https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+%5Cbegin%7Balign%2A%7D%0AS+%3D+%5Cfrac%7B1%7D%7Bn-1%7D+X%5E%7BT%7D+C_n+X%0A%5Cend%7Balign%2A%7D%0A)
where $C_n = I_n - \frac{1}{n}1_{n} 1^{T}_{n}$ is a centering matrix, $I_n$ represents an identity matrix of size $n$.
To calculate principal components we will use base R function `prcomp`
```{r, message=FALSE}
pca <- prcomp(t(mtx.log), center = TRUE, scale. = TRUE)
```
To visualize relationships between samples in a new low dimensional space we will plot PC1 and PC2 against each other.
```{r, message=FALSE}
Y <- pd$sample_group[2:5]
pca.data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], class = Y)
ggplot(data = pca.data, aes_string(x = "PC1", y = "PC2", color = "class", shape = "class")) +
geom_point(size = 5) +
theme_light()
```
### Partial Least-Squares Discriminant Analysis (PLS-DA)
While both PCA and PLS-DA achieve dimensionality reduction computing the principal components, PLS-DA can be used rather for classification and feature selection than for clustering purpose. Mathematically, PLS-DA principal components can be obtained in a similar to PCA manner, as eigenvectors of a matrix of covariances between $X$ and $Y$:
![\begin{align*}
S = S_{xy} S_{yx} = \frac{1}{n-1} X^{T} C_n Y \times \frac{1}{n-1} Y^{T} C_n X = \frac{1}{(n-1)^2} X^{T} C_n Y Y^{T} C_n X
\end{align*}
](https://render.githubusercontent.com/render/math?math=%5Cdisplaystyle+%5Cbegin%7Balign%2A%7D%0AS+%3D+S_%7Bxy%7D+S_%7Byx%7D+%3D+%5Cfrac%7B1%7D%7Bn-1%7D+X%5E%7BT%7D+C_n+Y+%5Ctimes+%5Cfrac%7B1%7D%7Bn-1%7D+Y%5E%7BT%7D+C_n+X+%3D+%5Cfrac%7B1%7D%7B%28n-1%29%5E2%7D+X%5E%7BT%7D+C_n+Y+Y%5E%7BT%7D+C_n+X%0A%5Cend%7Balign%2A%7D%0A)
where $C_n$ and $n$ represent a centering matrix and a total number of samples, respectively.
We will use the sparse version of the algorithm - sPLS-DA. First of all, matrix of predictors and vector of responses should be defined
```{r, message=FALSE}
X <- t(mtx.normalized)
Y <- as.factor(Y)
```
Tune sPLS-DA parameters using Leave-One-Out cross validation
```{r, message=FALSE}
list.keepX <- seq(1, 100, 2)
tune.splsda <- tune.splsda(X, Y, ncomp = 2, validation = 'loo', folds = 4,
progressBar = FALSE, dist = 'max.dist',
test.keepX = list.keepX, nrepeat = 1)
```
Run sPLS-DA with optimized parameters
```{r, message=FALSE, warning = FALSE}
splsda.model <- splsda(X, Y, ncomp = 2, keepX = tune.splsda$choice.keepX)
plotIndiv(splsda.model, ind.names = FALSE, legend=TRUE, ellipse = TRUE)
```
A vector of feature contributions can be retrieved from the model in the following way
```{r, message=FALSE}
plsda.contributions <- selectVar(splsda.model, comp = 1)$value
```
## Statistical analysis
Univariate methods (e.g. t-test, Wilcoxon rank sum test, ANOVA) are especially useful for detecting differences in concentration between samples on the level of single molecules.
As an illustrative example, we will apply Wilcoxon rank sum test (WRST) to the task of comparing Group1 and Group1 lipid abundances. In contrast to t-test, this test doesn't assume that the data has a normal distribution.
```{r}
stats <- apply(mtx.normalized, 1, function (x) wilcox.test(x[1:2], x[3:4])$p.value)
```
Given the small sample size of groups studied (two samples per group) the minimal raw p-value we can obtain with WRST is 0.33.
```{r}
min(stats)
```
The one more important step when dealing with statistical testing of hundreds or thousands features is a multiple testing correction procedure. Setting a significance level to 0.05 one can found a 5% of features to be significant even they are not. So, if the 5000 lipids are tested, 250 lipids could be significant by chance. To avoid having large number of false positive results, raw p-value can be adjust using multiply testing correction methods (e.g. Bonferroni, Holm or FDR corrections).
```{r}
p.val <- p.adjust(stats, method = 'fdr')
```
The statistical testing often accompanied by `Fold Change` (FC) calculation. FC shows the magnitude and direction of change in lipid concentration between group studied.
```{r}
lfc <- log2(rowMeans(mtx.normalized[,c(1,2)])) - log2(rowMeans(mtx.normalized[,c(3,4)]))
```
The code below creates a table with columns describing the mean concentrations across samples and log2(Fold Change) values obtained above. We will also specify the 2-fold FC cutoff for up- and downregulated features.
```{r}
mean.abundance <- (log2(rowMeans(mtx.normalized[,c(1,2)])) + log2(rowMeans(mtx.normalized[,c(3,4)])))/2
ma.data <- as.data.frame(mean.abundance)
ma.data$lfc <- lfc
colnames(ma.data) <- c('MeanAbundance', 'Lfc')
ma.data$Sig <- 'NS'
ma.data$Sig[ma.data$Lfc > 1] <- 'Up'
ma.data$Sig[ma.data$Lfc < -1] <- 'Down'
```
Explore the number of up- and downregulated features:
```{r}
table(ma.data$Sig)
```
Now the differences in lipid abundance between Group1 and Group2 can be visualize using MA plot, where x-axis represents average abundance level and y-axis represents log2(Fold Change) values.
```{r}
p <- ggplot(ma.data, aes(x = MeanAbundance, y = Lfc, color = Sig)) +
geom_point() +
geom_hline(yintercept = c(-1, 1), linetype = "dashed") +
scale_colour_manual(values=c("#1465AC", "darkgray", "#B31B21"))+
xlab('log2(Mean normalized abundance)') + ylab('log2(Fold Change)') +
theme_light()
p
```
## Software used
```{r}
sessionInfo()
```