-
Notifications
You must be signed in to change notification settings - Fork 1
/
exercise_5.Rmd
503 lines (337 loc) · 20.4 KB
/
exercise_5.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
---
title: 'Exercises'
output:
html_document:
toc: false
code_folding: hide
---
```{r setup, echo=FALSE, purl = FALSE}
knitr::opts_chunk$set(echo=TRUE, message = FALSE, warning = FALSE, eval = FALSE, cache = FALSE)
SOLUTIONS <- FALSE
```
\
## Exercise 5: Basic statistics in R
\
Read [Chapter 6](https://intro2r.com/stats_r.html) to help you complete the questions in this exercise.
\
Although this short course is primarily focussed on introducing you to R, it wouldn't be complete if we didn't have a peek at some of R's statistical roots. Having said that, this will be a very brief overview with very little in the way of theory so don't worry if you get a little lost - this is just a taster, the main course is still to come!
\
1\. Download the datafile *'prawnGR.csv'* from the **[<i class="fa fa-download"></i> Data](data.html)** link and save it to the `data` directory. Import these data into R and assign to a variable with an appropriate name. These data were collected from an experiment to investigate the difference in growth rate of the [giant tiger prawn](https://en.wikipedia.org/wiki/Penaeus_monodon) (*Penaeus monodon*) fed either an artificial or natural diet. Have a quick look at the structure of this dataset and plot the growth rate versus the diet using an appropriate plot. How many observations are there in each diet treatment?
```{r Q1, echo=SOLUTIONS}
prawns <- read.table('data/prawnGR.CSV', sep = ",", header = TRUE,
stringsAsFactors = TRUE)
# or
prawns <- read.csv("data/prawnGR.CSV", stringsAsFactors = TRUE)
# take a look at the data
str(prawns)
# 'data.frame': 60 obs. of 2 variables:
# $ GRate: num 9.77 10.29 10.05 10.08 9.31 ...
# $ diet : Factor w/ 2 levels "Artificial","Natural":...
summary(prawns)
# GRate diet
# Min. : 7.395 Artificial:30
# 1st Qu.: 9.550 Natural :30
# Median : 9.943
# Mean : 9.920
# 3rd Qu.:10.344
# Max. :11.632
# how many replicates for each level of diet
table(prawns$diet)
# Artificial Natural
# 30 30
# or use xtabs
xtabs(~ diet, data = prawns)
# produce a boxplot
boxplot(GRate ~ diet, data = prawns, xlab = "Diet", ylab = "Growth Rate")
```
\
2\. You want to compare the difference in growth rate between the two diets using a two sample t-test. Before you conduct the test, you need to assess the normality and equal variance assumptions. Use the function `shapiro.test()` to assess normality of growth rate for each diet separately (Hint: use your indexing skills to extract the growth rate for each diet `GRate[diet=='Natural']` first). Use the function `var.test()` to test for equal variance (see `?var.test` for more information or [Section 6.1](https://intro2r.com/one-and-two-sample-tests.html#one-and-two-sample-tests) of the book for more details). Are your data normally distributed and have equal variances? Note: We don't really advocate using these 'approaches' for assessing the normality and equal variance assumptions assumptions but include them here as many people will still want to use them. A much better way to assess assumptions is to use diagnostic plots of the residuals (see Q6 for an example).
```{r Q2, echo=SOLUTIONS}
# test normality assumption
# Do not perform test on all data together, i.e.
shapiro.test(prawns$GRate) # this is wrong!!
# Need to test for departures from normality for each group
# separately. Remember your indexing [ ]
shapiro.test(prawns$GRate[prawns$diet == "Artificial"])
# Shapiro-Wilk normality test
#
# data: prawns$GRate[prawns$diet == "Artificial"]
# W = 0.9486, p-value = 0.1553
shapiro.test(prawns$GRate[prawns$diet == "Natural"])
# Shapiro-Wilk normality test
#
# data: prawns$GRate[prawns$diet == "Natural"]
# W = 0.9598, p-value = 0.307
# Therefore no evidence to reject the Null hypothesis and data are normally distributed
# However much better assessment of normality is to use a quantile - quantile plot
# looking for points to lie along the line for normality
qqnorm(prawns$GRate[prawns$diet == "Artificial"])
qqline(prawns$GRate[prawns$diet == "Artificial"])
qqnorm(prawns$GRate[prawns$diet == "Natural"])
qqline(prawns$GRate[prawns$diet == "Natural"])
# test for equal variance
# if normal
# Null hypothesis Ho: variances are equal
var.test(prawns$GRate ~ prawns$diet)
# F test to compare two variances
# data: prawns$GRate by prawns$diet
# F = 1.9629, num df = 29, denom df = 29, p-value = 0.07445
# alternative hypothesis: true ratio of variances is not equal to 1
# 95 percent confidence interval:
# 0.9342621 4.1240043
# sample estimates:
# ratio of variances
# 1.962881
# No evidence to reject null hypothesis (P=0.07) therefore no
# difference in variance
```
\
3\. Conduct a two sample t-test using the `t.test()` function ([Section 6.1](https://intro2r.com/one-and-two-sample-tests.html#one-and-two-sample-tests) of the book). Use the argument `var.equal = TRUE` to perform the t-test assuming equal variances. What is the null hypothesis you want to test? Do you reject or fail to reject the null hypothesis? What is the value of the t statistic, degrees of freedom and p value? How would you summarise these summary statistics in a report?
```{r Q3, echo=SOLUTIONS}
# conduct t-test assuming equal variances
# Null hypothesis Ho: no difference in growth rate
# between prawns fed on artificial diet or Natural diet
t.test(GRate ~ diet, var.equal = TRUE, data = prawn)
# Two Sample t-test
#
# data: prawns$GRate by prawns$diet
# t = -1.3259, df = 58, p-value = 0.1901
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.6319362 0.1283495
# sample estimates:
# mean in group Artificial mean in group Natural
# 9.794133 10.045927
#
# No evidence to reject the Null hypothesis, therefore no
# difference in growth rate of prawns fed on either artificial
# or natural diet (t = -1.33, df = 58, p = 0.19).
```
\
4\. An alternative (but equivalent) way to compare the mean growth rate between diets is to use a linear model. Use the `lm()` function to fit a linear model with `GRate` as the response variable and `diet` as an explanatory variable (see [Section 6.3](https://intro2r.com/simple_lm.html#simple_lm) for a very brief introduction to linear modelling). Assign (`<-`) the results of the linear model to a variable with an appropriate name (i.e. `growth.lm`).
```{r Q4, echo=SOLUTIONS}
# fit the model
growth.lm <- lm(GRate ~ diet, data = prawns)
```
\
5\. Produce an ANOVA table for the fitted model using the `anova()` function i.e. `anova(growth.lm)`. Compare the ANOVA p value to the p value obtained using a t-test. What do you notice? What is the value of the F statistics and degrees of freedom? How would you summarise these results in a report?
```{r Q5, echo=SOLUTIONS}
# produce the ANOVA table
anova(growth.lm)
# Analysis of Variance Table
#
# Response: GRate
# Df Sum Sq Mean Sq F value Pr(>F)
# diet 1 0.951 0.95100 1.7579 0.1901
# Residuals 58 31.377 0.54098
# notice the p value is the same as for the t-test
# also if you square the t statistic from the t-test
# you get the F statistic from the linear model.
# They're the same test
```
\
6\. Assess the normality and equal variance assumptions by plotting the residuals of the fitted model (see [Section 6.3](https://intro2r.com/simple_lm.html#simple_lm) for more details). Split the plotting device into 2 rows and 2 columns using `par(mfrow=c(2,2))` so you can fit four plots on a single device. Use the `plot()` function on your fitted model (`plot(growth.lm)`) to plot the graphs. Discuss with an instructor how to interpret these plots. What are your conclusions?
```{r Q6, echo=SOLUTIONS}
# plot the residuals to assess normality and equal variance
# divide the plotting device into 2 rows and 2 columns to get all
# the graphs on one device
par(mfrow = c(2,2))
plot(growth.lm)
```
\
7\. Download the datafile *'Gigartina.csv'* from the **[<i class="fa fa-download"></i> Data](data.html)** link and save it to the `data` directory. Import the dataset into R and assign the dataframe an appropriate name. These data were collected from a study to examine the change in `diameter` of red algae [*Mastocarpus stellatus*](https://en.wikipedia.org/wiki/Mastocarpus_stellatus) spores grown in three different diatom cultures and a control group grown in artificial seawater (`diatom.treat` variable). Use the function `str()` to examine the dataframe. How many replicates are there per diatom treatment? Use an appropriate plot to examine whether there are any obvious differences in diameter between the treatments.
```{r Q7, echo=SOLUTIONS}
gigartina <- read.table('data/Gigartina.CSV', header = TRUE, sep = ",",
stringsAsFactors = TRUE)
# or
gigartina <- read.csv('data/Gigartina.CSV',
stringsAsFactors = TRUE)
str(gigartina)
# 'data.frame': 40 obs. of 2 variables:
# $ diameter : int 110 115 110 108 109 101 101 98 120 ...
# $ diatom.treat: Factor w/ 4 levels "ASGM","Sdecl",..: 1 1...
table(gigartina$diatom.treat)
# ASGM Sdecl Sexpo Sstat
# 10 10 10 10
# or use xtabs
xtabs(~ diatom.treat, data = gigartina)
# diatom.treat
# ASGM Sdecl Sexpo Sstat
# 10 10 10 10
# plot these data
boxplot(diameter ~ diatom.treat, data = gigartina, xlab = "diatom treatment", ylab = "diameter (um)")
#or if you want to do the fancy um symbol correctly
boxplot(diameter ~ diatom.treat, data = gigartina, xlab = "diatom treatment", ylab = expression(paste("diameter", " (",mu,"m)")))
```
\
8\. You wish to compare the mean diameter of *Metacarpus* grown in the four treatment groups (`ASGM`, `Sdecl`, `Sexpo`, `Sstat`) using a one-way analysis of variance (ANOVA). What is your null hypothesis?
```{r, Q8, echo=SOLUTIONS}
# The null hypothesis Ho: there is no difference in mean diameter
# of the spores between the different treatment groups
```
\
9\. We will conduct the ANOVA using the linear model function `lm()` once again. Make sure you know which of the variables is your response variable and which is your explanatory variable (ask an instructor if in doubt). Fit the linear model and assign the model output to a variable with an appropriate name (i.e. `gigartina.lm`).
```{r Q9, echo=SOLUTIONS}
gigartina.lm <- lm(diameter ~ diatom.treat, data = gigartina)
```
\
10\. Produce an ANOVA table using the `anova()` function. What is the value of the p value? Do you reject or fail to reject the null hypothesis? What is the value of the *F* statistic and degrees of freedom? How would you report these summary statistics in a report?
```{r Q10, echo=SOLUTIONS}
anova(gigartina.lm)
# Analysis of Variance Table
#
# Response: diameter
# Df Sum Sq Mean Sq F value Pr(>F)
# diatom.treat 3 1880.3 626.76 22.775 1.929e-08 ***
# Residuals 36 990.7 27.52
# ---
# reject the null hypothesis, therefore there is a significant
# difference in the diameter between the treatment groups
# (F_3,36 = 22.78, p < 0.001)
```
\
11\. Assess the assumptions of normality and equal variance of the residuals by producing the residual plots as before. Don’t forget to split the plotting device into 2 rows and 2 columns before plotting. Discuss with an instructor whether the residuals meet these assumptions. Do you accept this model as acceptable?
```{r Q11, echo=SOLUTIONS}
par(mfrow = c(2,2))
plot(gigartina.lm)
# residual plots look ok to me!
```
\
12\. Let’s compare the treatment group means to determine which treatment group is different from other treatment groups. In general, you should be careful with these types of post-hoc comparisons, especially if you have a large number of groups (There are much better ways to do this, but that's for another course!). In this case we only have 4 groups, and therefore we will use Tukey’s Honest significant difference to perform the comparisons and control for type 1 error rate (rejecting a true null hypothesis).
We will use the function `TukeyHSD()` from the `mosaic` package to perform these comparisons (you will need to install this package first and then use `library(mosaic)` to make the function available). Which groups are different from each other if we use the p-value cutoff (alpha) of p < 0.05?
```{r Q13, echo=SOLUTIONS}
# what group mean is different from what? Post-hoc comparisons.
# we will use Tukey's Honest significant difference method
# to compare group means.
# install.packages('mosaic')
library(mosaic)
# compare the group means using TukeysHSD method
TukeyHSD(gigartina.lm)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = diameter ~ diatom.treat, data = gigartina)
#
# $diatom.treat
# diff lwr upr p adj
# Sdecl-ASGM -14.3 -20.6184102 -7.98159 0.0000030
# Sexpo-ASGM -8.9 -15.2184102 -2.58159 0.0029489
# Sstat-ASGM -18.3 -24.6184102 -11.98159 0.0000000
# Sexpo-Sdecl 5.4 -0.9184102 11.71841 0.1165421
# Sstat-Sdecl -4.0 -10.3184102 2.31841 0.3360087
# Sstat-Sexpo -9.4 -15.7184102 -3.08159 0.0016145
# the null hypothesis for each comparison is
# grp1 - grp2 = 0 (i.e. no difference)
# Sdecl-ASGM, Sexpo-ASGM, Sstat-ASGM and Sstat-Sexpo
# are significantly different
```
\
13\. We can also produce a plot of the comparisons to help us interpret the table of comparisons. Use the `plot()` function with the `TukeyHSD(gigartina.lm)`. Ask if you get stuck (or Google it!).
```{r Q14, echo=SOLUTIONS}
plot(TukeyHSD(gigartina.lm), cex.axis = 0.5, las = 2)
```
\
14\. Download the *'TemoraBR.csv'* file from the **[<i class="fa fa-download"></i> Data](data.html)** link and save it to the `data` directory. Import the dataset into R and as usual assign it to a variable. These data are from an experiment that was conducted to investigate the relationship between temperature (`temp`) and the beat rate (Hz) `beat_rate` of the copepod [*Temora longicornis*](https://en.wikipedia.org/wiki/Temora_longicornis) which had been acclimatised at three different temperature regimes (`acclimitisation_temp`). Examine the structure of the dataset. How many variables are there? What type of variables are they? Which is the response (dependent) variable, and which are the explanatory (independent) variables?
```{r Q15, echo=SOLUTIONS, tidy = TRUE}
temora <- read.table('data/TemoraBR.CSV', header = TRUE, sep = ",",
stringsAsFactors = TRUE)
# or
temora <- read.csv('data/TemoraBR.CSV',
stringsAsFactors = TRUE)
str(temora)
# 'data.frame': 45 obs. of 3 variables:
# $ temp : num 5 6 7 10 11 12 13 15 16 17 ...
# $ beat_rate : num 3.76 5.4 8 9.4 16.6 18.5 19...
# $ acclimitisation_temp: int 5 5 5 5 5 5 5 5 5 5 ...
```
\
15\. What type of variable is `acclimitisation_temp`? Is it a factor? Convert `acclimitisation_temp` to a factor and store the result in a new column in your dataframe called `Facclimitisation_temp`. Hint: use the function `factor()`. Use an appropriate plot to visualise these data (perhaps a coplot or similar?).
```{r Q16, echo=SOLUTIONS, tidy = TRUE}
temora$Facclimitisation_temp <- factor(temora$acclimitisation_temp)
# boxplot of beat rate and acclimitisation temp
boxplot(beat_rate ~ Facclimitisation_temp, data = temora, xlab = "acclimitisation temp", ylab = "beat rate")
# scatter plot using the with function
with(temora, plot(beat_rate ~ temp, xlab = "temperature", ylab = "beat rate"))
# using a coplot
coplot(beat_rate ~ temp | Facclimitisation_temp, data = temora)
# scatter plot with different symbols and colours
with(temora, plot(beat_rate ~ temp, xlab = "temperature", ylab = "beat rate", col = as.numeric(Facclimitisation_temp), pch = as.numeric(Facclimitisation_temp)))
legend("topleft", legend = c("5", "10", "20"), pch = unique(as.numeric(temora$Facclimitisation_temp)), col = unique(as.numeric(temora$Facclimitisation_temp)))
# or more flexibly
plot(beat_rate ~ temp, xlab = "temperature", ylab = "beat rate", type = "n", data = temora)
with(temora, points(beat_rate[Facclimitisation_temp == "5"] ~ temp[Facclimitisation_temp == "5"], pch = 1, col = "black"))
with(temora, points(beat_rate[Facclimitisation_temp == "10"] ~ temp[Facclimitisation_temp == "10"], pch = 2, col = "red"))
with(temora, points(beat_rate[Facclimitisation_temp == "20"] ~ temp[Facclimitisation_temp == "20"], pch = 3, col = "blue"))
legend("topleft", legend = c("5", "10", "20"), col = c("black", "red","blue"), pch = c(1,2,3))
```
\
16\. We will analyse these data using an Analysis of Covariance (ANCOVA) to compare the slopes and the intercepts of the relationship between `beat_rate` and `temp` for each level of `Facclimatisation_temp`. Take a look at the plot you produced in Q16, do you think the relationships are different?
```{r Q17, echo=SOLUTIONS}
# the slope of the relationship between beat rate and temp
# look different for each acclimitisation temp
```
\
17\. As usual we will fit the model using the `lm()` function. You will need to fit the main effects of `temp` and `Facclimatisation_temp` and the interaction between `temp` and `Facclimatisation_temp`. You can do this using either of the equivalent specifications:
`temp + Facclimatisation_temp + temp:Facclimatisation_temp` or
`temp * Facclimatisation_temp`
```{r Q18, echo=SOLUTIONS, tidy = TRUE}
temora.lm <- lm(beat_rate ~ temp + Facclimitisation_temp + temp:Facclimitisation_temp, data = temora)
# or equivalently
temora.lm <- lm(beat_rate ~ temp * Facclimitisation_temp, data = temora)
```
\
18\. Produce the summary ANOVA table as usual. Is the interaction between `temp` and `Facclimatisation_temp` significant? What is the interpretation of the interaction term? Should we interpret the main effects of ```temp``` and `Facclimatisation_temp` as well?
```{r Q19, echo=SOLUTIONS}
anova(temora.lm)
# Analysis of Variance Table
# Response: beat_rate
# Df Sum Sq Mean Sq F value Pr(>F)
# temp 1 4293.7 4293.7 835.866 < 2.2e-16 ***
# Facclimitisation_temp 2 1197.7 598.8 116.576 < 2.2e-16 ***
# temp:Facclimitisation_temp 2 284.1 142.0 27.651 3.331e-08 ***
# Residuals 39 200.3 5.1
# there is a significant interaction between temp and
# Facclimitisation_temp therefore there is a significant
# relationship between beat_rate and temp, and this relationship
# is different depending on the level of Facclimitisation_temp.
# Therefore we should not interpret the main effect of temp
# or Facclimitisation_temp
```
\
19\. Assess the assumptions of normality and equal variance of the residuals in the usual way. Do the residuals meet these assumptions? Discuss with a instructor.
```{r Q20, echo=SOLUTIONS}
par(mfrow = c(2,2))
plot(temora.lm)
# there is a hint of heterogeneity of variance (non equal variance)
# as the variance increases with the fitted values. This is typical
# of count data.
```
\
20\. Write a short summary in you R script (don’t forget to comment this out with `#`) describing the interpretation of this model. Report the appropriate summary statistics such as *F* values, degrees of freedom and p values.
\
21\. (Optional) refit the model using the square root transformed `beat_rate` as the response variable. Does the interpretation of the model change? Do the validation plots of the residuals look better?
```{r Q22, echo=SOLUTIONS}
# we could try square root transforming the variable
# beat_rate to stabilise the variance
# square root transform beat_rate and store in the dataframe
temora$SQRT_beatrate <- sqrt(temora$beat_rate)
# refit the model using the square root transformed data
temora.lm2 <- lm(SQRT_beatrate ~ temp * Facclimitisation_temp, data = temora)
par(mfrow = c(2,2))
plot(temora.lm2)
# Residuals look a bit better
# now lets look at the ANOVA table for our new model
anova(temora.lm2)
# Analysis of Variance Table
#
# Response: SQRT_beatrate
# Df Sum Sq Mean Sq F value Pr(>F)
# temp 1 67.916 67.916 712.1746 < 2.2e-16 ***
# Facclimitisation_temp 2 17.600 8.800 92.2782 1.632e-15 ***
# temp:Facclimitisation_temp 2 1.151 0.576 6.0353 0.005205 **
# Residuals 39 3.719 0.095
# model has the same interpretation but the p value for the
# interaction term is a bit larger.
```
\
End of Exercise 5