diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 72941f04..00000000 Binary files a/.DS_Store and /dev/null differ diff --git a/pgt52m/.DS_Store b/pgt52m/.DS_Store deleted file mode 100644 index 88cadaaf..00000000 Binary files a/pgt52m/.DS_Store and /dev/null differ diff --git a/r4babs4/r4babs4.qmd b/r4babs4/r4babs4.qmd index 772f7f33..537dafe9 100644 --- a/r4babs4/r4babs4.qmd +++ b/r4babs4/r4babs4.qmd @@ -20,30 +20,57 @@ The BABS4 Module Learning outcomes that relate to the Data Analysis in R content # How R4BABS 4 is organised -Each x has: + -- An overview on the "About" page which gives the Learning Objectives, a topic summary and the instructions for the week. You should read this first. + -- Some independent study on the "Prepare!" page to prepare you for the workshop. This will be reading from the course book ([Computational Analysis for Bioscientists](https://3mmarand.github.io/comp4biosci/)), watching a video, or doing some coding or set up. It is designed to take about 30-45 mins on average. You will most likely learn best if you can find people to study with. + -- A two-hour workshop using R. This will usually start with me doing a short demonstration of one or more of the examples that were in "Prepare!" but you will spend most of the session going through some exercises. Anything you have not done before is explained and guided but you will also have to use the skills gained in previous workshops. I often remind you to take care of future you by making notes so you can look up your previous work but you can also search the [R4BABS](https://3mmarand.github.io/R4BABS/) site (search is top right). Talking to other people in the workshop about the exercises and working together will really help you understand more. There will be plenty of help from me and my demonstrators. + -- Some independent study on the "Consolidate!" page to give you more practice. The exercises are usually similar to those in the workshop but with less guidance. Occasionally, there will be reading to do. It is designed to take about 30-45 mins on average but may be quicker if you understood the workshop very well or slower if you need to revisit the workshop. + -Learning Data Analysis in R is like learning to speak a new language or play an instrument or a technical sport - you can't really rush it or cram for it. You need regular practice. + -- a little bit of engagement and practice is always better than none + -- if you get behind, just pick up where you left off rather than jumping in. It is fine to work on a previous week's workshop + # Content -## All Strands +## Week 1 Core -many var and obs, examples, PCA, log fc transformation, normalisation, QC gating, missing values, excluding proteins id'd from fewer thaan two peptides, FDR, data structures +Workshop -## -## +All strands -## \ No newline at end of file +many var and obs, examples, use the emphasise equivalance +PCA, log fc transformation, normalisation, QC gating, missing values, excluding proteins id'd from fewer than two peptides, FDR, data structures + + +## Week 2 Immunobiology - 1 + +Workshop + +There are three different cell treatments: Media, Lipopolysaccharide (LPS) and E. coli (NeonGreen fluorescence)) +Two experimental conditions: Isotype antibody and TNFalpha antibody conjugated to Allophycocyanin (APC) + + + +## Week 3 + + + +## Week 4 Immunobiology - 2 + +Workshop + + + +## Week 5 + + +## Week 6 Immunobiology - 3 + +Workshop \ No newline at end of file diff --git a/r4babs4/week-1/data-raw/adipocytes.txt b/r4babs4/week-1/data-raw/adipocytes.txt new file mode 100644 index 00000000..1c4c608c --- /dev/null +++ b/r4babs4/week-1/data-raw/adipocytes.txt @@ -0,0 +1,31 @@ +adiponectin treatment +7.39 nicotinic +7.58 nicotinic +6.87 nicotinic +7.23 nicotinic +9.32 nicotinic +5.72 nicotinic +7.12 nicotinic +9.74 nicotinic +8.55 nicotinic +6.31 nicotinic +6.71 nicotinic +8.01 nicotinic +10.62 nicotinic +8.33 nicotinic +3.13 nicotinic +3.77 control +2.46 control +6.9 control +5.89 control +4.31 control +7.86 control +4.35 control +6.61 control +6.19 control +6.52 control +3.94 control +5.86 control +6.36 control +6.83 control +5.34 control diff --git a/r4babs4/week-1/data-raw/beewing.txt b/r4babs4/week-1/data-raw/beewing.txt new file mode 100644 index 00000000..62039e44 --- /dev/null +++ b/r4babs4/week-1/data-raw/beewing.txt @@ -0,0 +1,101 @@ +wing +3.6 +3.7 +3.8 +3.8 +4 +4 +4 +4 +4.1 +4.1 +4.1 +4.1 +4.1 +4.1 +4.2 +4.2 +4.2 +4.2 +4.2 +4.2 +4.2 +4.3 +4.3 +4.3 +4.3 +4.3 +4.3 +4.3 +4.3 +4.4 +4.4 +4.4 +4.4 +4.4 +4.4 +4.4 +4.4 +4.4 +5.1 +5.1 +5.1 +5.1 +4.5 +4.5 +4.5 +4.5 +4.5 +4.5 +4.5 +4.5 +4.5 +4.5 +4.9 +4.9 +4.9 +4.9 +4.9 +4.9 +4.9 +5.2 +5.2 +5.3 +5.3 +4.6 +4.6 +4.6 +4.6 +4.6 +4.6 +4.6 +4.6 +4.6 +4.6 +5 +5 +5 +5 +5 +5.4 +5.5 +4.7 +4.7 +4.7 +4.7 +4.7 +4.7 +4.7 +4.7 +4.7 +4.8 +4.8 +4.8 +4.8 +4.8 +4.8 +4.8 +4.8 +3.9 +3.9 +5 diff --git a/r4babs4/week-1/data-raw/neuron.txt b/r4babs4/week-1/data-raw/neuron.txt new file mode 100644 index 00000000..35c3f146 --- /dev/null +++ b/r4babs4/week-1/data-raw/neuron.txt @@ -0,0 +1,9 @@ +csa +160.6 +129.7 +153.7 +128.1 +137.4 +137.7 +149.6 +142 diff --git a/r4babs4/week-1/images/answer.png b/r4babs4/week-1/images/answer.png new file mode 100644 index 00000000..77438911 Binary files /dev/null and b/r4babs4/week-1/images/answer.png differ diff --git a/r4babs4/week-1/images/do_in_R.png b/r4babs4/week-1/images/do_in_R.png new file mode 100644 index 00000000..a531b10d Binary files /dev/null and b/r4babs4/week-1/images/do_in_R.png differ diff --git a/r4babs4/week-1/images/do_on_internet.png b/r4babs4/week-1/images/do_on_internet.png new file mode 100644 index 00000000..ec33100b Binary files /dev/null and b/r4babs4/week-1/images/do_on_internet.png differ diff --git a/r4babs4/week-1/images/do_on_your_computer.png b/r4babs4/week-1/images/do_on_your_computer.png new file mode 100644 index 00000000..4726f307 Binary files /dev/null and b/r4babs4/week-1/images/do_on_your_computer.png differ diff --git a/r4babs4/week-1/overview.qmd b/r4babs4/week-1/overview.qmd new file mode 100644 index 00000000..45fa022d --- /dev/null +++ b/r4babs4/week-1/overview.qmd @@ -0,0 +1,51 @@ +--- +title: "Overview" +subtitle: "The logic of hypothesis testing and confidence intervals" +toc: true +toc-location: right +--- + +This week we will cover the logic of consider the logic of hypothesis testing and type 1 and type 2 errors. We will also find out what the sampling distribution of the mean and the standard error are, and how to calculate confidence intervals. + + +![Artwork by @allison_horst: "type 1 error"](images/type-1-error.png){fig-alt="A cute population with a normal distribution with two samples, one which happens to have at mean at the low end of the population and one at the high end. The population is saying 'yes, I'm still sure I birthed both of you'. Because these samples are relatively rare - would happen less than 5% of the time - we would conclude they did not come from this population. That would be making a type 1 error." width="600"} + + +![Artwork by @allison_horst: "type 2 error"](images/type-2-error.png){fig-alt="Two cute populations with a normal distribution with very little overlap. There are two samples, one which happens to have at mean at the low end of one population and and the other at the high end of the other population so that the two samples are close togther. Deciding these two samples were from the same population would be making a type 2 error." width="600"} + + +### Learning objectives + +The successful student will be able to: + +- demonstrate the process of hypothesis testing with an example + +- explain type 1 and type 2 errors + +- define the sampling distribution of the mean and the standard error + +- explain what a confidence interval is + +- calculate confidence intervals for large and small samples + +### Instructions + +1. [Prepare](study_before_workshop.qmd) + + i. 📖 Read The logic of hyothesis testing + ii. 📖 Read Confidence Intervals + + +2. [Workshop](workshop.qmd) + + i. 💻 Remind yourself how to import files + ii. 💻 Calculate confidence intervals on large + iii. 💻 Calculate confidence intervals on small samples. + + +3. [Consolidate](study_after_workshop.qmd) + + i. 💻 Calculate confidence intervals for each group in a data set + + + diff --git a/r4babs4/week-1/study_after_workshop.qmd b/r4babs4/week-1/study_after_workshop.qmd new file mode 100644 index 00000000..4fa3d5ba --- /dev/null +++ b/r4babs4/week-1/study_after_workshop.qmd @@ -0,0 +1,56 @@ +--- +title: "Independent Study to consolidate this week" +subtitle: "The logic of hypothesis testing and confidence intervals" +toc: true +toc-location: right +format: + html: + code-fold: true + code-summary: "Answer - don't look until you have tried!" +--- + +# Set up + +If you have just opened RStudio you will want to load the **`tidyverse`** package + +```{r} +#| code-fold: false +library(tidyverse) +``` + +# Exercises + +1. 💻 Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in [adipocytes.txt](data-raw/adipocytes.txt). Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not. Estimate the mean Adiponectin concentration in each group - this means calculate the sample mean and construct a confidence interval around it for each group. This exercise forces you to bring together ideas from this workshop and from previous workshops + +- How to calculate a confidence intervals (this workshop) +- How to summarise variables in more than one group (previous workshop) + +```{r} +#| output: false + +# data import +adip <- read_table("data-raw/adipocytes.txt") + +# examine the structure +str(adip) + +# summarise +adip_summary <- adip %>% + group_by(treatment) %>% + summarise(mean = mean(adiponectin), + sd = sd(adiponectin), + n = length(adiponectin), + se = sd/sqrt(n), + dif = qt(0.975, df = n - 1) * se, + lower_ci = mean - dif, + uppp_ci = mean + dif) + + +# we conclude we're 95% certain the mean for the control group is +# between 4.73 and 6.36 and the mean for the nicotinic group is +# between 6.52 and 8.50. More usually we might put is like this: +# the mean for the control group is 5.55 +/- 0.82 and that for the nicotinic group is 7.51 +/- 0.99 + + +``` + diff --git a/r4babs4/week-1/study_before_workshop.qmd b/r4babs4/week-1/study_before_workshop.qmd new file mode 100644 index 00000000..73888678 --- /dev/null +++ b/r4babs4/week-1/study_before_workshop.qmd @@ -0,0 +1,12 @@ +--- +title: "Independent Study to prepare for workshop" +subtitle: "The logic of hypothesis testing and confidence intervals" +toc: true +toc-location: right +--- + +1. 📖 Read [The logic of hyothesis testing](https://3mmarand.github.io/comp4biosci/logic_hyopthesis_testing.html) + +2. 📖 Read [Confidence Intervals](https://3mmarand.github.io/comp4biosci/confidence_intervals.html) + + diff --git a/r4babs4/week-1/workshop.qmd b/r4babs4/week-1/workshop.qmd new file mode 100644 index 00000000..f850bb2a --- /dev/null +++ b/r4babs4/week-1/workshop.qmd @@ -0,0 +1,256 @@ +--- +title: "Workshop" +subtitle: "The logic of hypothesis testing and confidence intervals" +toc: true +toc-location: right +--- + +```{r} +#| include: false +library(tidyverse) +library(kableExtra) +``` + +# Introduction + +![Artwork by @allison_horst: "love this class"](images/love-this-class.png){fig-alt="A little monster flying a biplane wearing aviator glasses, pulling a banner that says 'Fully expecting to hate this class.' Below, a teacher wearing a cheerleading outfit labeled 'STATS' with a bullhorn labeled 'CODE' cheering desperately with pom-poms, trying to help students believe stats is actually going to be awesomely life-changing." width="800"} + +## Session overview + +In this session you will remind yourself how to import files, and calculate confidence intervals on large and small samples. + +## Philosophy + +Workshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips + +- don't worry about making mistakes +- don't let what you can not do interfere with what you can do +- discussing code with your neighbours will help +- look things up in the independent study material +- look things up in your own code from earlier +- there are no stupid questions + +::: callout-note +## Key + +These four symbols are used at the beginning of each instruction so you know where to carry out the instruction. + +![](images/do_on_your_computer.png) Something you need to do on your computer. It may be opening programs or documents or locating a file. + +![](images/do_in_R.png) Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files. + +![](images/do_on_internet.png) Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file. + +![](images/answer.png) A question for you to think about and answer. Record your answers in your script for future reference. +::: + +# Getting started + +![](images/do_on_your_computer.png) Start RStudio from the Start menu. + +![](images/do_in_R.png) Go the Files tab in the lower right pane and click on the `...` on the right. This will open a "Go to folder" window. Navigate to a place on your computer where you keep your work. Click Open. + +![](images/do_in_R.png) Also on the Files tab click on `New Folder`. Type "data-analysis-in-r-2" in to the box. This will be the folder that we work in throughout the Data Analysis in R part BABS2. + +![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. Name the RStudio Project 'week-1'. + +![](images/do_in_R.png) Make a new script then save it with a name like analysis.R to carry out the rest of the work. + +![](images/do_in_R.png) Add a comment to the script: `# Introduction to RStudio and your first graph` and load the **`tidyverse`** [@tidyverse] package + +![](images/do_in_R.png) Make a new folder called `data-raw`. + +# Exercises + +## Remind yourself how to import files! + +[Importing data from files](../../r4babs1/week-8/workshop.html#importing-data-from-files) was covered in BABS 1 [@rand2023] if you need to remind yourself. + +## Confidence intervals (large samples) + +The data in [beewing.txt](data-raw/beewing.txt) are left wing widths of 100 honey bees (mm). The confidence interval for large samples is given by: + + +$\bar{x} \pm 1.96 \times s.e.$ + + +Where 1.96 is the quantile for 95% confidence. + +![](images/do_on_your_computer.png) Save [beewing.txt](data-raw/beewing.txt) to your `data-raw` folder. + +![](images/do_in_R.png) Read in the data and check the structure of the resulting dataframe. + +```{r} +#| include: false + +#---CODING ANSWER--- + +bee <- read_table("data-raw/beewing.txt") +str(bee) +``` + +![](images/do_in_R.png) Calculate and assign to variables: the mean, standard deviation and standard error: + +```{r} +# mean +m <- mean(bee$wing) + +# standard deviation +sd <- sd(bee$wing) + +# sample size (needed for the se) +n <- length(bee$wing) + +# standard error +se <- sd / sqrt(n) +``` + +![](images/do_in_R.png) To calculate the 95% confidence interval we need to look up the quantile (multiplier) using `qnorm()` + +```{r} +q <- qnorm(0.975) +``` + +This should be about 1.96. + +![](images/do_in_R.png) Now we can use it in our confidence interval calculation + +```{r} +lcl <- m - q * se +ucl <- m + q * se +``` + +![](images/do_in_R.png) Print the values + +```{r} +lcl +ucl +``` + +This means we are 95% confident the population mean lies between 4.47 mm and 4.63 mm. The usual way of expressing this is that the mean is 4.55 +/- 0.07 mm + +![](images/do_in_R.png) Between what values would you be 99% confident of the population mean being? + +```{r} +#| include: false + +#---CODING ANSWER--- + +# qnorm(0.975) gives the quantile for 95%. For 99% we need qnorm(0.995) +q <- qnorm(0.995) +lcl <- m - q * se +ucl <- m + q * se + +``` + + + + + + +## Confidence intervals (small samples) + +The confidence interval for small samples is given by: + + +$\bar{x} \pm \sf t_{[d.f]} \times s.e.$ + +The only difference between the calculation for small and large sample is the multiple. For large samples we use the "the standard normal distribution" accessed with `qnorm()`; for small samples we use the "*t* distribution" assessed with `qt()`.The value returned by `q(t)` is larger than that returned by `qnorm()` which reflects the greater uncertainty we have on estimations of population means based on small samples. + + +The fatty acid Docosahexaenoic acid (DHA) is a major component of membrane phospholipids in nerve cells and deficiency leads to many behavioural and functional deficits. The cross sectional area of neurons in the CA 1 region of the hippocampus of normal rats is 155 $\mu m^2$. A DHA deficient diet was fed to 8 animals and the cross sectional area (csa) of neurons is given in [neuron.txt](data-raw/neuron.txt) + +![](images/do_on_your_computer.png) Save [neuron.txt](data-raw/neuron.txt) to your `data-raw` folder + +![](images/do_in_R.png) Read in the data and check the structure of the resulting dataframe + +```{r} +#| include: false + +#---CODING ANSWER--- +neur <- read_table("data-raw/neuron.txt") +``` + +![](images/do_in_R.png) Assign the mean to `m`. + +```{r} +#| include: false + +#---CODING ANSWER--- + +m <- mean(neur$csa) + +``` + +![](images/do_in_R.png) Calculate and assign the standard error to `se`. + +```{r} +#| include: false + +#---CODING ANSWER--- + +# I created intermediate variables for sd and n +sd <- sd(neur$csa) +n <- length(neur$csa) +se <- sd / sqrt(n) +``` + +To work out the confidence interval for our sample mean we need to use the *t* distribution because it is a small sample. This means we need to determine the degrees of freedom (the number in the sample minus one). + +![](images/do_in_R.png) We can assign this to a variable, `df`, using: + +```{r} +df <- length(neur$csa) - 1 +``` + +![](images/do_in_R.png) The *t* value is found by: + +```{r} +t <- qt(0.975, df = df) +``` + +Note that we are using `qt()` rather than `qnorm()` but that the probability, 0.975, used is the same. Finally, we need to put our mean, standard error and *t* value in the equation. $\bar{x} \pm \sf t_{[d.f]} \times s.e.$. + +![](images/do_in_R.png) The upper confidence limit is: + +```{r} +(m + t * se) |> round(2) +``` + +The first part of the command, `(m + t * se)` calculates the upper limit. This is 'piped' in to the `round()` function to round the result to two decimal places. + +![](images/do_in_R.png) Calculate the lower confidence limit: + +```{r include=FALSE} +(m - t * se) |> round(2) +``` + +![](images/answer.png) Given the upper and lower confidence values for the estimate of the population mean, what do you think about the effect of the DHA deficient diet? + + + + + + + + + +You're finished! + +# 🥳 Well Done! 🎉 + +![Artwork by @allison_horst: "We belive in you!"](images/we-believe.png){fig-alt="Header text 'R learners' above five friendly monsters holding up signs that together read 'we believe in you.'" width="800"} + +# Independent study following the workshop + +[Consolidate](study_after_workshop.qmd) + +# The Code file + +These contain all the code needed in the workshop even where it is not visible on the webpage. + +The [workshop.qmd](workshop.qmd) file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Rmd file in RStudio. Alternatively, [View in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs2/week-1/workshop.qmd).Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---` + +Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra] + +# References diff --git "a/r4babs4/week-2/images/66I-exp-des-TNF-\316\261.png" "b/r4babs4/week-2/images/66I-exp-des-TNF-\316\261.png" new file mode 100644 index 00000000..8d73a260 Binary files /dev/null and "b/r4babs4/week-2/images/66I-exp-des-TNF-\316\261.png" differ diff --git a/r4babs4/week-2/images/answer.png b/r4babs4/week-2/images/answer.png new file mode 100644 index 00000000..77438911 Binary files /dev/null and b/r4babs4/week-2/images/answer.png differ diff --git a/r4babs4/week-2/images/do_in_R.png b/r4babs4/week-2/images/do_in_R.png new file mode 100644 index 00000000..a531b10d Binary files /dev/null and b/r4babs4/week-2/images/do_in_R.png differ diff --git a/r4babs4/week-2/images/do_on_internet.png b/r4babs4/week-2/images/do_on_internet.png new file mode 100644 index 00000000..ec33100b Binary files /dev/null and b/r4babs4/week-2/images/do_on_internet.png differ diff --git a/r4babs4/week-2/images/do_on_your_computer.png b/r4babs4/week-2/images/do_on_your_computer.png new file mode 100644 index 00000000..4726f307 Binary files /dev/null and b/r4babs4/week-2/images/do_on_your_computer.png differ diff --git a/r4babs4/week-2/overview.qmd b/r4babs4/week-2/overview.qmd new file mode 100644 index 00000000..4915f7d7 --- /dev/null +++ b/r4babs4/week-2/overview.qmd @@ -0,0 +1,41 @@ +--- +title: "Overview" +subitle: "Introduction to statistical models: Single linear regression" +toc: true +toc-location: right +--- + +This week you will be introduced to the idea of a statistical "model" in general and to general linear model in particular. Our first general linear model will be single linear regression which puts a line of best fit through data so the response can be predicted from the explanatory variable. We will consider the two "parameters" estimated by the model (the slope and the intercept) and whether these differ from zero + +### Learning objectives + +The successful student will be able to: + +- explain what is meant by a statistical model and fitting a model + +- know what the general linear model is and how it relates to regression + +- explain the principle of regression and know when it can be applied + +- apply and interpret a simple linear regression in R + +- evaluate whether the assumptions of regression are met + +- scientifically report a regression result including appropriate figures + +### Instructions + +1. [Prepare](study_before_workshop.qmd) + + i. 📖 Read What is a statistical model + ii. 📖 Read Single linear regression + +2. [Workshop](workshop.qmd) + + i.💻 Carry out a single linear regression + +3. [Consolidate](study_after_workshop.qmd) + + i. 💻 Appropriately analyse the relationsip between juvenile hormone and mandible size in stage beetles + ii. 💻 Appropriately analyse the relationsip between anxiety and performance + diff --git a/r4babs4/week-2/study_after_workshop.qmd b/r4babs4/week-2/study_after_workshop.qmd new file mode 100644 index 00000000..1d5799d0 --- /dev/null +++ b/r4babs4/week-2/study_after_workshop.qmd @@ -0,0 +1,114 @@ +--- +title: "Independent Study to consolidate this week" +subtitle: "Introduction to statistical models: Single linear regression" +toc: true +toc-location: right +format: + html: + code-fold: true + code-summary: "Answer - don't look until you have tried!" +--- + +# Set up + +If you have just opened RStudio you will want to load the **`tidyverse`** package + +```{r} +#| code-fold: false +library(tidyverse) +``` + +# Exercises + +1. 💻 Effect of anxiety status and sporting performance. The data in [sprint.txt](data-raw/sprint.txt) are from an investigation of the effect of anxiety status and sporting performance. A group of 40 100m sprinters undertook a psychometric test to measure their anxiety shortly before competing. The data are their anxiety scores and the 100m times achieved. What you do conclude from these data? + + +```{r} +#| output: false + +# this example is designed to emphasise the importance of plotting your data first +sprint <- read_table("data-raw/sprint.txt") +# Anxiety is discrete but ranges from 16 to 402 meaning the gap between possible measures is small and +# the variable could be treated as continuous if needed. Time is a continuous measure that has decimal places and which we would expect to follow a normal distribution + +# explore with a plot +ggplot(sprint, aes(x = anxiety, y = time) ) + + geom_point() + +# A scatterplot of the data clearly reveals that these data are not linear. There is a good relationship between the two variables but since it is not linear, single linear regression is not appropriate. + +``` + + +2. 💻 Juvenile hormone in stag beetles. +The concentration of juvenile hormone in stag beetles is known to influence mandible growth. Groups of stag beetles were injected with different concentrations of juvenile hormone (arbitrary units) and their average mandible size (mm) determined. The experimenters planned to analyse their data with regression. The data are in [stag.txt](data-raw/stag.txt) + +```{r} +#| output: false + +# read the data in and check the structure +stag <- read_table("data-raw/stag.txt") +str(stag) + +# jh is discrete but ordered and has been chosen by the experimenter - it is the explanatory variable. +# the response is mandible size which has decimal places and is something we would expect to be +# normally distributed. So far, common sense suggests the assumptions of regression are met. +``` + +```{r} +#| output: false + +# exploratory plot +ggplot(stag, aes(x = jh, y = mand)) + + geom_point() +# looks linear-ish on the scatter +# regression still seems appropriate +# we will check the other assumptions after we have run the lm +``` + +```{r} +#| output: false + +# build the statistical model +mod <- lm(data = stag, mand ~ jh) + +# examine it +summary(mod) +# mand = 0.032*jh + 0.419 +# the slope of the line is significantly different from zero / the jh explains a significant amount of the variation in mand (ANOVA: F = 16.63; d.f. = 1,14; p = 0.00113). +# the intercept is 0.419 and differs significantly from zero +``` + +```{r} +#| output: false + +# checking the assumption +plot(mod, which = 1) +# we're looking for the variance in the residuals to be equal along the x axis. +# with a small data set there is some apparent heterogeneity but it doesn't look too. +# +hist(mod$residuals) +# We have some skew which again might be partly a result of a small sample size. +shapiro.test(mod$residuals) # the also test not sig diff from normal + +# On balance the use of regression is probably justifiable but it is borderline +# but ideally the experiment would be better if multiple individuals were measure at +# each of the chosen juvenile hormone levels. +``` + +```{r} +#| output: false + +# a better plot +ggplot(stag, aes(x = jh, y = mand) ) + + geom_point() + + geom_smooth(method = lm, se = FALSE, colour = "black") + + scale_x_continuous(name = "Juvenile hormone (arbitrary units)", + expand = c(0, 0), + limits = c(0, 32)) + + scale_y_continuous(name = "Mandible size (mm)", + expand = c(0, 0), + limits = c(0, 2)) + + theme_classic() + +``` \ No newline at end of file diff --git a/r4babs4/week-2/study_before_workshop.qmd b/r4babs4/week-2/study_before_workshop.qmd new file mode 100644 index 00000000..9d42d75d --- /dev/null +++ b/r4babs4/week-2/study_before_workshop.qmd @@ -0,0 +1,14 @@ +--- +title: "Independent Study to prepare for workshop" +subtitle: "Introduction to statistical models: Single linear regression" +toc: true +toc-location: right +--- + +1. 📖 Read [What is a statistical model](https://3mmarand.github.io/comp4biosci/what_statistical_model.html) + +2. 📖 Read [Single linear regression](https://3mmarand.github.io/comp4biosci/single_linear_regression.html) + + + + diff --git a/r4babs4/week-2/workshop.qmd b/r4babs4/week-2/workshop.qmd new file mode 100644 index 00000000..502b8cd4 --- /dev/null +++ b/r4babs4/week-2/workshop.qmd @@ -0,0 +1,386 @@ +--- +title: "Workshop" +subtitle: "Introduction to statistical models: Single linear regression" +toc: true +toc-location: right +--- + +```{r} +#| include: false +library(tidyverse) +library(kableExtra) +``` + +# Introduction + +In this workshop you will get practice in applying, interpreting and reporting single linear regression. + + + +![Artwork by @allison_horst: "linear regression dragons"](images/linear-regression-dragon.png){fig-alt="Two blue dragons stand on scales next to a vertical yardstick showing one slightly taller than the other. Regression estimates are shown at the top as an equation: 'weight (tons) = 2.4 + 0.3*height', with explanatory text reading 'If all other variables are constant, we expect a 1 foot taller dragon to weight 0.3 tons more, on average.'" width="800"} + +## Session overview + +In this session you will carry out, interpret and report on a single linear regression. + +## Philosophy + +Workshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips + +- don't worry about making mistakes +- don't let what you can not do interfere with what you can do +- discussing code with your neighbours will help +- look things up in the independent study material +- look things up in your own code from earlier +- there are no stupid questions + +::: callout-note +## Key + +These four symbols are used at the beginning of each instruction so you know where to carry out the instruction. + +![](images/do_on_your_computer.png) Something you need to do on your computer. It may be opening programs or documents or locating a file. + +![](images/do_in_R.png) Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files. + +![](images/do_on_internet.png) Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file. + +![](images/answer.png) A question for you to think about and answer. Record your answers in your script for future reference. +::: + +# Getting started + +![](images/do_on_your_computer.png) Start RStudio from the Start menu. + +![](images/do_in_R.png) Go the Files tab in the lower right pane and click on the `...` on the right. This will open a "Go to folder" window. Navigate to a place on your computer where you keep your work. Click Open. + +![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. Navigate to the `data-analysis-in-r-2` folder and name the RStudio Project `week-2`. + +![](images/do_in_R.png) Make new folders called `data-raw` and `figures`. You can do this on the Files Pane by clicking New Folder and typing into the box that appears. + +![](images/do_in_R.png) Make a new script then save it with a name like `single-linear-regression.R` to carry out the rest of the work. + +![](images/do_in_R.png) Add a comment to the script: `# Introduction to statistical models: Single linear-regression` and load the **`tidyverse`** [@tidyverse] package + + +# Exercises + +## Linear Regression + +The data in [plant.xlsx](data-raw/plant.xlsx) is a set of observations of plant growth over two months. The researchers planted the seeds and harvested, dried and weighed a plant each day from day 10 so all the data points are independent of each other. + +![](images/do_in_R.png) Save a copy of [plant.xlsx](data-raw/plant.xlsx) to your `data-raw` folder and import it. + +```{r} +#| include: false + +#---CODING ANSWER--- +# +# we need the readxl package that was introduced in the last workshop +library(readxl) + +# assign file name to variable because I'll use it more than once +file <- "data-raw/plant.xlsx" + +# list the sheets in the file +excel_sheets(file) + +# You might want to open th file in excel to make sure you know what is in it. + +# I'm a bit lazy - and less worried by being wrong - so will just guess the are data in the one named sheet. If I guess wrong, then I might open the file! +plant <- read_excel(file, sheet = "plant") + +``` + + +![](images/answer.png) What type of variables do you have? Which is the response and which is the explanatory? What is the null hypothesis? + + + + + + + + + + + +### Exploring + +![](images/do_in_R.png) Do a quick plot of the data: +```{r} +ggplot(plant, aes(x = day, y = mass)) + + geom_point() +``` + +![](images/answer.png) What are the assumptions of linear regression? Do these seem to be met? + + + + + + + + + + + + + + +### Applying, interpreting and reporting + +![](images/do_in_R.png) We now carry out a regression assigning the result of the `lm()` procedure to a variable and examining it with `summary()`. +```{r} +mod <- lm(data = plant, mass ~ day) +summary(mod) +``` + +The Estimates in the Coefficients table give the intercept (first line) and the slope (second line) of the best fitting straight line. The *p*-values on the same line are tests of whether that coefficient is different from zero. + +The *F* value and *p*-value in the last line are a test of whether the model as a whole explains a significant amount of variation in the dependent variable. For a single linear regression this is exactly equivalent to the test of the slope against zero. + +![](images/answer.png) What is the equation of the line? What do you conclude from the analysis? + + + + + + + + + +![](images/answer.png) Does the line go through (0,0)? + + + + + + + +![](images/answer.png) What percentage of variation is explained by the line? + + + + + +It might be useful to assign the slope and the intercept to variables in case we need them later. The can be accessed in the `mod$coefficients` variable: + +```{r} +mod$coefficients +``` + +![](images/do_in_R.png) Assign `mod$coefficients[1]` to `b0` and `mod$coefficients[1]` to `b1`: + +```{r} +b0 <- mod$coefficients[1] |> round(2) +b1 <- mod$coefficients[2] |> round(2) +``` +I also rounded the values to two decimal places. + + + +### Checking assumptions + +We need to examine the residuals. Very conveniently, the object which is created by `lm()` contains a variable called `$residuals`. Also conveniently, the R's `plot()` function can used on the output objects of `lm()`. The assumptions demand that each y is drawn from a normal distribution for each x and these normal distributions have the same variance. Therefore we plot the residuals against the fitted values to see if the variance is the same for all the values of x. +The fitted - predicted - values are the values on the line of best fit. Each residual is the difference between the fitted values and the observed value. + +![](images/do_in_R.png) Plot the model residuals against the fitted values like this: +```{r} +plot(mod, which = 1) +``` + + +![](images/answer.png) What to you conclude? + + + + + + +To examine normality of the model residuals we can plot them as a histogram and do a normality test on them. + +![](images/do_in_R.png) Plot a histogram of the residuals: +```{r} + +ggplot(mapping = aes(x = mod$residuals)) + + geom_histogram(bins = 10) +``` + + + +![](images/do_in_R.png) Use the `shapiro.test()` to test the normality of the model residuals +```{r} +shapiro.test(mod$residuals) +``` + +Usually, when we are doing statistical tests we would like the the test to be significant because it means we have evidence of a biological effect. However, when doing normality tests we hope it will not be significant. A non-significant result means that there is no significant difference between the distribution of the residuals and a normal distribution and that indicates the assumptions are met. + +![](images/answer.png) What to you conclude? + + + + + + + + + +### Illustrating + +We want a figure with the points and the statistical model, i.e., the best fitting straight line. + +![](images/do_in_R.png) Create a scatter plot using `geom_point()` + + +```{r} +ggplot(plant, aes(x = day, y = mass)) + + geom_point() + + theme_classic() +``` + + ![](images/do_in_R.png) The `geom_smooth()` function will had a variety of fitted lines to a plot. We want a line so we need to specify `method = "lm"`: + +```{r} +ggplot(plant, aes(x = day, y = mass)) + + geom_point() + + geom_smooth(method = lm, + se = FALSE, + colour = "black") + + theme_classic() +``` + +![](images/do_in_R.png) What do the `se` and `colour` arguments do? Try changing them. + + +![](images/do_in_R.png) Let's add the equation of the line to the figure using `annotate()`: + +```{r} +ggplot(plant, aes(x = day, y = mass)) + + geom_point() + + geom_smooth(method = lm, + se = FALSE, + colour = "black") + + annotate("text", x = 20, y = 110, + label = "mass = 1.61 * day - 8.68") + + theme_classic() + +``` + +We have to tell `annotate()` what type of geom we want - `text` in this case, - where to put it, and the text we want to appear. + +```{r} +#| include: false + +#---BONUS CODING ANSWER--- +# I'm not expecting you to know about this but it's a neat trick. The glue package allows +# us to use R objects in strings so their values are printed. +ggplot(plant, aes(x = day, y = mass)) + + geom_point() + + geom_smooth(method = lm, + se = FALSE, + colour = "black") + + scale_x_continuous(name = "Day", + limits = c(0, 65), + expand = c(0,0)) + + scale_y_continuous(name = "Mass (g)", + limits = c(0, 120), + expand = c(0,0)) + + annotate("text", x = 20, y = 110, + label = glue::glue("mass = {b1} day {b0}")) + + theme_classic() +``` + + +![](images/do_in_R.png) Improve the axes. You may need to refer back +[Changing the axes](../../r4babs1/week-7/workshop.html#changing-the-axes) from the Week 7 workshop in BABS1 [@rand2023] + +```{r} +#| include: false + +#---CODING ANSWER--- +# + +ggplot(plant, aes(x = day, y = mass)) + + geom_point() + + geom_smooth(method = lm, + se = FALSE, + colour = "black") + + scale_x_continuous(name = "Day", + limits = c(0, 65), + expand = c(0,0)) + + scale_y_continuous(name = "Mass (g)", + limits = c(0, 120), + expand = c(0,0)) + + annotate("text", x = 20, y = 110, + label = "mass = 1.61 * day - 8.68") + + theme_classic() + + +# note that we need to use the continuous versions of both scale_ +``` + + +![](images/do_in_R.png) Save your figure to your `figures` folder. + +```{r} +#| include: false + +#---CODING ANSWER--- + +# Assign the figure to a variable +fig2 <- ggplot(plant, aes(x = day, y = mass)) + + geom_point() + + geom_smooth(method = lm, + se = FALSE, + colour = "black") + + scale_x_continuous(name = "Day", + limits = c(0, 65), + expand = c(0,0)) + + scale_y_continuous(name = "Mass (g)", + limits = c(0, 120), + expand = c(0,0)) + + annotate("text", x = 20, y = 110, + label = "mass = 1.61 * day - 8.68") + + theme_classic() + +``` + + +```{r} +#| include: false + +#---CODING ANSWER--- + +# save figure to figures/plant-growth.png +ggsave("figures/plant-growth.png", + plot = fig2, + width = 3.5, + height = 3.5, + units = "in", + dpi = 300) + +``` + +## Look after future you! + + +You're finished! + +# 🥳 Well Done! 🎉 + +![Artwork by @allison_horst: "length typos"](images/lenght-length.png){fig-alt="Comic panels of an alligator trying to debug some code. First panel: A confident looking alligator gets an error message. Second panel: a few minutes later, the error remains and the alligator is looking carefully at their code. Third panel: 10 minutes after that, the error remains and the alligator is giving a frustrated 'RAAAR' while desperately typing. Fourth panel: The error remains, and the alligator looks exhausted and exasperated, and a thought bubble reads 'maybe it's a bug'. Fifth panel: A friendly flamingo comes over to take a look, and reads aloud from the problematic code a spelling error: 'L-E-N-G-H-T'. Only the tail of the alligator is visible as it stomp stomp stomps out of the panel roaring." width="800"} + +# Independent study following the workshop + +[Consolidate](study_after_workshop.qmd) + +# The Code file + +These contain all the code needed in the workshop even where it is not visible on the webpage. + +The [workshop.qmd](workshop.qmd) file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Rmd file in RStudio. Alternatively, [View in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs2/week-2/workshop.qmd).Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---` + +Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra] + +# References diff --git a/r4babs4/week-3/images/answer.png b/r4babs4/week-3/images/answer.png new file mode 100644 index 00000000..77438911 Binary files /dev/null and b/r4babs4/week-3/images/answer.png differ diff --git a/r4babs4/week-3/images/do_in_R.png b/r4babs4/week-3/images/do_in_R.png new file mode 100644 index 00000000..a531b10d Binary files /dev/null and b/r4babs4/week-3/images/do_in_R.png differ diff --git a/r4babs4/week-3/images/do_on_internet.png b/r4babs4/week-3/images/do_on_internet.png new file mode 100644 index 00000000..ec33100b Binary files /dev/null and b/r4babs4/week-3/images/do_on_internet.png differ diff --git a/r4babs4/week-3/images/do_on_your_computer.png b/r4babs4/week-3/images/do_on_your_computer.png new file mode 100644 index 00000000..4726f307 Binary files /dev/null and b/r4babs4/week-3/images/do_on_your_computer.png differ diff --git a/r4babs4/week-3/overview.qmd b/r4babs4/week-3/overview.qmd new file mode 100644 index 00000000..afe5aa1f --- /dev/null +++ b/r4babs4/week-3/overview.qmd @@ -0,0 +1,47 @@ +--- +title: "Overview" +subtitle: "Two-sample tests" +toc: true +toc-location: right +--- + +This week you will how to use and interpret the general linear model when the *x* variable is categorical and has two groups. Just as with single linear regression, the model puts a line of best through data and the model parameters, the intercept and the slope, have the same in interpretation The intercept is one of the group means and the slope is the difference between that, mean and the other group mean. You will also learn about the non-parametric equivalents - the tests we use when the assumptions of the general linear model are not met. + + +### Learning objectives + +The successful student will be able to: + +- understand the principles of two-sample tests + +- appreciate that two-sample tests with `lm()` are based on the normal distribution +and thus have assumptions + +- appropriately select parametric and non-parametric two-sample tests + +- appropriately select paired and and unpaired two-sample tests + +- apply and interpret `lm()`and `wilcox.test()` + +- evaluate whether the assumptions of `lm()` are met + +- scientifically report a two-sample test result including appropriate figures + +### Instructions + +1. [Prepare](study_before_workshop.qmd) + + i. 📖 Read Two-Sample tests + + +2. [Workshop](workshop.qmd) + + i. 💻 Parametric two-sample test + ii. 💻 Non-parametric two-sample test + iii. 💻 Parametric paired-sample test + +3. [Consolidate](study_after_workshop.qmd) + + i. 💻 Appropriately test whether a genetic modification was successful in increasing omega 3 fatty acids in *Cannabis sativa*. + + ii. 💻 .... diff --git a/r4babs4/week-3/study_after_workshop.qmd b/r4babs4/week-3/study_after_workshop.qmd new file mode 100644 index 00000000..ea6921da --- /dev/null +++ b/r4babs4/week-3/study_after_workshop.qmd @@ -0,0 +1,125 @@ +--- +title: "Independent Study to consolidate this week" +subtitle: "Two-sample tests" +toc: true +toc-location: right +format: + html: + code-fold: true + code-summary: "Answer - don't look until you have tried!" +--- + +# Set up + +If you have just opened RStudio you will want to load the **`tidyverse`** package + +```{r} +#| code-fold: false +library(tidyverse) +``` + +# Exercises + +1. 💻 Plant Biotech. Some plant biotechnologists are trying to increase the quantity of omega 3 fatty acids in Cannabis sativa. They have developed a genetically modified line using genes from *Linum usitatissimum* (linseed). They grow 50 wild type and fifty modified plants to maturity, collect the seeds and determine the amount of omega 3 fatty acids. The data are in [csativa.txt](data-raw/csativa.txt). Do you think their modification has been successful? + + +```{r} +#| output: false + +csativa <- read_table("data-raw/csativa.txt") +str(csativa) + +# First realise that this is a two sample test. You have two independent samples +# - there are a total of 100 different plants and the values in one +# group have no relationship to the values in the other. +``` + +```{r} +#| output: false + +# create a rough plot of the data +ggplot(data = csativa, aes(x = plant, y = omega)) + + geom_violin() +# note the modified plants seem to have lower omega! +``` + +```{r} +#| output: false + +# create a summary of the data +csativa_summary <- csativa %>% + group_by(plant) %>% + summarise(mean = mean(omega), + std = sd(omega), + n = length(omega), + se = std/sqrt(n)) +``` + +```{r} +#| output: false + +# The data seem to be continuous so it is likely that a parametric test will be fine +# we will check the other assumptions after we have run the lm + +# build the statistical model +mod <- lm(data = csativa, omega ~ plant) + + +# examine it +summary(mod) +# So there is a significant difference but you need to make sure you know the direction! +# Wild plants have a significantly higher omega 3 content (mean +/- s.e = 56.41 +/- 1.11) +# than modified plants (49.46 +/- 0.82)(t = 5.03; d.f. = 98; p < 0.0001). +``` + +```{r} +#| output: false + +# let's check the assumptions +plot(mod, which = 1) +# we're looking for the variance in the residuals to be the same in both groups. +# This looks OK. Maybe a bit higher in the wild plants (with the higher mean) + +hist(mod$residuals) +shapiro.test(mod$residuals) +# On balance the use of lm() is probably justifiable The variance isn't quite equal +# and the histogram looks a bit off normal but the normality test is NS and the +# effect (in the figure) is clear. +``` + +```{r} +#| output: false + +# A figure +fig1 <- ggplot() + + geom_point(data = csativa, aes(x = plant, y = omega), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50") + + geom_errorbar(data = csativa_summary, + aes(x = plant, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = csativa_summary, + aes(x = plant, ymin = mean, ymax = mean), + width = 0.2) + + scale_x_discrete(name = "Plant type", labels = c("GMO", "WT")) + + scale_y_continuous(name = "Amount of Omega 3 (units)", + expand = c(0, 0), + limits = c(0, 90)) + + annotate("segment", x = 1, xend = 2, + y = 80, yend = 80, + colour = "black") + + annotate("text", x = 1.5, y = 85, + label = expression(italic(p)~"< 0.001")) + + theme_classic() + +# save figure to figures/csativa.png +ggsave("figures/csativa.png", + plot = fig1, + width = 3.5, + height = 3.5, + units = "in", + dpi = 300) + +``` + +1. 💻 another example \ No newline at end of file diff --git a/r4babs4/week-3/study_before_workshop.qmd b/r4babs4/week-3/study_before_workshop.qmd new file mode 100644 index 00000000..584e4153 --- /dev/null +++ b/r4babs4/week-3/study_before_workshop.qmd @@ -0,0 +1,13 @@ +--- +title: "Independent Study to prepare for workshop" +subtitle: "Two-sample tests" +toc: true +toc-location: right +--- + +1. Prepare + + i. 📖 Read [Two-Sample tests](https://3mmarand.github.io/comp4biosci/two_sample_tests.html) + + + diff --git a/r4babs4/week-3/workshop.qmd b/r4babs4/week-3/workshop.qmd new file mode 100644 index 00000000..8c0ed2ed --- /dev/null +++ b/r4babs4/week-3/workshop.qmd @@ -0,0 +1,534 @@ +--- +title: "Workshop" +subtitle: "Two-sample tests" +toc: true +toc-location: right +--- + +```{r} +#| include: false +library(tidyverse) +library(kableExtra) +``` + +# Introduction + +![Artwork by @allison_horst: "How much I think I know about R"](images/how-much-i-know.png){fig-alt="Illustrated line plot of 'How much I think I know about R' on the y-axis, and 'Time' on the x-axis. Along the line are emoji-style faces, showing the non-linear progression of R knowledge over time. At first, a nervous face becomes a happy face early on in learning, then a grimace face at an intermediate peak before a steep decline (with an exhausted face at the local minimum). Then, a determined face charges back up a hill, reaching another peak with a mind-blown face and text annotation 'join R community on twitter' followed by another decline, but this time the faces look happy even though their 'How much I think I know about R' value is declining." width="800"} + +## Session overview + +In this workshop you will get practice in choosing between, performing, and presenting the results of, two-sample tests and their non-parametric equivalents in R. + +## Philosophy + +Workshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips + +- don't worry about making mistakes +- don't let what you can not do interfere with what you can do +- discussing code with your neighbours will help +- look things up in the independent study material +- look things up in your own code from earlier +- there are no stupid questions + +::: callout-note +## Key + +These four symbols are used at the beginning of each instruction so you know where to carry out the instruction. + +![](images/do_on_your_computer.png) Something you need to do on your computer. It may be opening programs or documents or locating a file. + +![](images/do_in_R.png) Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files. + +![](images/do_on_internet.png) Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file. + +![](images/answer.png) A question for you to think about and answer. Record your answers in your script for future reference. +::: + +# Getting started + +![](images/do_on_your_computer.png) Start RStudio from the Start menu. + +![](images/do_in_R.png) Go the Files tab in the lower right pane and click on the `...` on the right. This will open a "Go to folder" window. Navigate to a place on your computer where you keep your work. Click Open. + +![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. Navigate to the `data-analysis-in-r-2` folder and name the RStudio Project `week-3`. + +![](images/do_in_R.png) Make new folders called `data-raw` and `figures`. You can do this on the Files Pane by clicking New Folder and typing into the box that appears. + +![](images/do_in_R.png) Make a new script then save it with a name like `two-sample-tests.R` to carry out the rest of the work. + +![](images/do_in_R.png) Add a comment to the script: `# Two-sample tests` and load the **`tidyverse`** [@tidyverse] package + +# Exercises + +## Adiponectin secretion + +Adiponectin is exclusively secreted from adipose tissue and modulates a number of metabolic processes. Nicotinic acid can affect adiponectin secretion. 3T3-L1 adipocytes were treated with nicotinic acid or with a control treatment and adiponectin concentration (pg/mL) measured. The data are in [adipocytes.txt](data-raw/adipocytes.txt). Each row represents an independent sample of adipocytes and the first column gives the concentration adiponectin and the second column indicates whether they were treated with nicotinic acid or not. + +![](images/do_on_your_computer.png) Save a copy of [adipocytes.txt](data-raw/adipocytes.txt) to `data-raw` + +![](images/do_in_R.png) Read in the data and check the structure. I used the name `adip` for the dataframe/tibble. + +```{r} +#| include: false + +#| #---CODING ANSWER--- +adip <- read_table("data-raw/adipocytes.txt") +str(adip) +``` + +We have a tibble containing two variables: `adiponectin` is the response and is continuous and `treatment` is explanatory. `treatment` is categorical with two levels (groups). The first task is visualise the data to get an overview. For continuous response variables with categorical explanatory variables you could use `geom_point()`, `geom_boxplot()` or a variety of other geoms. I often use `geom_violin()` which allows us to see the distribution - the violin is fatter where there are more data points. + + + +![](images/do_in_R.png) Do a quick plot of the data: +```{r} +ggplot(data = adip, aes(x = treatment, y = adiponectin)) + + geom_violin() +``` + + +### Summarising the data + +Summarising the data for each treatment group is the next sensible step. The most useful summary statistics are the means, standard deviations, sample sizes and standard errors. + +![](images/do_in_R.png) Create a data frame called `adip_summary` that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples. You may need to the [Summarise](../../r4babs1/week-9/workshop.html#summarise) from the Week 9 workshop of BABS1 [@rand2023] + +```{r} +#| include: false + +#| #---CODING ANSWER--- + +adip_summary <- adip %>% + group_by(treatment) %>% + summarise(mean = mean(adiponectin), + std = sd(adiponectin), + n = length(adiponectin), + se = std/sqrt(n)) + +``` + +You should get the following numbers: + +```{r} +#| echo: false +knitr::kable(adip_summary) %>% + kableExtra::kable_styling() +``` + + +### Selecting a test + +![](images/answer.png) Do you think this is a paired-sample test or two-sample test? + + + + + + + + + +### Applying, interpreting and reporting + +![](images/do_in_R.png) Create a two-sample model like this: + +```{r} +mod <- lm(data = adip, + adiponectin ~ treatment) +``` + + +![](images/do_in_R.png) Examine the model with: + +```{r} +summary(mod) +``` + + + +![](images/answer.png) What do you conclude from the test? Write your conclusion in a form suitable for a report. + + + + + + + + + + +### Check assumptions + +The assumptions of the general linear model are that the residuals – the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. To check these we can examine the `mod$residuals` variable. You may want to refer to [Checking assumptions](../week-2/workshop.html#checking-assumptions) in the "Single regression" workshop. + +![](images/do_in_R.png) Plot the model residuals against the fitted values. +```{r} +#| include: false + +#---CODING ANSWER- + +plot(mod, which = 1) +``` + + +![](images/answer.png) What to you conclude? + + + + + + +To examine normality of the model residuals we can plot them as a histogram and do a normality test on them. + +![](images/do_in_R.png) Plot a histogram of the residuals. +```{r} +#| include: false + +#---CODING ANSWER- + +ggplot(mapping = aes(x = mod$residuals)) + + geom_histogram(bins = 10) +``` + + + +![](images/do_in_R.png) Use the `shapiro.test()` to test the normality of the model residuals +```{r} +#| include: false + +#---CODING ANSWER- + +shapiro.test(mod$residuals) +``` + +![](images/answer.png) What to you conclude? + + + + + + + + +### Illustrating + +![](images/do_in_R.png) Create a figure like the one below. You may need to refer to [Visualise](../../r4babs1/week-9/workshop.html#visualise) from the "Summarising data with several variables" workshop [@rand2023] + +```{r} +#| echo: false + +#---CODING ANSWER--- +ggplot() + + geom_point(data = adip, aes(x = treatment, y = adiponectin), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50") + + geom_errorbar(data = adip_summary, + aes(x = treatment, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = adip_summary, + aes(x = treatment, ymin = mean, ymax = mean), + width = 0.2) + + scale_y_continuous(name = "Adiponectin (pg/mL)", + limits = c(0, 12), + expand = c(0, 0)) + + scale_x_discrete(name = "Treatment", + labels = c("Control", "Nicotinic acid")) + + theme_classic() + +``` + + +We now need to annotate the figure with the results from the statistical test. This most commonly done with a line linking the means being compared and the *p*-value. The `annotate()` function can be used to draw the line and then to add the value. The line is a `segment` and the *p*-value is a `text`. + +![](images/do_in_R.png) Add annotation to the figure by adding: +```r +...... + + annotate("segment", x = 1, xend = 2, + y = 11.3, yend = 11.3, + colour = "black") + + annotate("text", x = 1.5, y = 11.7, + label = expression(italic(p)~"= 0.003")) + + theme_classic() +``` + +```{r} +#| echo: false + +ggplot() + + geom_point(data = adip, aes(x = treatment, y = adiponectin), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50") + + geom_errorbar(data = adip_summary, + aes(x = treatment, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = adip_summary, + aes(x = treatment, ymin = mean, ymax = mean), + width = 0.2) + + scale_y_continuous(name = "Adiponectin (pg/mL)", + limits = c(0, 12), + expand = c(0, 0)) + + scale_x_discrete(name = "Treatment", + labels = c("Control", "Nicotinic acid")) + + annotate("segment", x = 1, xend = 2, + y = 11.3, yend = 11.3, + colour = "black") + + annotate("text", x = 1.5, y = 11.7, + label = expression(italic(p)~"= 0.003")) + + theme_classic() +``` + + + + +For the segment, `annotate()` needs the *x* and *y* coordinates for the start and the finish of the line. + +The use of `expression()` allows you to specify formatting or special characters. `expression()` takes strings or [LaTeX](https://en.wikipedia.org/wiki/LaTeX) formatting. Each string or piece of LaTeX is separated by a `*` or a `~`. The `*` concatenates the strings without a space, `~` does so with a space. +It will generate a warning message "In is.na(x) : is.na() applied to non-(list or vector) of type 'expression'" which can be ignored. + + + +![](images/do_in_R.png) Save your figure to your figures folder. +```{r} +#| include: false + +#---CODING ANSWER--- + +# Assign the figure to a variable +fig1 <- ggplot() + + geom_point(data = adip, aes(x = treatment, y = adiponectin), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50") + + geom_errorbar(data = adip_summary, + aes(x = treatment, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = adip_summary, + aes(x = treatment, ymin = mean, ymax = mean), + width = 0.2) + + scale_y_continuous(name = "Adiponectin (pg/mL)", + limits = c(0, 12), + expand = c(0, 0)) + + scale_x_discrete(name = "Treatment", + labels = c("Control", "Nicotinic acid")) + + annotate("segment", x = 1, xend = 2, + y = 11.3, yend = 11.3, + colour = "black") + + annotate("text", x = 1.5, y = 11.7, + label = expression(italic(p)~"= 0.003")) + + theme_classic() + +``` + + +```{r} +#| include: false + +#---CODING ANSWER--- + +# save figure to figures/adipo.png +ggsave("figures/adipo.png", + plot = fig1, + width = 3.5, + height = 3.5, + units = "in", + dpi = 300) + +``` + +## Grouse Parasites + +Grouse livers were dissected and the number of individuals of a parasitic nematode were counted for two estates 'Gordon' and 'Moss'. We want to know if the two estates have different infection rates. The data are in [grouse.csv](data-raw/grouse.csv) + + +![](images/do_on_your_computer.png) Save a copy of [grouse.csv](data-raw/grouse.csv) to `data-raw` + +![](images/do_in_R.png) Read in the data and check the structure. I used the name `grouse` for the dataframe/tibble. + +```{r} +#| include: false + +#| #---CODING ANSWER--- +grouse <- read_csv("data-raw/grouse.csv") +str(grouse) +``` + + +### Selecting + +![](images/answer.png) Using your common sense, do these data look normally distributed? + + + + + + + +![](images/answer.png) What test do you suggest? + + + + + + +### Applying, interpreting and reporting + +![](images/do_in_R.png) Summarise the data by finding the median of each group: +```{r} +#| include: false + +# ---CODING ANSWER--- +grouse %>% + group_by(estate) %>% + summarise(median(nematodes)) +``` + + +![](images/do_in_R.png) Carry out a two-sample Wilcoxon test (also known as a Mann-Whitney): +```{r} +wilcox.test(data = grouse, nematodes ~ estate) +``` +![](images/answer.png) What do you conclude from the test? Write your conclusion in a form suitable for a report. + + + + + + + + +### Illustrating + +A box plot is a usually good choice for illustrating a two-sample Wilcoxon test because it shows the median and interquartile range. + +![](images/do_in_R.png) We can create a simple boxplot with: +```{r} +ggplot(data = grouse, aes(x = estate, y = nematodes) ) + + geom_boxplot() +``` + +![](images/do_in_R.png) Annotate and format the figure so it is more suitable for a report and save it to your figures folder. +```{r} +#| include: false + +# ---CODING ANSWER--- + +fig2 <- ggplot(data = grouse, aes(x = estate, y = nematodes) ) + + geom_boxplot() + + scale_x_discrete(name = "Estate", + labels = c("Gordon", "Moss")) + + scale_y_continuous(name = "Number of nematodes", + expand = c(0, 0), + limits = c(0, 75)) + + annotate("segment", x = 1, xend = 2, + y = 67, yend = 67, + colour = "black") + + annotate("text", x = 1.5, y = 69, + label = expression(italic(p)~"= 0.035")) + + theme_classic() + +# save figure to figures/grouse.png +ggsave("figures/grouse.png", + plot = fig2, + width = 3, + height = 3, + units = "in", + dpi = 300) +``` + +## Gene Expression + +Bambara groundnut (*Vigna subterranea*) is an African legume with good nutritional value which can be influenced by low temperature stress. Researchers are interested in the expression levels of a particular set of 35 genes (`probe_id`) in response to temperature stress. They measure the expression of the genes at 23 and 18 degrees C (high and low temperature). These samples are **not** independent because we have two measure from one gene. The data are in [expr.xlxs](data-raw/expr.xlsx). + +### Selecting +![](images/answer.png) What is the null hypothesis? + + + + + + + +![](images/do_in_R.png) Save a copy of [expr.xlxs](data-raw/expr.xlsx) and import the data. I named the dataframe `bambara` + +```{r} +#| include: false + +#---CODING ANSWER--- + +# we need the readxl package +library(readxl) +bambara <- read_excel("data-raw/expr.xlsx") +``` + +![](images/answer.png) What is the appropriate parametric test? + + + + + + +### Applying, interpreting and reporting + +A paired test requires us to test whether the difference in expression between high and low temperatures is zero on average. One handy way to achieve this is to organise our groups into two columns. The `pivot_wider()` function will do this for us. We need to tell it what column gives the identifiers (i.e., matches the the pairs) - the probe_ids in this case. We also need to say which variable contains what will become the column names and which contains the values. + +![](images/do_in_R.png) Pivot the data so there is a column for each temperature: +```{r} +bambara <- bambara |> + pivot_wider(names_from = temperature, + values_from = expression, + id_cols = probe_id) +``` + +![](images/do_in_R.png) Click on the `bambara` dataframe in the environment to open a view of it so that you understand what `pivot_wider()` has done. + + +![](images/do_in_R.png) Create a paired-sample model like this: + +```{r} +mod <- lm(data = bambara, + highert - lowert ~ 1) +``` + +Since we have done `highert - lowert`, the "(Intercept) Estimate" will be the average of the higher temperature expression minus the lower temperature expression for each gene. + +![](images/do_in_R.png) Examine the model with: + +```{r} +summary(mod) +``` + + +![](images/answer.png) State your conclusion from the test in a form suitable for including in a report. Make sure you give the direction of any significant effect. + + + + + + + + +## Look after future you! + +The code required to summarise, test, and plot data for any two-sample test AND for any for any one-way ANOVA is **exactly the same** except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment it your code so that you can make use of it next week. + +![](images/future_you.png){fig-alt="xxx" width="300"} + + +You're finished! + +# 🥳 Well Done! 🎉 + + + +# Independent study following the workshop + +[Consolidate](study_after_workshop.qmd) + +# The Code file + +These contain all the code needed in the workshop even where it is not visible on the webpage. + +The [workshop.qmd](workshop.qmd) file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Rmd file in RStudio. Alternatively, [View in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs2/week-3/workshop.qmd).Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---` + +Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra] + +# References diff --git a/r4babs4/week-4/data-raw/ECOLIGreen_ISOTYPE.fcs b/r4babs4/week-4/data-raw/ECOLIGreen_ISOTYPE.fcs new file mode 100644 index 00000000..30d214e1 Binary files /dev/null and b/r4babs4/week-4/data-raw/ECOLIGreen_ISOTYPE.fcs differ diff --git a/r4babs4/week-4/data-raw/ECOLIGreen_TNFAPC.fcs b/r4babs4/week-4/data-raw/ECOLIGreen_TNFAPC.fcs new file mode 100644 index 00000000..b3b139d5 Binary files /dev/null and b/r4babs4/week-4/data-raw/ECOLIGreen_TNFAPC.fcs differ diff --git a/r4babs4/week-4/data-raw/LPS_ISOTYPE.fcs b/r4babs4/week-4/data-raw/LPS_ISOTYPE.fcs new file mode 100644 index 00000000..efa48112 Binary files /dev/null and b/r4babs4/week-4/data-raw/LPS_ISOTYPE.fcs differ diff --git a/r4babs4/week-4/data-raw/LPS_TNFAPC.fcs b/r4babs4/week-4/data-raw/LPS_TNFAPC.fcs new file mode 100644 index 00000000..7d5d78cf Binary files /dev/null and b/r4babs4/week-4/data-raw/LPS_TNFAPC.fcs differ diff --git a/r4babs4/week-4/data-raw/MEDIA_ISOTYPE.fcs b/r4babs4/week-4/data-raw/MEDIA_ISOTYPE.fcs new file mode 100644 index 00000000..9613fa78 Binary files /dev/null and b/r4babs4/week-4/data-raw/MEDIA_ISOTYPE.fcs differ diff --git a/r4babs4/week-4/data-raw/MEDIA_TNFAPC.fcs b/r4babs4/week-4/data-raw/MEDIA_TNFAPC.fcs new file mode 100644 index 00000000..9d0b4ce1 Binary files /dev/null and b/r4babs4/week-4/data-raw/MEDIA_TNFAPC.fcs differ diff --git a/r4babs4/week-4/images/answer.png b/r4babs4/week-4/images/answer.png new file mode 100644 index 00000000..77438911 Binary files /dev/null and b/r4babs4/week-4/images/answer.png differ diff --git a/r4babs4/week-4/images/do_in_R.png b/r4babs4/week-4/images/do_in_R.png new file mode 100644 index 00000000..a531b10d Binary files /dev/null and b/r4babs4/week-4/images/do_in_R.png differ diff --git a/r4babs4/week-4/images/do_on_internet.png b/r4babs4/week-4/images/do_on_internet.png new file mode 100644 index 00000000..ec33100b Binary files /dev/null and b/r4babs4/week-4/images/do_on_internet.png differ diff --git a/r4babs4/week-4/images/do_on_your_computer.png b/r4babs4/week-4/images/do_on_your_computer.png new file mode 100644 index 00000000..4726f307 Binary files /dev/null and b/r4babs4/week-4/images/do_on_your_computer.png differ diff --git a/r4babs4/week-4/overview.qmd b/r4babs4/week-4/overview.qmd new file mode 100644 index 00000000..0eaaf271 --- /dev/null +++ b/r4babs4/week-4/overview.qmd @@ -0,0 +1,42 @@ +--- +title: "Overview" +subtitle: "One-way ANOVA and Kruskal-Wallis" +toc: true +toc-location: right + +--- + +Last week you learnt how to use and interpret the general linear model when the *x* variable was categorical with two groups. You will now extend that to situations when there are more than two groups. This is often known as the one-way ANOVA (**an**alysis **o**f **var**iance). You will also learn about the Kruskal- Wallis test which can be used when the assumptions of the general linear model are not met. + +### Learning objectives + +The successful student will be able to: + +- explain the rationale behind ANOVA understand the meaning of the F values + +- select, appropriately, one-way ANOVA and Kruskal-Wallis + +- know what functions are used in R to run these tests and how to interpret them + +- evaluate whether the assumptions of `lm()` are met + +- scientifically report the results of these tests including appropriate figures + + +### Instructions + +1. [Prepare](study_before_workshop.qmd) + + i. 📖 Read One-way ANOVA and Kruskal-Wallis + + +2. [Workshop](workshop.qmd) + + i. 💻 One-way ANOVA + ii. 💻 Kruskal-Wallis + +3. [Consolidate](study_after_workshop.qmd) + + i. 💻 Appropriately test if fitness and acclimation effect the sodium content of sweat + ii. 💻 Appropriately test if insecticides vary in their effectiveness + diff --git a/r4babs4/week-4/study_after_workshop.qmd b/r4babs4/week-4/study_after_workshop.qmd new file mode 100644 index 00000000..b82a8aac --- /dev/null +++ b/r4babs4/week-4/study_after_workshop.qmd @@ -0,0 +1,279 @@ +--- +title: "Independent Study to consolidate this week" +subtitle: "One-way ANOVA and Kruskal-Wallis" +toc: true +toc-location: right +format: + html: + code-fold: true + code-summary: "Answer - don't look until you have tried!" +--- + +# Set up + +If you have just opened RStudio you will want to load the **`tidyverse`** package + +```{r} +#| code-fold: false +library(tidyverse) +``` + +# Exercises + +1. 💻 Sports scientists were investigating the effects of fitness and heat acclimatisation on the sodium content of sweat. They measured the sodium content of the sweat (μmoll^−1) of three groups of individuals: unfit and unacclimatised (UU); fit and unacclimatised(FU); and fit and acclimatised (FA). The are in [sweat.txt](data-raw/sweat.txt). Is there a difference between the groups in the sodium content of their sweat? + + +```{r} +#| output: false + +# read in the data and look at structure +sweat <- read_table("data-raw/sweat.txt") +str(sweat) +``` + + +```{r} +#| output: false + +# quick plot of the data +ggplot(data = sweat, aes(x = gp, y = na)) + + geom_boxplot() +# Since the sample sizes are small and not the same in each group and the +# variance in the FA gp looks a bit lower, I'm leaning to a non-parametric test K-W. +# However, don't panic if you decided to do an anova +``` + + +```{r} +#| output: false +# calculate some summary stats +sweat_summary <- sweat %>% + group_by(gp) %>% + summarise(mean = mean(na), + n = length(na), + median = median(na)) + +``` + +```{r} +#| output: false + +# Kruskal-Wallis +kruskal.test(data = sweat, na ~ gp) +# We can say there is a difference between the groups in the sodium +# content of their sweat (chi-squared = 11.9802, df = 2, p-value = 0.002503). +# Unfit and unacclimatised people have most salty sweat, +# Fit and acclimatised people the least salty. + +``` + +```{r} +#| output: false + +# a post-hoc test to see where the sig differences lie: +library(FSA) +dunnTest(data = sweat, na ~ gp) +# Fit and acclimatised people (median = 49.5 μmoll^−1) have significantly less sodium in their +# sweat than the unfit and unacclimatised people (70 μmoll^−1) +# (Kruskal-Wallis multiple comparison p-values adjusted with the Holm method: p = 0.0026). +# Fit and unacclimatised (54 μmoll^−1) also have significantly less sodium in their +# people have sodium concentrations than unfit and unacclimatised people (p = 0.033). +# There was no difference between the Fit and unacclimatised and the Fit and acclimatised. See figure 1. +``` + + +```{r} +#| output: false +ggplot(sweat, aes(x = gp, y = na) ) + + geom_boxplot() + + scale_x_discrete(labels = c("Fit Acclimatised", + "Fit Unacclimatised", + "Unfit Unacclimatised"), + name = "Group") + + scale_y_continuous(limits = c(0, 110), + expand = c(0, 0), + name = expression("Sodium"~mu*"mol"*l^{-1})) + + annotate("segment", x = 1, xend = 3, + y = 100, yend = 100, + colour = "black") + + annotate("text", x = 2, y = 103, + label = expression(italic(p)~"= 0.0026")) + + annotate("segment", x = 2, xend = 3, + y = 90, yend = 90, + colour = "black") + + annotate("text", x = 2.5, y = 93, + label = expression(italic(p)~"= 0.0340")) + + theme_classic() + +#Figure 1. Sodium content of sweat for three groups: Fit and acclimatised +#(FA), Fit and unacclimatised (FU) and Unfit and unacclimatised (UU). Heavy lines +#indicate the median, boxes the interquartile range and whiskers the range. + + +``` + + + + +2. 💻 The data are given in [biomass.txt](data-raw/biomass.txt) are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Do the insecticides vary in their effectiveness? What advice would you give to a person: - currently using insecticide E? - trying to choose between A and D? - trying to choose between C and B? + +```{r} +#| output: false + +biom <- read_table("data-raw/biomass.txt") +# The data are organised with an insecticide treatment group in +# each column. +``` + +```{r} +#| output: false + +#Put the data into tidy format. + +biom <- biom |> + pivot_longer(cols = everything(), + names_to = "spray", + values_to = "biomass") + +``` + +```{r} +#| output: false + +# quick plot of the data +ggplot(data = biom, aes(x = spray, y = biomass)) + + geom_boxplot() +# Looks like there is a difference between sprays. E doesn't look very effective. +``` + +```{r} +#| output: false + +# summary statistics +biom_summary <- biom %>% + group_by(spray) %>% + summarise(mean = mean(biomass), + median = median(biomass), + sd = sd(biomass), + n = length(biomass), + se = sd / sqrt(n)) +# thoughts so far: the sample sizes are equal, 10 is a smallish but +# reasonable sample size +# the means and medians are similar to each other (expected for +# normally distributed data), A has a smaller variance + +# We have one explanatory variable, "spray" comprising 6 levels +# Biomass has decimal places and we would expect such data to be +# normally distributed therefore one-way ANOVA is the desired test +# - we will check the assumptions after building the model +``` + +```{r} +#| output: false + +# arry out an ANOVA and examine the results +mod <- lm(data = biom, biomass ~ spray) +summary(mod) +# spray type does have an effect F-statistic: 26.46 on 5 and 54 DF, p-value: 2.081e-13 +``` + +```{r} +#| output: false +# Carry out the post-hoc test +library(emmeans) + +emmeans(mod, ~ spray) |> pairs() + +# the signifcant comparisons are: +# contrast estimate SE df t.ratio p.value +# A - D -76.50 21.9 54 -3.489 0.0119 +# A - E -175.51 21.9 54 -8.005 <.0001 +# A - WaterControl -175.91 21.9 54 -8.024 <.0001 +# B - E -154.32 21.9 54 -7.039 <.0001 +# B - WaterControl -154.72 21.9 54 -7.057 <.0001 +# C - E -155.71 21.9 54 -7.102 <.0001 +# C - WaterControl -156.11 21.9 54 -7.120 <.0001 +# D - E -99.01 21.9 54 -4.516 0.0005 +# D - WaterControl -99.41 21.9 54 -4.534 0.0004 +# All sprays are better than the water control except E. +# This is probably the most important result. +# What advice would you give to a person currently using insecticide E? +# Don't bother!! It's no better than water. Switch to any of +# the other sprays +# What advice would you give to a person currently +# + trying to choose between A and D? Choose A because A has sig lower +# insect biomass than D +# + trying to choose between C and B? It doesn't matter because there is +# no difference in insect biomass. Use other criteria to chose (e.g., price) +# We might report this like: +# There is a very highly significant effect of spray type on pest +# biomass (F = 26.5; d.f., 5, 54; p < 0.001). Post-hoc testing +# showed E was no more effective than the control; A, C and B were +# all better than the control but could be equally as good as each +# other; D would be a better choice than the control or E but +# worse than A. See figure 1 + +``` + +```{r} +#| output: false + +# I reordered the bars to make is easier for me to annotate with +# I also used * to indicate significance + +ggplot() + + geom_point(data = biom, aes(x = reorder(spray, biomass), y = biomass), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50") + + geom_errorbar(data = biom_summary, + aes(x = spray, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = biom_summary, + aes(x = spray, ymin = mean, ymax = mean), + width = 0.2) + + scale_y_continuous(name = "Pest Biomass (units)", + limits = c(0, 540), + expand = c(0, 0)) + + scale_x_discrete("Spray treatment") + + # E and control are one group + annotate("segment", x = 4.5, xend = 6.5, + y = 397, yend = 397, + colour = "black", linewidth = 1) + + annotate("text", x = 5.5, y = 385, + label = "N.S", size = 4) + + # WaterControl-D and E-D *** + annotate("segment", x = 4, xend = 5.5, + y = 410, yend = 410, + colour = "black") + + annotate("text", x = 4.5, y = 420, + label = "***", size = 5) + + # WaterControl-B *** + annotate("segment", x = 3, xend = 5.5, + y = 440, yend = 440, + colour = "black") + + annotate("text", x = 4, y = 450, + label = "***", size = 5) + + # WaterControl-C *** + annotate("segment", x = 2, xend = 5.5, + y = 475, yend = 475, + colour = "black") + + annotate("text", x = 3.5, y = 485, + label = "***", size = 5) + + # WaterControl-A *** + annotate("segment", x = 1, xend = 5.5, + y = 510, yend = 510, + colour = "black") + + annotate("text", x = 3.5, y = 520, + label = "***", size = 5) + +# A-D *** + annotate("segment", x = 1, xend = 4, + y = 330, yend = 330, + colour = "black") + + annotate("text", x = 2.5, y = 335, + label = "*", size = 5) + + theme_classic() + +# Figure 1. The mean pest biomass following various insecticide treatments. +# Error bars are +/- 1 S.E. Significant comparisons are indicated: * is p < 0.05, ** p < 0.01 and *** is p < 0.001 + +``` \ No newline at end of file diff --git a/r4babs4/week-4/study_before_workshop.qmd b/r4babs4/week-4/study_before_workshop.qmd new file mode 100644 index 00000000..eaaa22fc --- /dev/null +++ b/r4babs4/week-4/study_before_workshop.qmd @@ -0,0 +1,13 @@ +--- +title: "Independent Study to prepare for workshop" +subtitle: "One-way ANOVA and Kruskal-Wallis" +toc: true +toc-location: right +--- + +1. Prepare + + i. 📖 Read [One-way ANOVA and Kruskal-Wallis](https://3mmarand.github.io/comp4biosci/one_way_anova_and_kw.html) + + + diff --git a/r4babs4/week-4/workshop.qmd b/r4babs4/week-4/workshop.qmd new file mode 100644 index 00000000..62f974b1 --- /dev/null +++ b/r4babs4/week-4/workshop.qmd @@ -0,0 +1,541 @@ +--- +title: "Workshop" +subtitle: "One-way ANOVA and Kruskal-Wallis" +toc: true +toc-location: right +editor: + markdown: + wrap: sentence +--- + +```{r} +#| include: false +library(tidyverse) +library(kableExtra) +``` + +# Introduction + +![Artwork by @allison_horst: "Debugging and feelings"](images/debugging.png){fig-alt="A cartoon of a fuzzy round monster face showing 10 different emotions experienced during the process of debugging code. The progression goes from (1) 'I got this' - looking determined and optimistic; (2) 'Huh. Really thought that was it.' - looking a bit baffled; (3) '...' - looking up at the ceiling in thought; (4) 'Fine. Restarting.' - looking a bit annoyed; (5) 'OH WTF.' Looking very frazzled and frustrated; (6) 'Zombie meltdown.' - looking like a full meltdown; (7) (blank) - sleeping; (8) 'A NEW HOPE!' - a happy looking monster with a lightbulb above; (9) 'insert awesome theme song' - looking determined and typing away; (10) 'I love coding' - arms raised in victory with a big smile, with confetti falling." width="800"} + +## Session overview + +In this session you will get practice in choosing between, performing, and presenting the results of, one-way ANOVA and Kruskal-Wallis in R. + +## Philosophy + +Workshops are not a test. +It is expected that you often don't know how to start, make a lot of mistakes and need help. +It is expected that you are familiar with independent study content before the workshop. +However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. +Tips + +- don't worry about making mistakes +- don't let what you can not do interfere with what you can do +- discussing code with your neighbours will help +- look things up in the independent study material +- look things up in your own code from earlier +- there are no stupid questions + +::: callout-note +## Key + +These four symbols are used at the beginning of each instruction so you know where to carry out the instruction. + +![](images/do_on_your_computer.png) Something you need to do on your computer. +It may be opening programs or documents or locating a file. + +![](images/do_in_R.png) Something you should do in RStudio. +It will often be typing a command or using the menus but might also be creating folders, locating or moving files. + +![](images/do_on_internet.png) Something you should do in your browser on the internet. +It may be searching for information, going to the VLE or downloading a file. + +![](images/answer.png) A question for you to think about and answer. +Record your answers in your script for future reference. +::: + +# Getting started + +![](images/do_on_your_computer.png) Start RStudio from the Start menu. + +![](images/do_in_R.png) Go the Files tab in the lower right pane and click on the `...` on the right. +This will open a "Go to folder" window. +Navigate to a place on your computer where you keep your work. +Click Open. + +![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. +Navigate to the `data-analysis-in-r-2` folder and name the RStudio Project `week-4`. + +![](images/do_in_R.png) Make new folders called `data-raw` and `figures`. You can do this on the Files Pane by clicking New Folder and typing into the box that appears. + +![](images/do_in_R.png) Make a new script then save it with a name like `one-way-anova-and-kw.R` to carry out the rest of the work. + +![](images/do_in_R.png) Add a comment to the script: `# One-way ANOVA and Kruskal-Wallis` and load the **`tidyverse`** [@tidyverse] package + +# Exercises + +## Myoglobin in seal muscle + +The myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in [seal.csv](data-raw/seal.csv). +We want to know if there is a difference between species. +Each row represents an individual seal. +The first column gives the myoglobin concentration and the second column indicates species. + +![](images/do_on_your_computer.png) Save a copy of the data file [seal.csv](data-raw/seal.csv) to `data-raw` + +![](images/do_in_R.png) Read in the data and check the structure. +I used the name `seal` for the dataframe/tibble. + +```{r} +#| include: false + +#---CODING ANSWER--- + +seal <- read_csv("data-raw/seal.csv") +str(seal) +``` + +![](images/answer.png) What kind of variables do you have? + + + + + + + +### Exploring + +![](images/do_in_R.png) Do a quick plot of the data. You may need to refer to a previous workshop + +```{r} +#| include: false + +#---CODING ANSWER--- + +ggplot(data = seal, aes(x = species, y = myoglobin)) + + geom_violin() +``` + +### Summarising the data + +Do you remember [Look after future you!](../week-3/workshop.html#look-after-future-you) + +![](images/do_in_R.png) If you followed that tip you'll be able to open that script and whizz through summarising,testing and plotting. + +![](images/do_in_R.png) Create a data frame called `seal_summary` that contains the means, standard deviations, sample sizes and standard errors for each species. + +```{r} +#| include: false + +#---CODING ANSWER--- + +seal_summary <- seal %>% + group_by(species) %>% + summarise(mean = mean(myoglobin), + std = sd(myoglobin), + n = length(myoglobin), + se = std/sqrt(n)) + +``` + +You should get the following numbers: + +```{r} +#| echo: false + +knitr::kable(seal_summary) %>% + kableExtra::kable_styling() +``` + +### Applying, interpreting and reporting + +We can now carry out a one-way ANOVA using the same `lm()` function we used for two-sample tests. + +![](images/do_in_R.png) Carry out an ANOVA and examine the results with: + +```{r} +mod <- lm(data = seal, myoglobin ~ species) +summary(mod) +``` + +Remember: the tilde (`~`) means test the values in `myoglobin` when grouped by the values in `species.` Or explain `myoglobin` with `species` + +![](images/answer.png) What do you conclude so far from the test? +Write your conclusion in a form suitable for a report. + + + + + + + +![](images/answer.png) Can you relate the values under `Estimate` to the means? + + + + + + + + + + + + + + + +The ANOVA is significant but this only tells us that species matters, meaning at least two of the means differ. +To find out which means differ, we need a post-hoc test. +A post-hoc ("after this") test is done after a significant ANOVA test. +There are several possible post-hoc tests and we will be using Tukey's HSD (honestly significant difference) test [@tukey1949] implemented in the **`emmeans`** [@emmeans] package. + +![](images/do_in_R.png) Load the package + +```{r} +library(emmeans) +``` + +![](images/do_in_R.png) Carry out the post-hoc test + +```{r} +emmeans(mod, ~ species) |> pairs() + +``` + +Each row is a comparison between the two means in the 'contrast' column. The 'estimate' column is the difference between those means and the 'p.value' indicates whether that difference is significant. + +A plot can be used to visualise the result of the post-hoc which can be especially useful when there are very many comparisons. + +![](images/do_in_R.png) Plot the results of the post-hoc test: + +```{r} +emmeans(mod, ~ species) |> plot() +``` + +Where the purple bars overlap, there is no significant difference. + +![](images/answer.png) What do you conclude from the test? + + + + + + + +### Check assumptions + +The assumptions of the general linear model are that the residuals -- the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. +To check these we can examine the `mod$residuals` variable. +You may want to refer to [Checking assumptions](../week-2/workshop.html#checking-assumptions) in the "Single regression" workshop. + +![](images/do_in_R.png) Plot the model residuals against the fitted values. + +```{r} +#| include: false + +#---CODING ANSWER- + +plot(mod, which = 1) +``` + +![](images/answer.png) What to you conclude? + + + + + + + +To examine normality of the model residuals we can plot them as a histogram and do a normality test on them. + +![](images/do_in_R.png) Plot a histogram of the residuals. + +```{r} +#| include: false + +#---CODING ANSWER- + +ggplot(mapping = aes(x = mod$residuals)) + + geom_histogram(bins = 10) +``` + +![](images/do_in_R.png) Use the `shapiro.test()` to test the normality of the model residuals + +```{r} +#| include: false + +#---CODING ANSWER- + +shapiro.test(mod$residuals) +``` + +![](images/answer.png) What to you conclude? + + + + + + + + + +### Illustrating + +![](images/do_in_R.png) Create a figure like the one below. +You may need to refer to [Visualise](../../r4babs1/week-9/workshop.html#visualise) from the "Summarising data with several variables" workshop [@rand2023] + +We will again use both our `seal` and `seal_summary` dataframes. + +![](images/do_in_R.png) Create the plot: + +```{r} +#| echo: false + +#---CODING ANSWER--- +ggplot() + + geom_point(data = seal, aes(x = species, y = myoglobin), + position = position_jitter(width = 0.1, height = 0), + colour = "grey50") + + geom_errorbar(data = seal_summary, + aes(x = species, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = seal_summary, + aes(x = species, ymin = mean, ymax = mean), + width = 0.2) + + scale_y_continuous(name = "Myoglobin (g/kg)", + limits = c(0, 80), + expand = c(0, 0)) + + scale_x_discrete(name = "Species") + + annotate("segment", x = 1, xend = 2, + y = 72, yend = 72, + colour = "black") + + annotate("text", x = 1.5, y = 75, + label = expression(italic(p)~"= 0.005")) + + theme_classic() + +``` + +![](images/do_in_R.png) Save your figure to your figures folder. + +```{r} +#| echo: false + +#---CODING ANSWER--- + +# Assign the figure to a variable +fig1 <- ggplot() + + geom_point(data = seal, aes(x = species, y = myoglobin), + position = position_jitter(width = 0.1, height = 0), + colour = "grey50") + + geom_errorbar(data = seal_summary, + aes(x = species, ymin = mean - se, ymax = mean + se), + width = 0.3) + + geom_errorbar(data = seal_summary, + aes(x = species, ymin = mean, ymax = mean), + width = 0.2) + + scale_y_continuous(name = "Myoglobin (g/kg)", + limits = c(0, 80), + expand = c(0, 0)) + + scale_x_discrete(name = "Species") + + annotate("segment", x = 1, xend = 2, + y = 72, yend = 72, + colour = "black") + + annotate("text", x = 1.5, y = 75, + label = expression(italic(p)~"= 0.005")) + + theme_classic() + +``` + +```{r} +#| include: false + +#---CODING ANSWER--- + +# save figure to figures/seal.png +ggsave("figures/seal.png", + plot = fig1, + width = 3.5, + height = 3.5, + units = "in", + dpi = 300) + +``` + +## Leafminers on Birch + +Larvae of the Ambermarked birch leafminer, *Profenusa thomsoni*, feed on the interior leaf tissues of Birch (Betula) species. +They do not normally kill the tree but can weaken it making it susceptible to attack from other species. +Researchers are interested in whether there is a difference in the rates at which white, grey and yellow birch are attacked. +They introduce adult female *P.thomsoni* to a green house containing 30 young trees (ten of each type) and later count the egg laying events on each tree. +The data are in [leaf.txt](data-raw/leaf.txt). + +### Exploring + +![](images/do_in_R.png) Read in the data and check the structure. +I used the name `leaf` for the dataframe/tibble. + +```{r} +#| include: false + +#---CODING ANSWER--- + +leaf <- read_table("data-raw/leaf.txt") +str(leaf) +``` + +![](images/answer.png) What kind of variables do we have? + + + + + + + +![](images/do_in_R.png) Do a quick plot of the data. + +```{r} +#| include: false + +#---CODING ANSWER--- + +ggplot(data = leaf, aes(x = birch, y = eggs)) + + geom_boxplot() +``` + +![](images/answer.png) Using your common sense, do these data look normally distributed? + + + + + +![](images/answer.png) Why is a Kruskal-Wallis appropriate in this case? + + + + + + + + + + + +![](images/do_in_R.png) Calculate the medians, means and sample sizes. + +```{r} +#| include: false + +#---CODING ANSWER--- +leaf %>% + group_by(birch) %>% + summarise(mean = mean(eggs), + median = median(eggs), + n = length(eggs)) +``` + +### Applying, interpreting and reporting + +![](images/do_in_R.png) Carry out a Kruskal-Wallis: + +```{r} +kruskal.test(data = leaf, eggs ~ birch) +``` + +![](images/answer.png) What do you conclude from the test? + + + + + + + +A significant Kruskal-Wallis tells us at least two of the groups differ but where do the differences lie? +The Dunn test is a post-hoc multiple comparison test for a significant Kruskal-Wallis. +It is available in the package **`FSA`** + +![](images/do_in_R.png) Load the package using: + +```{r} +library(FSA) +``` + +![](images/do_in_R.png) Run the post-hoc test with: + +```{r} +dunnTest(data = leaf, eggs ~ birch) +``` + +The `P.adj` column gives *p*-value for the comparison listed in the first column. +`Z` is the test statistic. + +![](images/answer.png) What do you conclude from the test? + + + + + + + +![](images/answer.png) Write up the result is a form suitable for a report. + + + + + + + + + + + + + +### Illustrating + +![](images/do_in_R.png) A box plot is an appropriate choice for illustrating a Kruskal-Wallis. +Can you produce a figure like this? + +```{r} +#| echo: false + +#---CODING ANSWER--- +ggplot(leaf, aes(x = birch, y = eggs) ) + + geom_boxplot() + + scale_x_discrete(name = "Birch") + + scale_y_continuous(name = "Number of eggs", + limits = c(0, 110), + expand = c(0, 0)) + + annotate("segment", x = 2, xend = 3, + y = 100, yend = 100, + colour = "black") + + annotate("text", x = 2.5, y = 104, + label = expression(italic(p)~"= 0.035")) + + theme_classic() +``` + +You're finished! + +# 🥳 Well Done! 🎉 + +![Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst](images/tidy-complete.jpg){fig-alt="Cute fuzzy monsters putting rectangular data tables onto a conveyor belt. Along the conveyor belt line are different automated 'stations' that update the data, reading 'WRANGLE', 'VISUALIZE', and 'MODEL.' A monster at the end of the conveyor belt is carrying away a table that reads 'Complete analysis'." width="800"} + +# Independent study following the workshop + +[Consolidate](study_after_workshop.qmd) + +# The Code file + +These contain all the code needed in the workshop even where it is not visible on the webpage. + +The [workshop.qmd](workshop.qmd) file is the file I use to compile the practical. +Qmd stands for Quarto markdown. +It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. +Right-click on the link and choose Save-As to download. +You will be able to open the Rmd file in RStudio. +Alternatively, [View in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs2/week-4/workshop.qmd).Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---` + +Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra] + +# References diff --git a/r4babs4/week-5/images/answer.png b/r4babs4/week-5/images/answer.png new file mode 100644 index 00000000..77438911 Binary files /dev/null and b/r4babs4/week-5/images/answer.png differ diff --git a/r4babs4/week-5/images/do_in_R.png b/r4babs4/week-5/images/do_in_R.png new file mode 100644 index 00000000..a531b10d Binary files /dev/null and b/r4babs4/week-5/images/do_in_R.png differ diff --git a/r4babs4/week-5/images/do_on_internet.png b/r4babs4/week-5/images/do_on_internet.png new file mode 100644 index 00000000..ec33100b Binary files /dev/null and b/r4babs4/week-5/images/do_on_internet.png differ diff --git a/r4babs4/week-5/images/do_on_your_computer.png b/r4babs4/week-5/images/do_on_your_computer.png new file mode 100644 index 00000000..4726f307 Binary files /dev/null and b/r4babs4/week-5/images/do_on_your_computer.png differ diff --git a/r4babs4/week-5/overview.qmd b/r4babs4/week-5/overview.qmd new file mode 100644 index 00000000..fe70f758 --- /dev/null +++ b/r4babs4/week-5/overview.qmd @@ -0,0 +1,43 @@ +--- +title: "Overview" +subtitle: "Two-way ANOVA" +toc: true +toc-location: right +--- + +This week we will extend of our understanding by learning how to include *two* +categorical explanatory variables in a general linear model. This model is often known as the two-way ANOVA. It has three null hypotheses + +### Learning objectives + +The successful student will be able to: + +- combine dataframes of the same structure + +- select, appropriately, two-way ANOVA + +- apply and interpret `lm()` for a two-way ANOVA + +- evaluate whether the assumptions of `lm()` are met + +- understand the meaning of the interaction term + +- scientifically report a two-way ANOVA result including appropriate figures + + +### Instructions + +1. [Prepare](study_before_workshop.qmd) + + i. 📖 Read Two-way ANOVA + + +2. [Workshop](workshop.qmd) + + i. 💻 Two-way ANOVA Choline deficiency on neuron size + ii. 💻 What to do when the assumptions are not met + +3. [Consolidate](study_after_workshop.qmd) + + i. 💻 Appropriately test if the addition of nitrogen and potassium to a crop influences yield and whether they act independently. + diff --git a/r4babs4/week-5/study_after_workshop.qmd b/r4babs4/week-5/study_after_workshop.qmd new file mode 100644 index 00000000..3e428047 --- /dev/null +++ b/r4babs4/week-5/study_after_workshop.qmd @@ -0,0 +1,32 @@ +--- +title: "Independent Study to consolidate this week" +subtitle: "Two-way ANOVA" +toc: true +toc-location: right +format: + html: + code-fold: true + code-summary: "Answer - don't look until you have tried!" +--- + +# Set up + +If you have just opened RStudio you will want to load the **`tidyverse`** package + +```{r} +#| code-fold: false +library(tidyverse) +``` + +# Exercises + +1. 💻 + + +```{r} +#| output: false + + +``` + +2. 📖 Read [xxx](https://3mmarand.github.io/comp4biosci/xxx.html) \ No newline at end of file diff --git a/r4babs4/week-5/study_before_workshop.qmd b/r4babs4/week-5/study_before_workshop.qmd new file mode 100644 index 00000000..086d9960 --- /dev/null +++ b/r4babs4/week-5/study_before_workshop.qmd @@ -0,0 +1,17 @@ +--- +title: "Independent Study to prepare for workshop" +subtitle: "Two-way ANOVA" +toc: true +toc-location: right +--- + +1. Prepare + + i. 📖 Read [Two-way ANOVA](https://3mmarand.github.io/comp4biosci/two_way_anova.html) + + + + + + + diff --git a/r4babs4/week-5/workshop.qmd b/r4babs4/week-5/workshop.qmd new file mode 100644 index 00000000..9ddd35b9 --- /dev/null +++ b/r4babs4/week-5/workshop.qmd @@ -0,0 +1,622 @@ +--- +title: "Workshop" +subtitle: "Two-way ANOVA" +toc: true +toc-location: right +--- + +```{r} +#| include: false +library(tidyverse) +library(kableExtra) +``` + +# Introduction + +![Artwork by @allison_horst: ](images/dragon-residuals-1.png){fig-alt="A green striped dragon stands on a scale, noting how their weight is actually greater than what a model predicts it would be. Text reads 'Hi, I'm a 5.1 foot tall striped dragon and my weight is 3.9 tons...but my actual weight is 4.2 tons!' A stylized equation below shows that the residual is calculated by 4.2 - 3.9 = 0.3 tons." width="800"} + +## Session overview + +In this workshop you will get practice in applying, interpreting and reporting two-way ANOVA including the interaction term and post-hoc testing. + +## Philosophy + +Workshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips + +- don't worry about making mistakes +- don't let what you can not do interfere with what you can do +- discussing code with your neighbours will help +- look things up in the independent study material +- look things up in your own code from earlier +- there are no stupid questions + +::: callout-note +## Key + +These four symbols are used at the beginning of each instruction so you know where to carry out the instruction. + +![](images/do_on_your_computer.png) Something you need to do on your computer. It may be opening programs or documents or locating a file. + +![](images/do_in_R.png) Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files. + +![](images/do_on_internet.png) Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file. + +![](images/answer.png) A question for you to think about and answer. Record your answers in your script for future reference. +::: + +# Getting started + +![](images/do_on_your_computer.png) Start RStudio from the Start menu. + +![](images/do_in_R.png) Go the Files tab in the lower right pane and click on the `...` on the right. This will open a "Go to folder" window. Navigate to a place on your computer where you keep your work. Click Open. + +![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. Navigate to the `data-analysis-in-r-2` folder and name the RStudio Project `week-5`. + +![](images/do_in_R.png) Make new folders called `data-raw` and `figures`. You can do this on the Files Pane by clicking New Folder and typing into the box that appears. + +![](images/do_in_R.png) Make a new script then save it with a name like `two-way-anova.R` and load the **`tidyverse`** [@tidyverse] package + +![](images/do_in_R.png) Add a comment to the script: `# Two-way ANOVA` + +# Exercises + +## Effect of brain region and choline deficiency on neuron size + +Cognitive performance is influenced by the choline intake in utero. To better understand this phenomenon, pregnant mice were fed a control or choline-deficient diet and their offspring examined. The cross sectional area (CSA) of cholinergic neurons was determined in two brain regions, the MSN and the DB. The data are given in [neuron-csa.xlsx](data-raw/neuron-csa.xlsx) + + +![](images/do_on_your_computer.png) Save a copy of the data file [neuron-csa.xlsx](data-raw/neuron-csa.xlsx) to `data-raw` + +You have [previously read data from an excel file](../r4babs1/week-8/workshop.html#importing-data-from-files). + + +![](images/do_in_R.png) List the the names of the work sheets in the excel workbook. + +```{r} +#| include: false + +#---CODING ANSWER--- + +# load the package +library(readxl) + +excel_sheets("data-raw/neuron-csa.xlsx") +# There are two sheets in the file. Each sheet contains a different brain region + +``` +These data are organised into two worksheets, one for each brain region + + +![](images/do_in_R.png) Read in each sheet. I used the names `db` and `msn` for the two dataframes/tibble. + +```{r} +#| include: false + +#---CODING ANSWER--- + +db <- read_excel("data-raw/neuron-csa.xlsx", sheet = "DB") +msn <- read_excel("data-raw/neuron-csa.xlsx", sheet = "MSN") +``` + + +![](images/do_in_R.png) We have the top half and the bottom half of a data set and can combine these togther with `bind_rows()` + +```{r} +neuron <- bind_rows(db, msn) +``` + +You might want to click on `neuron` in the environment to open the spreadsheet-like view to check it looks how you expect. + +![](images/answer.png) What kind of variables do you have? + + + + + + + + +### Exploring + +When we have a single explanatory variable, it always goes on the *x*-axis. Here we have two explanatory variables: brain region and diet. We can map one of the explanatory variables to the *x*-axis and the other to a aesthetic like colour, shape or fill. + +![](images/do_in_R.png) Do a quick plot of the data: + +```{r} +ggplot(data = neuron, aes(x = BrainRegion, y = CSA, fill = Diet)) + + geom_violin() +``` + +Whether we map BrainRegion to the *x*-axis or the fill does not really matter. It looks as though the cross sectional area of neurons is higher for the control diet than the deficient diet (the average of the read bars is grater than the average of the blue bars). It also looks like there might be a significant interaction between the effects of diet and brain region because the effect of diet seems to be greater in the DB region. + +### Summarising the data + +Just as we needed to incorporate the second explanatory variable in the rough plot, we need to incorporate it into our summary. We do this by adding it to the `group_by()`. + +![](images/do_in_R.png) Create a data frame called `neuron_summary` that contains the means, standard deviations, sample sizes and standard errors for each group: + +```{r} +neuron_summary <- neuron %>% + group_by(BrainRegion, Diet) %>% + summarise(mean = mean(CSA), + std = sd(CSA), + n = length(CSA), + se = std/sqrt(n)) + +``` +You will get a message that you don't need to worry about ` `summarise()` has grouped output by 'BrainRegion'. You can override using the `.groups` argument.>` + +You should get the following numbers: + +```{r} +#| echo: false + +knitr::kable(neuron_summary) %>% + kableExtra::kable_styling() +``` + +### Applying, interpreting and reporting + +We can now carry out a two-way ANOVA using the same `lm()` function we used for two-sample tests and one-way ANOVA. + +![](images/do_in_R.png) Carry out an ANOVA and examine the results with: + +```{r} +mod <- lm(data = neuron, CSA ~ BrainRegion * Diet) +summary(mod) +``` + +Remember: the tilde (`~`) means test the values in `CSA` when grouped by the values in `BrainRegion` and `Diet` Or explain `CSA` with `BrainRegion` and `Diet` + +![](images/answer.png) Can you relate the values under `Estimate` to the means? + + + + + + + + + + + + + + + + + + + +The model of brain region and diet overall explains a significant amount of the variation in the cross sectional area of neurons (`p-value: 0.0002949`). To see which of the three effects are significant we can use the `anova()` function on our model. + +![](images/do_in_R.png) Determine which effects are significant: +```{r} +anova(mod) +``` + +There is a significant effect of brain region (*F* = 10.8; *d.f.* = 1, 36; *p* = 0.002) and diet (*F* = 9.3; *d.f.* = 1, 36; *p* = 0.004) on CSA and these effects interact (*F* = 4.3; *d.f.* = 1, 36; *p* = 0.046) + + +We need a post-hoc test to see which comparisons are significant and can again use then **`emmeans`** [@emmeans] package. + +![](images/do_in_R.png) Load the package + +```{r} +library(emmeans) +``` + +![](images/do_in_R.png) Carry out the post-hoc test + +```{r} +emmeans(mod, ~ BrainRegion * Diet) |> pairs() + +``` + +Each row is a comparison between the two means in the 'contrast' column. The 'estimate' column is the difference between those means and the 'p.value' indicates whether that difference is significant. + +A plot can be used to visualise the result of the post hoc which can be especially useful when there are very many comparisons. + +![](images/do_in_R.png) Plot the results of the post-hoc test: + +```{r} +emmeans(mod, ~ BrainRegion * Diet) |> plot() +``` + +![](images/answer.png) What do you conclude from the test? + + + + + + + + + + + +We might report this result as: + +A choline-deficient diet in pregnant mice significantly decreases the cross sectional area of cholinergic neurons in the DB region of their offspring (*t* = 3.62; *d.f.* = 36; *p* = 0.0048). The cross sectional area of cholinergic neurons in the MSN region are also significantly smaller than those in the DB region (*t* = 3.79; *d.f.* = 36; *p* = 0.0030) but are not reduces by maternal choline-deficiency. + + +### Check assumptions + +The assumptions of the general linear model are that the residuals -- the difference between predicted value (i.e., the group mean) and observed values - are normally distributed and have homogeneous variance. +To check these we can examine the `mod$residuals` variable. +You may want to refer to [Checking assumptions](../week-2/workshop.html#checking-assumptions) in the "Single regression" workshop. + +![](images/do_in_R.png) Plot the model residuals against the fitted values. + +```{r} +#| include: false + +#---CODING ANSWER- + +plot(mod, which = 1) +``` + +![](images/answer.png) What to you conclude? + + + + + + + +To examine normality of the model residuals we can plot them as a histogram and do a normality test on them. + +![](images/do_in_R.png) Plot a histogram of the residuals. + +```{r} +#| include: false + +#---CODING ANSWER- + +ggplot(mapping = aes(x = mod$residuals)) + + geom_histogram(bins = 8) +``` + +![](images/do_in_R.png) Use the `shapiro.test()` to test the normality of the model residuals + +```{r} +#| include: false + +#---CODING ANSWER- + +shapiro.test(mod$residuals) +``` + +![](images/answer.png) What to you conclude? + + + + + + + + + +### Illustrating + +We are going to create a figure like this: + +```{r} +#| echo: false + +ggplot() + + geom_point(data = neuron, + aes(x = BrainRegion, y = CSA, shape = Diet), + position = position_jitterdodge(dodge.width = 1, + jitter.width = 0.3, + jitter.height = 0), + colour = "gray50", + size = 3) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean - se, + ymax = mean + se, + group = Diet), + width = 0.4, + position = position_dodge(width = 1)) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean, + ymax = mean, + group = Diet), + width = 0.3, + linewidth = 1, + position = position_dodge(width = 1) ) + + scale_y_continuous(name = "CSA", + expand = c(0, 0), + limits = c(0, 45)) + + scale_x_discrete(name = "BrainRegion") + + # DB Control - DB Deficient ** + annotate("segment", + x = 0.75, xend = 1.25, + y = 35, yend = 35, + colour = "black") + + annotate("text", + x = 1, y = 36, + label = "**", size = 6) + + # Control.MSN - Control.DB ** + annotate("segment", + x = 0.75, xend = 1.75, + y = 38, yend = 38, + colour = "black") + + annotate("text", x = 1.25, y = 39, + label = "**", size = 6) + + # DB Control - MSN Deficient *** + annotate("segment", + x = 0.75, xend = 2.25, + y = 41, yend = 41, + colour = "black") + + annotate("text", x = 1.5, y = 42, + label = "***", size = 6) + + theme_classic() + + theme(legend.position = c(0.15, 0.15), + legend.background = element_rect(colour = "black")) + +``` + +We will again use both our `neuron` and `neuron_summary` dataframes. + +![](images/do_in_R.png) Try emulating what you did for [one-way ANOVA](../week-4/workshop.html) based on [Visualise](../../r4babs1/week-9/workshop.html#visualise) from the "Summarising data with several variables" workshop [@rand2023]. + +```{r} +ggplot() + + geom_point(data = neuron, + aes(x = BrainRegion, y = CSA), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50", + size = 3) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean - se, + ymax = mean + se), + width = 0.4) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean, + ymax = mean), + width = 0.3, + linewidth = 1) + + scale_y_continuous(name = "CSA", + expand = c(0, 0), + limits = c(0, 45)) + + scale_x_discrete(name = "BrainRegion") + + theme_classic() + +``` + +How can we show the two diets separately? + +![](images/do_in_R.png) We can map the `Diet` variable to the shape aesthetic! +```{r} +ggplot() + + geom_point(data = neuron, + aes(x = BrainRegion, y = CSA, shape = Diet), + position = position_jitter(width = 0.1, height = 0), + colour = "gray50", + size = 3) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean - se, + ymax = mean + se), + width = 0.4) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean, + ymax = mean), + width = 0.3, + linewidth = 1) + + scale_y_continuous(name = "CSA", + expand = c(0, 0), + limits = c(0, 45)) + + scale_x_discrete(name = "BrainRegion") + + theme_classic() + +``` +Oh, that isn't quite what we want! We want the two diets side-by-side, not on top of each other. + +![](images/do_in_R.png) We can achieve that by using setting the `position` argument to `position_jitterdodge()` in the `geom_point()` and to `position_dodge()` in the two `geom_errorbar()`. We also have to specify that the error bars are grouped by `Diet` since they are not otherwise mapped to a shape, colour or fill. + +```{r} +ggplot() + + geom_point(data = neuron, + aes(x = BrainRegion, y = CSA, shape = Diet), + position = position_jitterdodge(dodge.width = 1, + jitter.width = 0.3, + jitter.height = 0), + colour = "gray50", + size = 3) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean - se, + ymax = mean + se, + group = Diet), + width = 0.4, + position = position_dodge(width = 1)) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean, + ymax = mean, + group = Diet), + width = 0.3, + linewidth = 1, + position = position_dodge(width = 1)) + + scale_y_continuous(name = "CSA", + expand = c(0, 0), + limits = c(0, 45)) + + scale_x_discrete(name = "BrainRegion") + + theme_classic() + +``` +![](images/do_in_R.png) Add the annotation of the statistical results + +```{r} +#| include: false + +#---CODING ANSWER--- + +ggplot() + + geom_point(data = neuron, + aes(x = BrainRegion, y = CSA, shape = Diet), + position = position_jitterdodge(dodge.width = 1, + jitter.width = 0.3, + jitter.height = 0), + colour = "gray50", + size = 3) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean - se, + ymax = mean + se, + group = Diet), + width = 0.4, + position = position_dodge(width = 1)) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean, + ymax = mean, + group = Diet), + width = 0.3, + linewidth = 1, + position = position_dodge(width = 1)) + + scale_y_continuous(name = "CSA", + expand = c(0, 0), + limits = c(0, 45)) + + scale_x_discrete(name = "BrainRegion") + + # DB Control - DB Deficient ** + annotate("segment", + x = 0.75, xend = 1.25, + y = 35, yend = 35, + colour = "black") + + annotate("text", + x = 1, y = 36, + label = "**", size = 6) + + # Control.MSN - Control.DB ** + annotate("segment", + x = 0.75, xend = 1.75, + y = 38, yend = 38, + colour = "black") + + annotate("text", x = 1.25, y = 39, + label = "**", size = 6) + + # DB Control - MSN Deficient *** + annotate("segment", + x = 0.75, xend = 2.25, + y = 41, yend = 41, + colour = "black") + + annotate("text", x = 1.5, y = 42, + label = "***", size = 6) + + theme_classic() + +``` +![](images/do_in_R.png) Finally, we can move the legend to a space on the plot area which helps you minimise the width needed like this: + +```{r} +#| eval: false + +...... + + theme(legend.position = c(0.15, 0.15), + legend.background = element_rect(colour = "black")) + +``` + + +![](images/do_in_R.png) Save your figure to your figures folder. + +```{r} +#| echo: false + +#---CODING ANSWER--- + +# Assign the figure to a variable +fig1 <- ggplot() + + geom_point(data = neuron, + aes(x = BrainRegion, y = CSA, shape = Diet), + position = position_jitterdodge(dodge.width = 1, + jitter.width = 0.3, + jitter.height = 0), + colour = "gray50", + size = 3) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean - se, + ymax = mean + se, + group = Diet), + width = 0.4, + position = position_dodge(width = 1)) + + geom_errorbar(data = neuron_summary, + aes(x = BrainRegion, + ymin = mean, + ymax = mean, + group = Diet), + width = 0.3, + linewidth = 1, + position = position_dodge(width = 1) ) + + scale_y_continuous(name = "CSA", + expand = c(0, 0), + limits = c(0, 45)) + + scale_x_discrete(name = "BrainRegion") + + # DB Control - DB Deficient ** + annotate("segment", + x = 0.75, xend = 1.25, + y = 35, yend = 35, + colour = "black") + + annotate("text", + x = 1, y = 36, + label = "**", size = 6) + + # Control.MSN - Control.DB ** + annotate("segment", + x = 0.75, xend = 1.75, + y = 38, yend = 38, + colour = "black") + + annotate("text", x = 1.25, y = 39, + label = "**", size = 6) + + # DB Control - MSN Deficient *** + annotate("segment", + x = 0.75, xend = 2.25, + y = 41, yend = 41, + colour = "black") + + annotate("text", x = 1.5, y = 42, + label = "***", size = 6) + + theme_classic() + + theme(legend.position = c(0.15, 0.15), + legend.background = element_rect(colour = "black")) + +``` + +```{r} +#| include: false + +#---CODING ANSWER--- + +# save figure to figures/seal.png +ggsave("figures/neuron.png", + plot = fig1, + width = 4, + height = 3.5, + units = "in", + dpi = 300) + +``` + +You're finished! + +# 🥳 Well Done! 🎉 + +![Artwork by @allison_horst: ](images/dragon-residuals-2.png){fig-alt="A hoard of multiple different dragons stand on scales, shouting out their residuals based on the model estimates (e.g. 'My residual is 0.2 tons! My residual is -0.4 tons! My residual is 0.9 tons!' A histogram to the right shows the distribution of these residuals, with text 'Check: are residuals normally distributed?'" width="800"} + +# Independent study following the workshop + +[Consolidate](study_after_workshop.qmd) + +# The Code file + +These contain all the code needed in the workshop even where it is not visible on the webpage. + +The [workshop.qmd](workshop.qmd) file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Rmd file in RStudio. Alternatively, [View in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs2/week-5/workshop.qmd).Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---` + +Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra] + +# References diff --git a/r4babs4/week-6/images/answer.png b/r4babs4/week-6/images/answer.png new file mode 100644 index 00000000..77438911 Binary files /dev/null and b/r4babs4/week-6/images/answer.png differ diff --git a/r4babs4/week-6/images/do_in_R.png b/r4babs4/week-6/images/do_in_R.png new file mode 100644 index 00000000..a531b10d Binary files /dev/null and b/r4babs4/week-6/images/do_in_R.png differ diff --git a/r4babs4/week-6/images/do_on_internet.png b/r4babs4/week-6/images/do_on_internet.png new file mode 100644 index 00000000..ec33100b Binary files /dev/null and b/r4babs4/week-6/images/do_on_internet.png differ diff --git a/r4babs4/week-6/images/do_on_your_computer.png b/r4babs4/week-6/images/do_on_your_computer.png new file mode 100644 index 00000000..4726f307 Binary files /dev/null and b/r4babs4/week-6/images/do_on_your_computer.png differ diff --git a/r4babs4/week-6/overview.qmd b/r4babs4/week-6/overview.qmd new file mode 100644 index 00000000..eaa9a5a9 --- /dev/null +++ b/r4babs4/week-6/overview.qmd @@ -0,0 +1,37 @@ +--- +title: "Overview" +subtitle: "Chi-squared tests and correlation" +toc: true +toc-location: right +--- + +This week you will + +### Learning objectives + +The successful student will be able to: + +- + +- + +- + +- + +### Instructions + +1. [Prepare](study_before_workshop.qmd) + + i. 📖 Read the book **OR** 📹 Watch two videos + + +2. [Workshop](workshop.qmd) + + i.💻 + +3. [Consolidate](study_after_workshop.qmd) + + i. 💻 + + ii. 📖 Read diff --git a/r4babs4/week-6/study_after_workshop.qmd b/r4babs4/week-6/study_after_workshop.qmd new file mode 100644 index 00000000..ef9aaee3 --- /dev/null +++ b/r4babs4/week-6/study_after_workshop.qmd @@ -0,0 +1,32 @@ +--- +title: "Independent Study to consolidate this week" +subtitle: "Chi-squared tests and correlation" +toc: true +toc-location: right +format: + html: + code-fold: true + code-summary: "Answer - don't look until you have tried!" +--- + +# Set up + +If you have just opened RStudio you will want to load the **`tidyverse`** package + +```{r} +#| code-fold: false +library(tidyverse) +``` + +# Exercises + +1. 💻 + + +```{r} +#| output: false + + +``` + +2. 📖 Read [xxx](https://3mmarand.github.io/comp4biosci/xxx.html) \ No newline at end of file diff --git a/r4babs4/week-6/study_before_workshop.qmd b/r4babs4/week-6/study_before_workshop.qmd new file mode 100644 index 00000000..6b7fd412 --- /dev/null +++ b/r4babs4/week-6/study_before_workshop.qmd @@ -0,0 +1,16 @@ +--- +title: "Independent Study to prepare for workshop" +subtitle: "Chi-squared tests and correlation" +toc: true +toc-location: right +--- + +1. Prepare + + i. **Either** 📖 Read [xxxxx](https://3mmarand.github.io/comp4biosci/xx.html) in **OR** 📹 Watch + + {{< video https://www.youtube.com/watch?xx >}} + + + + diff --git a/r4babs4/week-6/workshop.qmd b/r4babs4/week-6/workshop.qmd new file mode 100644 index 00000000..b5e686de --- /dev/null +++ b/r4babs4/week-6/workshop.qmd @@ -0,0 +1,83 @@ +--- +title: "Workshop" +subtitle: "Chi-squared tests and correlation" +toc: true +toc-location: right +--- + +```{r} +#| include: false +library(tidyverse) +library(kableExtra) +``` + +# Introduction + +![Artwork by @allison_horst: ](images/xxx.png){fig-alt="xxxxx" width="1000"} + +## Session overview + +In this session you will + +## Philosophy + +Workshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips + +- don't worry about making mistakes +- don't let what you can not do interfere with what you can do +- discussing code with your neighbours will help +- look things up in the independent study material +- look things up in your own code from earlier +- there are no stupid questions + +::: callout-note +## Key + +These four symbols are used at the beginning of each instruction so you know where to carry out the instruction. + +![](images/do_on_your_computer.png) Something you need to do on your computer. It may be opening programs or documents or locating a file. + +![](images/do_in_R.png) Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files. + +![](images/do_on_internet.png) Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file. + +![](images/answer.png) A question for you to think about and answer. Record your answers in your script for future reference. +::: + +# Getting started + +![](images/do_on_your_computer.png) Start RStudio from the Start menu. + +![](images/do_in_R.png) Go the Files tab in the lower right pane and click on the `...` on the right. This will open a "Go to folder" window. Navigate to a place on your computer where you keep your work. Click Open. + +![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. Navigate to the `data-analysis-in-r-2` folder and name the RStudio Project `week-6`. + +![](images/do_in_R.png) Make new folders called `data-raw` and `figures`. You can do this on the Files Pane by clicking New Folder and typing into the box that appears. + +![](images/do_in_R.png) Make a two new scripts called `correlation.R` and `chi-squared-tests.R` to carry out the rest of the work. + +![](images/do_in_R.png) Add a comments to each script such as: `# Correlation` and `#Chi-squared tests` and load the **`tidyverse`** [@tidyverse] package in each + +# Exercises + + + +You're finished! + +# 🥳 Well Done! 🎉 + +![Artwork by @allison_horst: ](images/xxx.png){fig-alt="xxx" width="1000"} + +# Independent study following the workshop + +[Consolidate](study_after_workshop.qmd) + +# The Code file + +These contain all the code needed in the workshop even where it is not visible on the webpage. + +The [workshop.qmd](workshop.qmd) file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. Right-click on the link and choose Save-As to download. You will be able to open the Rmd file in RStudio. Alternatively, [View in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs2/week-1/workshop.qmd).Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---` + +Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra] + +# References diff --git a/update_notes.txt b/update_notes.txt index caf2e98c..39beef80 100644 --- a/update_notes.txt +++ b/update_notes.txt @@ -54,16 +54,21 @@ Group only access to relevant folder done by SAS VLE iframe Week 1 Data Analysis in R for BABS 2 - - +