03-wrangling2.qmd

# Data wrangling II {#sec-wrangling2}

## Intended Learning Outcomes {.unnumbered}

By the end of this chapter, you should be able to:

-   apply familiar data wrangling functions to novel datasets
-   read and interpret error messages
-   realise there are several ways of getting to the results

In this chapter, we will pick up where we left off in @sec-wrangling. We will calculate average scores for two of the questionnaires, address an error mode problem, and finally, join all data objects together. This will finalise our data for the upcoming data visualization sections (@sec-dataviz and @sec-dataviz2).


## [Individual Walkthrough]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}

## Activity 1: Setup

* Go to the project folder we have been using in the last two weeks and double-click on the project icon to **open the project** in RStudio
* Either **Create a new `.Rmd` file** for chapter 3 and save it to your project folder or continue the one from last week. See @sec-rmd if you need some guidance.


## Activity 2: Load in the libraries and read in the data

Today, we will be using `tidyverse` along with the two csv files created at the end of the last chapter: `data_prp_for_ch3.csv` and `qrp_t1.csv`. If you need to download them again for any reason, click on the following links: [data_prp_for_ch3.csv](data/data_prp_for_ch3.csv "download") and [qrp_t1.csv](data/qrp_t1.csv "download").

::: {.callout-note collapse="true" icon="false"}

## Hint

```{r eval=FALSE}
library(???)
data_prp <- read_csv("???")
qrp_t1 <- read_csv("???")
```

```{r include=FALSE, message=TRUE}
## I basically have to have 2 code chunks since I tell them to put the data files next to the project, and mine are in a separate folder called data - unless I'll turn this into a fixed path
library(tidyverse)
data_prp <- read_csv("data/prp_data_reduced.csv")
qrp_t1 <- read_csv("data/qrp_t1.csv")
```

:::

::: {.callout-caution collapse="true" icon="false"}

## Solution

```{r eval=FALSE}
library(tidyverse)
data_prp <- read_csv("prp_data_reduced.csv")
qrp_t1 <- read_csv("qrp_t1.csv")
```

:::

If you need a quick reminder what the dataset was about, have a look at the abstract in @sec-download_data_ch1. We also addressed the changes we made to the dataset there.

And remember to have a quick `glimpse()` at your data.


## Activity 3: Confidence in understanding Open Science practices

#### The main goal is to compute the mean Understanding score per participant. {.unnumbered}

The mean Understanding score for time point 2 has already been calculated (in the `Time2_Understanding_OS` column), but we still need to compute it for time point 1.

Looking at the Understanding data at time point 1, you determine that

* individual item columns are `r mcq(c(x = "numeric", answer = "character"))`, and
* according to the codebook, there are `r mcq(c(answer = "no", x = "some"))` reverse-coded items in this questionnaire.

The steps are quite similar to those for QRP, but we need to add an extra step: converting the character labels into numbers.

Again, let's do this step by step:

* **Step 1**: Select the relevant columns `Code`, and every Understanding column from time point 1 (e.g., from `Understanding_OS_1_Time1` to `Understanding_OS_12_Time1`) and store them in an object called `understanding_t1`
* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can recode the labels into values (step 3) and calculate the average score (in step 4) more easily
* **Step 3**: Recode the values "Not at all confident" as 1 and "Entirely confident" as 7. All other values are already numbers. We can use functions `mutate()` in combination with `case_match()` for that
* **Step 4**: Calculate the average Understanding Open Science score (`Time1_Understanding_OS`) per participant using `group_by()` and `summarise()`

#### Steps 1 and 2: Select and pivot {.unnumbered}

How about you try the first 2 steps yourself using the code from Chapter 2 Activity 4 (@sec-ch2_act4) as a template?

```{r understanding, eval=FALSE}
understanding_t1 <- data_prp %>% 
  select(???) %>% # Step 1
  pivot_longer(cols = ???, names_to = "???", values_to = "???") # Step 2
```

::: {.callout-caution collapse="true" icon="false"}

## Solution for steps 1 and 2

```{r}
understanding_t1 <- data_prp %>% 
  # Step 1
  select(Code, Understanding_OS_1_Time1:Understanding_OS_12_Time1) %>% 
  # Step 2 - I picked different column labels this time for some variety
  pivot_longer(cols = Understanding_OS_1_Time1:Understanding_OS_12_Time1, names_to = "Understanding_Qs", values_to = "Responses") 
```

:::

#### Step 3: recoding the values {.unnumbered}

OK, we now want to recode the values in the `Responses` column (or whatever name you picked for your column that has some of the numbers in it) so that "Not at all confident" = 1 and "Entirely confident" = 7. We want to keep all other values as they are (2-6 look already quite "numeric").

Let's create a new column `Responses_corrected` that stores the new values with `mutate()`. Then we can combine that with the `case_match()` function.

* The first argument in `case_match()` is the column name of the variable you want to recode.
* Then you can start recoding the values in the way of `CurrentValue ~ NewValue` (~ is a tilde). Make sure you use the `~` and not `=`!
* Lastly, the `.default` argument tells R what to do with values that are neither "Not at all confident" nor "Entirely confident". Here, we want to replace them with the original value of the `Responses` column. In other datasets, you may want to set the default to `NA` for missing values, a character string or a number, and `case_match()` is happy to oblige.

```{r error=TRUE}
understanding_t1 <- understanding_t1 %>% 
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = Responses # all other values taken from column Responses
  ))
```

::: {.callout-important collapse="true"}

## Error!!! Can you explain what is happening here?

Have a look at the error message. It's pretty helpful this time. It says `Can't combine ..1 (right) <double> and .default <character>.` It means that the replacement values are expected to be data type character since the original column type was type character.

:::

**So how do we fix this?** Actually, there are several ways this could be done. Click on the tabs below to check out 3 possible solutions.

::: {.panel-tabset group="layers"}

## Fix option 1

One option is to modify the `.default` argument `Responses` so that the values are copied over from the original column but as a number rather than the original character value. The function `as.numeric()` does the conversion.

```{r warning=FALSE}
understanding_t1_step3_v1 <- understanding_t1 %>% 
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type 
  ))
```

## Fix option 2

Change the numeric values on the right side of the `~` to character. Then in a second step, we would need to turn the character column into a numeric type. Again, we have several options to do so. We could either use the `parse_number()` function we encountered earlier during the demographics wrangling or the `as.numeric()` function.

* V1: `Responses_corrected = parse_number(Responses_corrected)`
* V2: `Responses_corrected = as.numeric(Responses_corrected)`

Just pay attention that you are still working *within* the `mutate()` function.

```{r}
understanding_t1_step3_v2 <- understanding_t1 %>% 
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ "1",
                                          "Entirely confident" ~ "7",
                                          .default = Responses # all other values taken from column Responses (character)
  ),
  Responses_corrected = parse_number(Responses_corrected)) # turning Responses_corrected into a numeric column
```


## Fix option 3

If you recode all the labels into numbers (e.g., "2" into 2, "3" into 3, etc.) from the start, you won’t need to perform any additional conversions later.

```{r}
understanding_t1_step3_v2 <- understanding_t1 %>% 
  mutate(Responses_recoded = case_match(Responses, # column of the values to recode
                                        "Not at all confident" ~ 1, # recode all of them
                                        "2" ~ 2,
                                        "3" ~ 3,
                                        "4" ~ 4,
                                        "5" ~ 5,
                                        "6" ~ 6,
                                        "Entirely confident" ~ 7))
```

:::

::: {.callout-note icon="false"}

## Your Turn

Choose the option that works best for you to **modify the code of `understanding_t1`** above that didn't work/ gave you an error message. Once you do that, you should be able to calculate the **mean Understanding Score per participant**. Store the average scores in a variable called `Time1_Understanding_OS`. If you need help, refer to the hint below or use Chapter 2 Activity 4 (@sec-ch2_act4) as guidance.

::: {.callout-caution icon="false" collapse="true"}

## One solution for Steps 3 and 4

```{r warning=FALSE}
understanding_t1 <- understanding_t1 %>% 
  # Step 3
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type 
  )) %>% 
  # Step 4: calculating averages per participant
  group_by(Code) %>%
  summarise(Time1_Understanding_OS = mean(Responses_corrected)) %>%
  ungroup()

```

:::

:::

Of course, this could have been written up as a single pipe.

::: {.callout-caution collapse="true" icon="false"}

## Single pipe of activity 3

```{r message=FALSE, warning=FALSE}
understanding_t1 <- data_prp %>% 
  # Step 1
  select(Code, Understanding_OS_1_Time1:Understanding_OS_12_Time1) %>% 
  # Step 2
  pivot_longer(cols = -Code, names_to = "Understanding_Qs", values_to = "Responses") %>% 
  # Step 3
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type 
  )) %>% 
  # Step 4
  group_by(Code) %>%
  summarise(Time1_Understanding_OS = mean(Responses_corrected)) %>%
  ungroup()
```

:::

## Activity 4: Survey of Attitudes Toward Statistics (SATS-28)

#### The main goal is to compute the mean SATS-28 score for each of the 4 subscales per participant for time point 1. {.unnumbered}

Looking at the SATS data at time point 1, you determine that

* individual item columns are `r mcq(c(answer = "numeric", x = "character"))`, and
* according to the codebook, there are `r mcq(c(x = "no", answer = "some"))` reverse-coded items in this questionnaire.
* Additionally, we are looking to compute the means for the 4 different subscales of the SAT-28 which are `r fitb("Affect",ignore_case = TRUE)`, `r fitb("Cognitive Competence",ignore_case = TRUE)`, `r fitb("Value",ignore_case = TRUE)`, and `r fitb("Difficulty",ignore_case = TRUE)`.

This scenario is slightly more tricky than the previous ones due to the reverse-coding and the 4 subscales. So, let's tackle this step by step again:

* **Step 1**: Select the relevant columns `Code`, and every SATS28 column from time point 1 (e.g., from `SATS28_1_Affect_Time1` to `SATS28_28_Difficulty_Time1`) and store them in an object called `sats_t1`
* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can recode the labels into values (step 3) and calculate the average score (in step 4) more easily
* **Step 3**: We need to know which items belong to which subscale - fortunately, we have that information in the variable name and can use the `separate()` function to access it.
* **Step 4**: We need to know which items are reverse-coded and then reverse-score them - unfortunately, the info is only in the codebook and we need to find a work-around. `case_when()` can help identify and re-score the reverse-coded items.
* **Step 5**: Calculate the average SATS score per participant and subscale using `group_by()` and `summarise()`
* **Step 6**: use `pivot_wider()` to spread out the dataframe into wide format and `rename()` to tidy up the column names

#### Steps 1 and 2: select and pivot {.unnumbered}

The selecting and pivoting are exactly the same way as we already practiced in the other 2 questionnaires. Apply them here to this questionnaire.

::: {.callout-note collapse="true" icon="false"}

## Hint

```{r SATS28, eval=FALSE}
sats_t1 <- data_prp %>% 
  select(???) %>% # Step 1
  pivot_longer(cols = ???, names_to = "???", values_to = "???") # Step 2
```

::: {.callout-caution collapse="true" icon="false"}

## Solution for steps 1 and 2

```{r}
sats_t1 <- data_prp %>% 
  select(Code, SATS28_1_Affect_Time1:SATS28_28_Difficulty_Time1) %>% # Step 1
  pivot_longer(cols = -Code, names_to = "Items", values_to = "Response") # Step 2
```

:::

:::

#### Step 3: separate Subscale information {.unnumbered}

If you look at the `Items` column more closely, you can see that there is information on the `Questionnaire`, the `Item_number`, the `Subscale`, and the `Timepoint` the data was collected at.

We can separate the information into separate columns using the `separate()` function. The function's first argument is the column to separate, then define `into` which columns you want the original column to split up, and lastly, define the separator `sep` (here an underscore). For our example, we would write:

* V1: `separate(Items, into = c("SATS", "Item_number", "Subscale", "Time"), sep = "_")`

However, we don't need all of those columns, so we could just drop the ones we are not interested in by replacing them with `NA`.

* V2: `separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_")`

We might also add an extra argument of `convert = TRUE` to have numeric columns (i.e., `Item_number`) converted to numeric as opposed to keeping them as characters. Saves us typing a few quotation marks later in Step 4.

```{r}
sats_t1 <- sats_t1 %>% 
  # Step 3
  separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_", convert = TRUE)

```

#### Step 4: identifying reverse-coded items and then correct them {.unnumbered}

We can use `case_when()` within the `mutate()` function here to create a new column `FW_RV` that stores information on whether the item is a reverse-coded item or not.

`case_when()` works similarly to `case_match()`, however `case_match()` only allows us to "recode" values (i.e., replace one value with another), whereas `case_when()` is more flexible. It allows us to use **conditional statements** on the left side of the tilde which is useful when you want to change only *some* of the data based on specific conditions.

Looking at the codebook, it seems that items 2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, and 28 are reverse-coded. The rest are forward-coded.

We want to tell R now, that

* **if** the `Item_number` is any of those numbers listed above, R should write "Reverse" into the new column `FW_RV` we are creating. Since we have a few possible matches for `Item_number`, we need the Boolean expression `%in%` rather than `==`.
* **if** `Item_number` is none of those numbers, then we would like the word "Forward" in the `FW_RV` column to appear. We can achieve that by specifying a `.default` argument again, but this time we want a "word" rather than a value from another column.

```{r}
sats_t1 <- sats_t1 %>% 
  mutate(FW_RV = case_when(
    Item_number %in% c(2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, 28) ~ "Reverse",
    .default = "Forward"
  ))
```

Moving on to correcting the scores: Once again, we can use `case_when ()` within the `mutate()` function to create another **conditional statement**. This time, the condition is:

* **if** `FW_RV` column has a value of "Reverse" then we would like to turn all 1 into 7, 2 into 6, etc.
* **if** `FW_RV` column has a value of "Forward" then we would like to keep the score from the `Response` column

There is a quick way and a not-so-quick way to achieve the actual **reverse-coding**.

* **Option 1 (quick)**: The easiest way to reverse-code scores is to take the maximum value of the scale, add 1 unit, and subtract the original value. For example, on a 5-point Likert scale, it would be 6 minus the original rating; for a 7-point Likert scale, 8 minus the original rating, etc. (see *Option 1* tab).
* **Option 2 (not so quick)**: This involves using two conditional statements (see *Option 2* tab).

Use the one you find more intuitive.

::: panel-tabset

## Option 1

Here we are using a Boolean expression to check if the string "Reverse" is present in the `FW_RV` column. If this condition is `TRUE`, the value in the new column we're creating, `Scores_corrected`, will be calculated as 8 minus the value from the Response column. If the condition is FALSE (handled by the .default argument), the original values from the `Response` column will be retained.

```{r}
sats_t1 <- sats_t1 %>% 
  mutate(Scores_corrected = case_when(
    FW_RV == "Reverse" ~ 8-Response,
    .default = Response
  ))
```

## Option 2

As stated above, the longer approach involves using two conditional statements. The first condition checks if the value in the `FW_RV` column is "Reverse", while the second condition checks if the value in the `Response` column equals a specific number. **When both conditions are met**, the corresponding value on the right side of the tilde is placed in the newly created `Scores_corrected_v2` column.

For example, line 3 would read: if the value in the `FW_RV` column is "Reverse" **AND** the value in the `Response` column is 1, then assign a value of 7 to the `Scores_corrected_v2` column.

```{r}
sats_t1 <- sats_t1 %>% 
  mutate(Scores_corrected_v2 = case_when(
    FW_RV == "Reverse" & Response == 1 ~ 7,
    FW_RV == "Reverse" & Response == 2 ~ 6,
    FW_RV == "Reverse" & Response == 3 ~ 5,
    # no need to recode 4 as 4
    FW_RV == "Reverse" & Response == 5 ~ 3,
    FW_RV == "Reverse" & Response == 6 ~ 2,
    FW_RV == "Reverse" & Response == 7 ~ 1,
    .default = Response
  ))
```

As you can see now in `sats_t1`, both columns `Scores_corrected` and `Scores_corrected_v2` are identical.

:::

One way to **check whether our reverse-coding worked** is by examining the `distinct` values in the original `Response` column and comparing them with the `Scores_corrected`. We should also retain the `FW_RV` column to observe how the reverse-coding applied.

To see the patterns more clearly, we can use `arrange()` to sort the values in a meaningful order. Remember, the default sorting order is ascending, so if you want to sort values in descending order, you’ll need to wrap your variable in the desc() function.

```{r}
check_coding <- sats_t1 %>% 
  distinct(FW_RV, Response, Scores_corrected) %>% 
  arrange(desc(FW_RV), Response)
```

::: {.callout-caution collapse="true" icon="false"}

## Show `check_coding` output

```{r}
check_coding
```

:::

#### Step 5 {.unnumbered}

Now that we know everything worked out as intended, we can calculate the mean scores of each subscale for each participant in `sats_t1`.

::: {.callout-note collapse="true" icon="false"}

## Hint

```{r eval=FALSE}
sats_t1 <- sats_t1 %>% 
  group_by(???, ???) %>% 
  summarise(mean_score = ???(???)) %>% 
  ungroup()
```

::: {.callout-caution collapse="true" icon="false"}

## Solution

```{r}
sats_t1 <- sats_t1 %>% 
  group_by(Code, Subscale) %>% 
  summarise(mean_score = mean(Scores_corrected)) %>% 
  ungroup()
```

:::

:::

#### Step 6 {.unnumbered}

The final step is to transform the data back into wide format, ensuring that each subscale has its own column. This will make it easier to join the data objects later on. In `pivot_wider()`, the first argument, `names_from`, specifies the column you want to use for your new column headings. The second argument, `values_from`, tells R which column should provide the cell values.

We should also **rename the column names** to match those in the codebook. Conveniently, we can use a function called `rename()` that works exactly like `select()` (following the pattern `new_name = old_name`), but it keeps all other column names the same rather than reducing the number of columns.

```{r}
sats_t1 <- sats_t1 %>% 
  pivot_wider(names_from = Subscale, values_from = mean_score) %>% 
  rename(SATS28_Affect_Time1_mean = Affect,
         SATS28_CognitiveCompetence_Time1_mean = CognitiveCompetence,
         SATS28_Value_Time1_mean = Value,
         SATS28_Difficulty_Time1_mean = Difficulty)
```

::: {.callout-caution collapse="true" icon="false"}

## Show final `sats_t1` output

```{r}
head(sats_t1, n = 5)
```

:::

Again, this could have been written up as a single pipe.

::: {.callout-caution collapse="true" icon="false"}

## Single pipe of activity 4

```{r message=FALSE}
sats_t1 <- data_prp %>% 
  # Step 1
  select(Code, SATS28_1_Affect_Time1:SATS28_28_Difficulty_Time1) %>% 
  # Step 2
  pivot_longer(cols = -Code, names_to = "Items", values_to = "Response") %>% 
  # Step 3
  separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_", convert = TRUE) %>% 
  # step 4
  mutate(FW_RV = case_when(
    Item_number %in% c(2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, 28) ~ "Reverse",
    .default = "Forward"
  ),
    Scores_corrected = case_when(
      FW_RV == "Reverse" ~ 8-Response,
      .default = Response
  )) %>% 
  # step 5
  group_by(Code, Subscale) %>% 
  summarise(mean_score = mean(Scores_corrected)) %>% 
  ungroup() %>% 
  # step 6
  pivot_wider(names_from = Subscale, values_from = mean_score) %>% 
  rename(SATS28_Affect_Time1_mean = Affect,
         SATS28_CognitiveCompetence_Time1_mean = CognitiveCompetence,
         SATS28_Value_Time1_mean = Value,
         SATS28_Difficulty_Time1_mean = Difficulty)
```

:::


## Activity 5 (Error Mode): Perceptions of supervisory support

#### The main goal is to compute the mean score for perceived supervisory support per participant. {.unnumbered}

Looking at the supervisory support data, you determine that

* individual item columns are `r mcq(c(answer = "numeric", x = "character"))`, and
* according to the codebook, there are `r mcq(c(x = "no", answer = "some"))` reverse-coded items in this questionnaire.

I have outlined my steps as follows:

* **Step 1**: Reverse-code the single column first because that's less hassle than having to do that with conditional statements (`Supervisor_15_R`). `mutate()` is my friend.
* **Step 2**: I want to filter out everyone who failed the attention check in `Supervisor_7`. I can do this with a Boolean expression within the `filter()` function. The correct response was "completely disagree" which is 1.
* **Step 3**: Select their id from time point 2 and all the columns that start with the word "super", apart from `Supervisor_7` and the original `Supervisor_15_R` column
* **Step 4**: pivot into long format so I can calculate the averages better
* **Step 5**: calculate the average scores per participant

I've started coding but there are some errors in my code. Help me find and fix all of them. Try to go through the code line by line and read the error messages.

```{r super_error, eval=FALSE}
super <- data_ppr %>% 
  mutate(Supervisor_15 = 9-supervisor_15_R) %>% 
  filter(Supervisor_7 = 1) %>% 
  select(Code, starts_with("Super"), -Supervisor_7, -Supervisor_15_R) 
pivot_wider(cols = -Code, names_to = "Item", values_to = "Response") %>% 
  group_by(Time2_Code) %>% 
  summarise(Mean_Supervisor_Support = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()

```

::: {.callout-note collapse="true" icon="false"}

## How many mistakes am I supposed to find?

There are 8 mistakes in the code.

:::

::: {.callout-caution collapse="true" icon="false"}

## Reveal solution

Did you spot all 8 mistakes? Let's go through them line by line.

```{r super_correct}
super <- data_prp %>% # spelling mistake in data object
  mutate(Supervisor_15 = 8-Supervisor_15_R) %>% # semantic error: 8 minus response for a 7-point scale and supervisor_15_R needs a capital S
  filter(Supervisor_7 == 1) %>% # needs a Boolean expression == instead of =
  select(Code, starts_with("Super"), -Supervisor_7, -Supervisor_15_R) %>% # no pipe at the end, the rest is actually legit
  pivot_longer(cols = -Code, names_to = "Item", values_to = "Response") %>% # pivot_longer instead of pivot_wider
  group_by(Code) %>% # Code rather than Time2_Code - the reduced dataset does not contain Time2_Code
  summarise(Mean_Supervisor_Support = mean(Response, na.rm = TRUE)) %>% # Score_corrected doesn't exist; needs to be Response
  ungroup()
```

* Note that the **semantic error** in line 2 will not give you an error message.
* Were you thrown off by the `starts_with("Super")` expression in line 4? `starts_with()` and `ends_with()` are great alternatives to selecting columns via `:` But, using `select(Code, Supervisor_1:Supervisor_6, Supervisor_8:Supervisor_14)` would have given us the same result. *[I admit, that one was perhaps a bit mean]*

:::

## Activity 6: Join everything together with `???_join()`

Time to join all the relevant data files into a single dataframe, which will be used in the next chapters on data visualization. There are four ways to join data: `inner_join()`, `left_join()`, `right_join()`, and `full_join()`. Each function behaves differently in terms of what information is retained from the two data objects. Here is a quick overview:

::: {.callout-note icon="false"}

## Info on mutating joins

You have 4 types of join functions you could make use of. Click on the panels to know more

::: panel-tabset

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

## `inner_join()`

`inner_join()` returns only the rows where the values in the column specified in the `by =` statement match in both tables.

![inner_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/inner-join.gif)

## `left_join()`

`left_join()` retains the complete first (left) table and adds values from the second (right) table that have matching values in the column specified in the `by =` statement. Rows in the left table with no match in the right table will have missing values (`NA`) in the new columns.

![left_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/left-join.gif)

## `right_join()`

`right_join()` retains the complete second (right) table and adds values from the first (left) table that have matching values in the column specified in the `by =` statement. Rows in the right table with no match in the left table will have missing values (`NA`) in the new columns.

![right_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/right-join.gif)

## `full_join()`

`full_join()` returns all rows and all columns from both tables. `NA` values fill unmatched rows.

![full_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/full-join.gif)

:::

:::

From our original `data_prp`, we need to select demographics data and all summarised questionnaire data from time point 2. Next, we will join this with all other aggregated datasets from time point 1 which are currently stored in separate data objects in the `Global Environment`.

While you may be familiar with `inner_join()` from last year, for this task, we want to retain all data from all the data objects. Therefore, we will use `full_join()`. Keep in mind, you can only join two data objects at a time, so the upcoming code chunk will involve a fair bit of piping and joining.

Note: Since I (Gaby) like my columns arranged in a meaningful way, I will use `select()` at the end to order them better.

```{r eval=FALSE}
data_prp_final <- data_prp %>% 
  select(Code:Plan_prereg, Pre_reg_group:Time2_Understanding_OS) %>% 
  full_join(qrp_t1) %>% 
  full_join(understanding_t1) %>% 
  full_join(sats_t1) %>% 
  full_join(super) %>% 
  select(Code:Plan_prereg, Pre_reg_group, SATS28_Affect_Time1_mean, SATS28_CognitiveCompetence_Time1_mean, SATS28_Value_Time1_mean, SATS28_Difficulty_Time1_mean, QRPs_Acceptance_Time1_mean, Time1_Understanding_OS, Other_OS_behav_2:Time2_Understanding_OS, Mean_Supervisor_Support)
```


::: {.callout-important icon="false"}
## No `by` argument in the code above? 

Note how I didn't include a `by` argument in the code above. If you leave `by =` out, R will join the 2 data objects by **ALL** columns that have the same name.

**Special case 1: matching column names but different values**

If you want more control, you should include the `by` argument; for example, if both data objects include a column `age` but data was recorded at 2 different time points. In that case, the information from both `age` columns should be retained and the `by` argument would not include `age`.

**Special case 2: different column names but matching values**

Another special case presents when both data objects contain identical information but the variable names don't match. Let's say, both data objects contain gender information, but in one data object the variable is named `gender` and in the other one `gender_label`. In that case, your `by` argument needs to be modified as: `by = join_by(gender == gender_label)`.

More info on joins can be found [https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/](https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/){target="_blank"}
:::

And this is basically the dataset we need for @sec-dataviz and @sec-dataviz2.


## Activity 7: Knit and export

Knit the `.Rmd` file to ensure everything runs as expected. Once it does, export the data object `data_prp_final` as a csv for use in the @sec-dataviz. Name it something meaningful, something like `data_prp_for_ch4.csv`.

::: {.callout-caution collapse="true" icon="false"}

## Solution

```{r eval=FALSE}
write_csv(data_prp_final, "data_prp_for_ch4.csv")
```

:::


## [Pair-coding]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}

```{r reading in data for me, echo=FALSE, message=FALSE}
library(tidyverse)

dog_data_raw <- read_csv("data/dog_data_raw.csv")
```

We will once again be working with data from Binfet et al. (2021), which focuses on the randomised controlled trials data involving therapy dog interventions. Today, our goal is to calculate the average `Loneliness` score for each participant measured at time point 1 (pre-intervention) using the raw data file `dog_data_raw`. Currently, the data looks like this:

```{r echo=FALSE}
loneliness_brief <- dog_data_raw %>% 
  select(RID, starts_with("L1"))

head(loneliness_brief, n = 5)
```

But we want the data to look like this:

```{r echo=FALSE}
loneliness_tidy <- dog_data_raw %>% 
  select(RID, starts_with("L1")) %>% 
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  mutate(Score_corrected = case_when(
    Names %in% c("L1_1", "L1_5", "L1_6", "L1_9", "L1_10", "L1_15", "L1_16", "L1_19", "L1_20") ~ 5-Response,
    .default = Response
    )) %>% 
  group_by(RID) %>% 
  summarise(Loneliness_pre = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()

head(loneliness_tidy, n = 5)
```

This task is a bit more challenging compared to last week's lab activity, as the Loneliness scale includes some reverse-coded items.

### Task 1: Open the R project for the lab {.unnumbered}

### Task 2: Open your `.Rmd` file from last week or create a new `.Rmd` file {.unnumbered}

You could continue the `.Rmd` file you used last week, or create a new `.Rmd`. If you need some guidance, have a look at @sec-rmd.

### Task 3: Load in the library and read in the data {.unnumbered}

The data should already be in your project folder. If you want a fresh copy, you can download the data again here: [data_pair_coding](data/data_pair_coding.zip "download").

We are using the package `tidyverse` today, and the datafile we should read in is `dog_data_raw.csv`.

::: {.callout-note collapse="true" icon="false"}

## Hint

```{r eval=FALSE}
# loading tidyverse into the library
library(???)

# reading in `dog_data_raw.csv`
dog_data_raw <- read_csv("???")
```

:::

### Task 4: Calculating the mean for `Loneliness_pre` {.unnumbered}

* **Step 1**: Select all relevant columns, such as the participant ID and all 20 items of the `Loneliness` questionnaire completed by participants before the intervention. Store this data in an object called `data_loneliness`.

::: {.callout-note collapse="true" icon="false"}

## Hint

Look at the codebook. Try to figure out

* the variable name of the column in which the participant id is stored, and
* which items relate to the Loneliness scale at Stage "pre"

::: {.callout-note collapse="true" icon="false"}

## More concrete hint

* the participant id column is called `RID`
* The Loneliness items at pre-intervention stage start with `L1_`

:::

:::

* **Step 2**: Pivot the data from wide format to long format so we can reverse-score and calculate the average score more easily (in step 3)

::: {.callout-note collapse="true" icon="false"}

## Hint

`pivot_`

We also need 3 arguments in that function:

* the columns we want to select (e.g., all the loneliness items),
* the name of the column in which the current column headings will be stored (e.g., "Qs"), and
* the name of the column that should store all the values (e.g., "Responses").

::: {.callout-note collapse="true" icon="false"}

## More concrete hint

```{r eval=FALSE}
  pivot_longer(cols = ???, names_to = "???", values_to = "???")
```

:::

:::

* **Step 3**: Reverse-scoring

Identify the items on the `Loneliness` scale that are reverse-coded, and then reverse-score them accordingly.

::: {.callout-note collapse="true" icon="false"}

## Hint

We need to figure out:

* which are the items of the loneliness scale we need to reverse-score
* what is the measuring scale of loneliness so we can determine the new values
* which function to use to create a new column that has the corrected scores in it
* which one of the `case_` functions will get us there

::: {.callout-note collapse="true" icon="false"}

## More concrete hint

* The items to be reverse-coded items can be found in the codebook: L1_1, L1_5, L1_6, L1_9, L1_10, L1_15, L1_16, L1_19, L1_20
* the loneliness scale ranges from 1 to 4, so we need to replace 1 with 4, 2 with 3, 3 with 2, and 4 with 1
* the function to create a new column `mutate()`
* it's a conditional statement rather than "just" replacing values, hence we need `case_when()`

```{r eval=FALSE}
  mutate(Score_corrected = case_when(
    ??? ~ ???,
    .default = ???
    ))
```

:::

:::

* **Step 4**: Calculate the average Loneliness score per participant. To match with the table above, we want to call this column `Loneliness_pre`

::: {.callout-note collapse="true" icon="false"}

## Hint

grouping and summarising

::: {.callout-note collapse="true" icon="false"}

## More concrete hint

```{r eval=FALSE}
  group_by(???) %>% 
  summarise(Loneliness_pre = ???(???)) %>% 
  ungroup()
```

:::

:::

::: {.callout-caution collapse="true" icon="false"}

## Solution

```{r eval=FALSE}
# loading tidyverse into the library
library(tidyverse)

# reading in `dog_data_raw.csv`
dog_data_raw <- read_csv("dog_data_raw.csv")

# Task 4: Tidying 
loneliness_tidy <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("L1")) %>% # select(RID, L1_1:L1_20) also works
  # Step 2
  pivot_longer(cols = -RID, names_to = "Qs", values_to = "Response") %>% 
  # Step 3
  mutate(Score_corrected = case_when(
    Qs %in% c("L1_1", "L1_5", "L1_6", "L1_9", "L1_10", "L1_15", "L1_16", "L1_19", "L1_20") ~ 5-Response,
    .default = Response
    )) %>% 
  # Step 4
  group_by(RID) %>% 
  summarise(Loneliness_pre = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()
```

:::

## [Test your knowledge and challenge yourself]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}

### Knowledge check {.unnumbered}


#### Question 1 {.unnumbered}

When using `mutate()`, which additional function could you use to recode an existing variable? `r mcq(c(x = "arrange()", x = "filter()", answer = "case_match()", x = "case_when()"))`


#### Question 2 {.unnumbered}

When using `mutate()`, which additional function could you use to create a new variable based on one or multiple conditional statements? `r mcq(c(x = "arrange()", x = "filter()", x = "case_match()", answer = "case_when()"))`


#### Question 3 {.unnumbered}

Which of the following functions would you use if you wanted to join two data sets by their shared identifier? `r mcq(c(answer = "inner_join()", x = "left_join()", x = "right_join()", x = "full_join()"))`


#### Question 4 {.unnumbered}

Your data object contains a column `Score` with numbers, but they have been read in incorrectly as a character datatype. Which of the following functions would *not* work for fixing this issue? `r mcq(c(x = "parse_number()", answer = "factor(Score)", x = "mutate(Score = as.numeric(Score))", x = "as.numeric()"))`

::: {.callout-caution collapse="true" icon="false"}

## Explain this answer

* `parse_number()` from the `readr` package extracts numeric values from strings, so this would work.
* `factor(Score)`: This would *not* work as expected because it converts the column into a factor, not a numeric datatype, leading to incorrect results if numeric operations are needed.
* `mutate(Score = as.numeric(Score))`: This would work too because `mutate()` can be used in combination with `as.numeric()` to create a new numeric column or override the existing character column.
* `as.numeric()`: This would also work to convert a character column to numeric. Without mutate, you could use it in a BaseR way, e.g., `data$Score <- as.numeric(data$Score)` (*shudder, BaseR!!! But effective*)

:::


### Challenge yourself {.unnumbered}

If you want to **challenge yourself** and further apply the skills from Chapter 3, you could wrangle the data from `dog_data_raw` for one of the other questionnaires. There are plenty of options to choose from:

::: {.callout-tip collapse="true" icon="false"}

## Difficulty level: easy

* recode column `Live_Pets` so the values read yes and no rather than 1 and 2
* recode `Year_of_Study` so they have the labels from the codebook rather than the numbers
* reverse-code the `Homesickness` scale for `_pre` and `_post`
* renaming the columns of the other one-item scales as `Stress_pre`, `Stress_post`, `Engagement_pre` and `Engagement_post`

Any of these tasks should be doable in one step. No need to select or pivot anything. You could just modify `dog_data_raw`.

::: {.callout-note collapse="true" icon="false"}

## Hints

* For the **recoding tasks**, you need to work out which function to use to recode one value as another - just plain replacing, no conditional statements
* The **reverse-coding** might sound daunting to do in one step, but it is only a single value that needs to be recoded. Take some inspiration from Activity 5 (error mode).
* For the **renaming tasks**, check how you would change column names without reducing the number of columns overall
:::

::: {.callout-caution collapse="true" icon="false"}

## Solution for **Challenge yourself - easy**

```{r eval=FALSE}
## Live_Pets
dog_data_raw <- dog_data_raw %>%
  mutate(Live_Pets = case_match(Live_Pets,
                                1 ~ "yes",
                                2 ~ "no"))
```

```{r eval=FALSE}
## Year of Study
dog_data_raw <- dog_data_raw %>%
  mutate(Year_of_Study = case_match(Year_of_Study,
                                    1 ~ "First",
                                    2 ~ "Second",
                                    3 ~ "Third",
                                    4 ~ "Fourth",
                                    5 ~ "Fifth or above"))
```

```{r eval=FALSE}
## Reverse-coding of homesickness pre and post. It's a 5-point scale, hence you'd calculate 6-the original response column
dog_data_raw <- dog_data_raw %>% 
  mutate(Homesick_pre = 6-HO1_1,
         Homesick_post = 6-HO2_1)
```

```{r eval=FALSE}
## Renaming of Stress and Engagement
dog_data_raw <- dog_data_raw %>% 
  rename(Stress_pre = S1_1, Stress_post = S2_1, Engagement_pre = HO1_2, Engagement_post = HO2_2)
```
:::
:::

::: {.callout-warning collapse="true" icon="false"}

## Difficulty level: medium

* reverse-code the Social connectedness scale (pre-intervention) and compute a mean score per participant

::: {.callout-note collapse="true" icon="false"}

## Hints

This task would take 4 steps to complete. These are the exact same steps we applied to `Loneliness_pre` in the lab activity. You would just need to figure out which items are related to the Social connectedness scale (pre-intervention) and which ones of those are reverse-coded. The codebook has all the answers.
:::

::: {.callout-caution collapse="true" icon="false"}

## Solution for **Challenge yourself - medium**

```{r eval=FALSE}
## SCS pre
scs_pre <- dog_data_raw %>% 
  select(RID, starts_with("SC1")) %>% 
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  mutate(Score_corrected = case_when(
    Names %in% c("SC1_3", "SC1_6", "SC1_7", "SC1_9", "SC1_11", "SC1_13", "SC1_15", "SC1_17", "SC1_18", "SC1_20") ~ 7-Response,
    .default = Response
    )) %>% 
  group_by(RID) %>% 
  summarise(SCS_pre = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()

```
:::
:::

::: {.callout-caution collapse="true" icon="false"}

## Difficulty level: hard

* reverse-code the Loneliness scale (post-intervention) and compute a mean score per participant
* reverse-code the Social connectedness scale (post-intervention) and compute a mean score per participant

Both activities are similar to Activity 3 from the individual walkthrough and would take about 5 steps to complete. **Start by mapping out the steps**.

::: {.callout-note collapse="true" icon="false"}

## Hints

* **Step 1**: Select all relevant columns, such as participant ID and all the items that belong to the questionnaire that participants completed after the intervention
* **Step 2**: Pivot the data from wide format to long format so we can reverse-score and calculate the average score more easily
* **Step 3**: Recode the initial responses so that the new column has numbers instead of labels
* **Step 4**: Reverse-score the items that are labelled as "Reverse" in the codebook and then reverse-score them
* **Step 5**: Group by and summarise to calculate the mean Score
:::

::: {.callout-caution collapse="true" icon="false"}

## Solution for **Challenge yourself - hard**

```{r eval=FALSE}
## loneliness post
lonely_post <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("L2")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  # Step 3
  mutate(Score = case_match(Response,
                            "never" ~ 1,
                            "rarely" ~ 2,
                            "sometimes" ~ 3,
                            "often" ~ 4,
                            .default = NA
  ),
  # Step 4 - we are still in the same mutate function (count the brackets)
        Score_corrected = case_when(
          Names %in% c("L2_1", "L2_5", "L2_6", "L2_9", "L2_10", "L2_15", "L2_16", "L2_19", "L2_20") ~ 5-Score,
          .default = Score
  )) %>% 
  # Step 5
  group_by(RID) %>% 
  summarise(Loneliness_post = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()

```

```{r eval=FALSE}
## SCS post
scs_post <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("SC2")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  # Step 3
  mutate(Response = case_match(Response,
                               "strongly disagree" ~ "1",
                               "strongly agree" ~ "6",
                               .default = Response),
         Response = parse_number(Response),
  # Step 4 - we are still in the same mutate function (count the brackets)
         Score_corrected = case_when(
           Names %in% c("SC2_3", "SC2_6", "SC2_7", "SC2_9", "SC2_11", "SC2_13", "SC2_15", "SC2_17", "SC2_18", "SC2_20") ~ 7-Response,
           .default = Response
         )) %>% 
  # Step 5
  group_by(RID) %>% 
  summarise(SCS_post = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()
```
:::
:::

::: {.callout-important collapse="true" icon="false"}

## Difficulty level: extra hard

* PANAS: positive and negative affect of pre- and post-intervention in a single pipe rather than in 4 different data objects (see last week's)

This task would take about 7 steps to get it from

```{r echo=FALSE}
PANAS <- dog_data_raw %>% 
  select(RID, starts_with("PN"))

head(PANAS, n = 5)
```

to

```{r echo=FALSE, message=FALSE}
PANAS <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("PN")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Items", values_to = "Scores") %>% 
  # Step 3
  separate(Items, into = c("Stage", "Item_number"), sep = "_", convert = TRUE) %>% 
  # Step 4 recode Stage
  mutate(Stage = case_match(Stage,
                            "PN1" ~ "pre",
                            "PN2" ~ "post")) %>% 
  # Step 5 identify subscales by item number
  mutate(Subscales = case_when(
    Item_number %in% c(3, 5, 7, 8, 10) ~ "PANAS_PA",
    .default = "PANAS_NA"
  )) %>% 
  # Step 6 
  group_by(RID, Stage, Subscales) %>% 
  summarise(Score = mean(Scores)) %>% 
  ungroup() %>% 
  # Step 7 - to make the data look like the data in `dog_data_clean_long.csv`
  pivot_wider(names_from = Subscales, values_from = Score)

head(PANAS, n = 5)
```

::: {.callout-note collapse="true" icon="false"}

## Hints

**Start by mapping out the steps**

* **Step 1**: select all relevant columns, such as participant ID and all the items that belong to PANAs scale (pos, neg, pre, and post)
* **Step 2**: pivot the data from wide format to long format. You want to do that for ALL columns that are not the participant id. The data object should have 3 columns and 5680 observations, i.e. each participant has 20 rows.
* **Step 3**: All of the items will have the structure `PN1_1`. Use separate to split the information across 2 columns. First column has information about the `Stage`, second column should turn into an `Item_number` and it should convert into a numeric column in the process to save you typing quotation marks in Step 5.\
* **Step 4**: recode the `Stage` column you just created so that everything that starts with PN1 relates to "pre" and PN2 as post.
* **Step 5**: identify the subscales positive affect (PA) and negative affect (NA) by item number and recode them. This requires a conditional statement.
* **Step 6**: group by and summarise to calculate the mean Score
* **Step 7**: pivot, so that you have the 2 PANAS subscales presented in separate columns (see table above). You might need an extra step if the columns aren't labelled exactly as shown in the table above.
:::

::: {.callout-caution collapse="true" icon="false"}

## Solution for **Challenge yourself - extra hard**

```{r eval=FALSE}
PANAS <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("PN")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Items", values_to = "Scores") %>% 
  # Step 3
  separate(Items, into = c("Stage", "Item_number"), sep = "_", convert = TRUE) %>% 
  # Step 4 recode Stage
  mutate(Stage = case_match(Stage,
                            "PN1" ~ "pre",
                            "PN2" ~ "post")) %>% 
  # Step 5 identify subscales by item number
  mutate(Subscales = case_when(
    Item_number %in% c(3, 5, 7, 8, 10) ~ "PANAS_PA",
    .default = "PANAS_NA"
  )) %>% 
  # Step 6 
  group_by(RID, Stage, Subscales) %>% 
  summarise(Score = mean(Scores)) %>% 
  ungroup() %>% 
  # Step 7 - to make the data look like the data in `dog_data_clean_long.csv`
  pivot_wider(names_from = Subscales, values_from = Score)
```
:::
:::