-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-wrangling2.qmd
1150 lines (788 loc) · 48.1 KB
/
03-wrangling2.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Data wrangling II {#sec-wrangling2}
## Intended Learning Outcomes {.unnumbered}
By the end of this chapter, you should be able to:
- apply familiar data wrangling functions to novel datasets
- read and interpret error messages
- realise there are several ways of getting to the results
In this chapter, we will pick up where we left off in @sec-wrangling. We will calculate average scores for two of the questionnaires, address an error mode problem, and finally, join all data objects together. This will finalise our data for the upcoming data visualization sections (@sec-dataviz and @sec-dataviz2).
## [Individual Walkthrough]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}
## Activity 1: Setup
* Go to the project folder we have been using in the last two weeks and double-click on the project icon to **open the project** in RStudio
* Either **Create a new `.Rmd` file** for chapter 3 and save it to your project folder or continue the one from last week. See @sec-rmd if you need some guidance.
## Activity 2: Load in the libraries and read in the data
Today, we will be using `tidyverse` along with the two csv files created at the end of the last chapter: `data_prp_for_ch3.csv` and `qrp_t1.csv`. If you need to download them again for any reason, click on the following links: [data_prp_for_ch3.csv](data/data_prp_for_ch3.csv "download") and [qrp_t1.csv](data/qrp_t1.csv "download").
::: {.callout-note collapse="true" icon="false"}
## Hint
```{r eval=FALSE}
library(???)
data_prp <- read_csv("???")
qrp_t1 <- read_csv("???")
```
```{r include=FALSE, message=TRUE}
## I basically have to have 2 code chunks since I tell them to put the data files next to the project, and mine are in a separate folder called data - unless I'll turn this into a fixed path
library(tidyverse)
data_prp <- read_csv("data/prp_data_reduced.csv")
qrp_t1 <- read_csv("data/qrp_t1.csv")
```
:::
::: {.callout-caution collapse="true" icon="false"}
## Solution
```{r eval=FALSE}
library(tidyverse)
data_prp <- read_csv("prp_data_reduced.csv")
qrp_t1 <- read_csv("qrp_t1.csv")
```
:::
If you need a quick reminder what the dataset was about, have a look at the abstract in @sec-download_data_ch1. We also addressed the changes we made to the dataset there.
And remember to have a quick `glimpse()` at your data.
## Activity 3: Confidence in understanding Open Science practices
#### The main goal is to compute the mean Understanding score per participant. {.unnumbered}
The mean Understanding score for time point 2 has already been calculated (in the `Time2_Understanding_OS` column), but we still need to compute it for time point 1.
Looking at the Understanding data at time point 1, you determine that
* individual item columns are `r mcq(c(x = "numeric", answer = "character"))`, and
* according to the codebook, there are `r mcq(c(answer = "no", x = "some"))` reverse-coded items in this questionnaire.
The steps are quite similar to those for QRP, but we need to add an extra step: converting the character labels into numbers.
Again, let's do this step by step:
* **Step 1**: Select the relevant columns `Code`, and every Understanding column from time point 1 (e.g., from `Understanding_OS_1_Time1` to `Understanding_OS_12_Time1`) and store them in an object called `understanding_t1`
* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can recode the labels into values (step 3) and calculate the average score (in step 4) more easily
* **Step 3**: Recode the values "Not at all confident" as 1 and "Entirely confident" as 7. All other values are already numbers. We can use functions `mutate()` in combination with `case_match()` for that
* **Step 4**: Calculate the average Understanding Open Science score (`Time1_Understanding_OS`) per participant using `group_by()` and `summarise()`
#### Steps 1 and 2: Select and pivot {.unnumbered}
How about you try the first 2 steps yourself using the code from Chapter 2 Activity 4 (@sec-ch2_act4) as a template?
```{r understanding, eval=FALSE}
understanding_t1 <- data_prp %>%
select(???) %>% # Step 1
pivot_longer(cols = ???, names_to = "???", values_to = "???") # Step 2
```
::: {.callout-caution collapse="true" icon="false"}
## Solution for steps 1 and 2
```{r}
understanding_t1 <- data_prp %>%
# Step 1
select(Code, Understanding_OS_1_Time1:Understanding_OS_12_Time1) %>%
# Step 2 - I picked different column labels this time for some variety
pivot_longer(cols = Understanding_OS_1_Time1:Understanding_OS_12_Time1, names_to = "Understanding_Qs", values_to = "Responses")
```
:::
#### Step 3: recoding the values {.unnumbered}
OK, we now want to recode the values in the `Responses` column (or whatever name you picked for your column that has some of the numbers in it) so that "Not at all confident" = 1 and "Entirely confident" = 7. We want to keep all other values as they are (2-6 look already quite "numeric").
Let's create a new column `Responses_corrected` that stores the new values with `mutate()`. Then we can combine that with the `case_match()` function.
* The first argument in `case_match()` is the column name of the variable you want to recode.
* Then you can start recoding the values in the way of `CurrentValue ~ NewValue` (~ is a tilde). Make sure you use the `~` and not `=`!
* Lastly, the `.default` argument tells R what to do with values that are neither "Not at all confident" nor "Entirely confident". Here, we want to replace them with the original value of the `Responses` column. In other datasets, you may want to set the default to `NA` for missing values, a character string or a number, and `case_match()` is happy to oblige.
```{r error=TRUE}
understanding_t1 <- understanding_t1 %>%
mutate(Responses_corrected = case_match(Responses, # column of the values to recode
"Not at all confident" ~ 1, # values to recode
"Entirely confident" ~ 7,
.default = Responses # all other values taken from column Responses
))
```
::: {.callout-important collapse="true"}
## Error!!! Can you explain what is happening here?
Have a look at the error message. It's pretty helpful this time. It says `Can't combine ..1 (right) <double> and .default <character>.` It means that the replacement values are expected to be data type character since the original column type was type character.
:::
**So how do we fix this?** Actually, there are several ways this could be done. Click on the tabs below to check out 3 possible solutions.
::: {.panel-tabset group="layers"}
## Fix option 1
One option is to modify the `.default` argument `Responses` so that the values are copied over from the original column but as a number rather than the original character value. The function `as.numeric()` does the conversion.
```{r warning=FALSE}
understanding_t1_step3_v1 <- understanding_t1 %>%
mutate(Responses_corrected = case_match(Responses, # column of the values to recode
"Not at all confident" ~ 1, # values to recode
"Entirely confident" ~ 7,
.default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type
))
```
## Fix option 2
Change the numeric values on the right side of the `~` to character. Then in a second step, we would need to turn the character column into a numeric type. Again, we have several options to do so. We could either use the `parse_number()` function we encountered earlier during the demographics wrangling or the `as.numeric()` function.
* V1: `Responses_corrected = parse_number(Responses_corrected)`
* V2: `Responses_corrected = as.numeric(Responses_corrected)`
Just pay attention that you are still working *within* the `mutate()` function.
```{r}
understanding_t1_step3_v2 <- understanding_t1 %>%
mutate(Responses_corrected = case_match(Responses, # column of the values to recode
"Not at all confident" ~ "1",
"Entirely confident" ~ "7",
.default = Responses # all other values taken from column Responses (character)
),
Responses_corrected = parse_number(Responses_corrected)) # turning Responses_corrected into a numeric column
```
## Fix option 3
If you recode all the labels into numbers (e.g., "2" into 2, "3" into 3, etc.) from the start, you won’t need to perform any additional conversions later.
```{r}
understanding_t1_step3_v2 <- understanding_t1 %>%
mutate(Responses_recoded = case_match(Responses, # column of the values to recode
"Not at all confident" ~ 1, # recode all of them
"2" ~ 2,
"3" ~ 3,
"4" ~ 4,
"5" ~ 5,
"6" ~ 6,
"Entirely confident" ~ 7))
```
:::
::: {.callout-note icon="false"}
## Your Turn
Choose the option that works best for you to **modify the code of `understanding_t1`** above that didn't work/ gave you an error message. Once you do that, you should be able to calculate the **mean Understanding Score per participant**. Store the average scores in a variable called `Time1_Understanding_OS`. If you need help, refer to the hint below or use Chapter 2 Activity 4 (@sec-ch2_act4) as guidance.
::: {.callout-caution icon="false" collapse="true"}
## One solution for Steps 3 and 4
```{r warning=FALSE}
understanding_t1 <- understanding_t1 %>%
# Step 3
mutate(Responses_corrected = case_match(Responses, # column of the values to recode
"Not at all confident" ~ 1, # values to recode
"Entirely confident" ~ 7,
.default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type
)) %>%
# Step 4: calculating averages per participant
group_by(Code) %>%
summarise(Time1_Understanding_OS = mean(Responses_corrected)) %>%
ungroup()
```
:::
:::
Of course, this could have been written up as a single pipe.
::: {.callout-caution collapse="true" icon="false"}
## Single pipe of activity 3
```{r message=FALSE, warning=FALSE}
understanding_t1 <- data_prp %>%
# Step 1
select(Code, Understanding_OS_1_Time1:Understanding_OS_12_Time1) %>%
# Step 2
pivot_longer(cols = -Code, names_to = "Understanding_Qs", values_to = "Responses") %>%
# Step 3
mutate(Responses_corrected = case_match(Responses, # column of the values to recode
"Not at all confident" ~ 1, # values to recode
"Entirely confident" ~ 7,
.default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type
)) %>%
# Step 4
group_by(Code) %>%
summarise(Time1_Understanding_OS = mean(Responses_corrected)) %>%
ungroup()
```
:::
## Activity 4: Survey of Attitudes Toward Statistics (SATS-28)
#### The main goal is to compute the mean SATS-28 score for each of the 4 subscales per participant for time point 1. {.unnumbered}
Looking at the SATS data at time point 1, you determine that
* individual item columns are `r mcq(c(answer = "numeric", x = "character"))`, and
* according to the codebook, there are `r mcq(c(x = "no", answer = "some"))` reverse-coded items in this questionnaire.
* Additionally, we are looking to compute the means for the 4 different subscales of the SAT-28 which are `r fitb("Affect",ignore_case = TRUE)`, `r fitb("Cognitive Competence",ignore_case = TRUE)`, `r fitb("Value",ignore_case = TRUE)`, and `r fitb("Difficulty",ignore_case = TRUE)`.
This scenario is slightly more tricky than the previous ones due to the reverse-coding and the 4 subscales. So, let's tackle this step by step again:
* **Step 1**: Select the relevant columns `Code`, and every SATS28 column from time point 1 (e.g., from `SATS28_1_Affect_Time1` to `SATS28_28_Difficulty_Time1`) and store them in an object called `sats_t1`
* **Step 2**: Pivot the data from wide format to long format using `pivot_longer()` so we can recode the labels into values (step 3) and calculate the average score (in step 4) more easily
* **Step 3**: We need to know which items belong to which subscale - fortunately, we have that information in the variable name and can use the `separate()` function to access it.
* **Step 4**: We need to know which items are reverse-coded and then reverse-score them - unfortunately, the info is only in the codebook and we need to find a work-around. `case_when()` can help identify and re-score the reverse-coded items.
* **Step 5**: Calculate the average SATS score per participant and subscale using `group_by()` and `summarise()`
* **Step 6**: use `pivot_wider()` to spread out the dataframe into wide format and `rename()` to tidy up the column names
#### Steps 1 and 2: select and pivot {.unnumbered}
The selecting and pivoting are exactly the same way as we already practiced in the other 2 questionnaires. Apply them here to this questionnaire.
::: {.callout-note collapse="true" icon="false"}
## Hint
```{r SATS28, eval=FALSE}
sats_t1 <- data_prp %>%
select(???) %>% # Step 1
pivot_longer(cols = ???, names_to = "???", values_to = "???") # Step 2
```
::: {.callout-caution collapse="true" icon="false"}
## Solution for steps 1 and 2
```{r}
sats_t1 <- data_prp %>%
select(Code, SATS28_1_Affect_Time1:SATS28_28_Difficulty_Time1) %>% # Step 1
pivot_longer(cols = -Code, names_to = "Items", values_to = "Response") # Step 2
```
:::
:::
#### Step 3: separate Subscale information {.unnumbered}
If you look at the `Items` column more closely, you can see that there is information on the `Questionnaire`, the `Item_number`, the `Subscale`, and the `Timepoint` the data was collected at.
We can separate the information into separate columns using the `separate()` function. The function's first argument is the column to separate, then define `into` which columns you want the original column to split up, and lastly, define the separator `sep` (here an underscore). For our example, we would write:
* V1: `separate(Items, into = c("SATS", "Item_number", "Subscale", "Time"), sep = "_")`
However, we don't need all of those columns, so we could just drop the ones we are not interested in by replacing them with `NA`.
* V2: `separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_")`
We might also add an extra argument of `convert = TRUE` to have numeric columns (i.e., `Item_number`) converted to numeric as opposed to keeping them as characters. Saves us typing a few quotation marks later in Step 4.
```{r}
sats_t1 <- sats_t1 %>%
# Step 3
separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_", convert = TRUE)
```
#### Step 4: identifying reverse-coded items and then correct them {.unnumbered}
We can use `case_when()` within the `mutate()` function here to create a new column `FW_RV` that stores information on whether the item is a reverse-coded item or not.
`case_when()` works similarly to `case_match()`, however `case_match()` only allows us to "recode" values (i.e., replace one value with another), whereas `case_when()` is more flexible. It allows us to use **conditional statements** on the left side of the tilde which is useful when you want to change only *some* of the data based on specific conditions.
Looking at the codebook, it seems that items 2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, and 28 are reverse-coded. The rest are forward-coded.
We want to tell R now, that
* **if** the `Item_number` is any of those numbers listed above, R should write "Reverse" into the new column `FW_RV` we are creating. Since we have a few possible matches for `Item_number`, we need the Boolean expression `%in%` rather than `==`.
* **if** `Item_number` is none of those numbers, then we would like the word "Forward" in the `FW_RV` column to appear. We can achieve that by specifying a `.default` argument again, but this time we want a "word" rather than a value from another column.
```{r}
sats_t1 <- sats_t1 %>%
mutate(FW_RV = case_when(
Item_number %in% c(2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, 28) ~ "Reverse",
.default = "Forward"
))
```
Moving on to correcting the scores: Once again, we can use `case_when ()` within the `mutate()` function to create another **conditional statement**. This time, the condition is:
* **if** `FW_RV` column has a value of "Reverse" then we would like to turn all 1 into 7, 2 into 6, etc.
* **if** `FW_RV` column has a value of "Forward" then we would like to keep the score from the `Response` column
There is a quick way and a not-so-quick way to achieve the actual **reverse-coding**.
* **Option 1 (quick)**: The easiest way to reverse-code scores is to take the maximum value of the scale, add 1 unit, and subtract the original value. For example, on a 5-point Likert scale, it would be 6 minus the original rating; for a 7-point Likert scale, 8 minus the original rating, etc. (see *Option 1* tab).
* **Option 2 (not so quick)**: This involves using two conditional statements (see *Option 2* tab).
Use the one you find more intuitive.
::: panel-tabset
## Option 1
Here we are using a Boolean expression to check if the string "Reverse" is present in the `FW_RV` column. If this condition is `TRUE`, the value in the new column we're creating, `Scores_corrected`, will be calculated as 8 minus the value from the Response column. If the condition is FALSE (handled by the .default argument), the original values from the `Response` column will be retained.
```{r}
sats_t1 <- sats_t1 %>%
mutate(Scores_corrected = case_when(
FW_RV == "Reverse" ~ 8-Response,
.default = Response
))
```
## Option 2
As stated above, the longer approach involves using two conditional statements. The first condition checks if the value in the `FW_RV` column is "Reverse", while the second condition checks if the value in the `Response` column equals a specific number. **When both conditions are met**, the corresponding value on the right side of the tilde is placed in the newly created `Scores_corrected_v2` column.
For example, line 3 would read: if the value in the `FW_RV` column is "Reverse" **AND** the value in the `Response` column is 1, then assign a value of 7 to the `Scores_corrected_v2` column.
```{r}
sats_t1 <- sats_t1 %>%
mutate(Scores_corrected_v2 = case_when(
FW_RV == "Reverse" & Response == 1 ~ 7,
FW_RV == "Reverse" & Response == 2 ~ 6,
FW_RV == "Reverse" & Response == 3 ~ 5,
# no need to recode 4 as 4
FW_RV == "Reverse" & Response == 5 ~ 3,
FW_RV == "Reverse" & Response == 6 ~ 2,
FW_RV == "Reverse" & Response == 7 ~ 1,
.default = Response
))
```
As you can see now in `sats_t1`, both columns `Scores_corrected` and `Scores_corrected_v2` are identical.
:::
One way to **check whether our reverse-coding worked** is by examining the `distinct` values in the original `Response` column and comparing them with the `Scores_corrected`. We should also retain the `FW_RV` column to observe how the reverse-coding applied.
To see the patterns more clearly, we can use `arrange()` to sort the values in a meaningful order. Remember, the default sorting order is ascending, so if you want to sort values in descending order, you’ll need to wrap your variable in the desc() function.
```{r}
check_coding <- sats_t1 %>%
distinct(FW_RV, Response, Scores_corrected) %>%
arrange(desc(FW_RV), Response)
```
::: {.callout-caution collapse="true" icon="false"}
## Show `check_coding` output
```{r}
check_coding
```
:::
#### Step 5 {.unnumbered}
Now that we know everything worked out as intended, we can calculate the mean scores of each subscale for each participant in `sats_t1`.
::: {.callout-note collapse="true" icon="false"}
## Hint
```{r eval=FALSE}
sats_t1 <- sats_t1 %>%
group_by(???, ???) %>%
summarise(mean_score = ???(???)) %>%
ungroup()
```
::: {.callout-caution collapse="true" icon="false"}
## Solution
```{r}
sats_t1 <- sats_t1 %>%
group_by(Code, Subscale) %>%
summarise(mean_score = mean(Scores_corrected)) %>%
ungroup()
```
:::
:::
#### Step 6 {.unnumbered}
The final step is to transform the data back into wide format, ensuring that each subscale has its own column. This will make it easier to join the data objects later on. In `pivot_wider()`, the first argument, `names_from`, specifies the column you want to use for your new column headings. The second argument, `values_from`, tells R which column should provide the cell values.
We should also **rename the column names** to match those in the codebook. Conveniently, we can use a function called `rename()` that works exactly like `select()` (following the pattern `new_name = old_name`), but it keeps all other column names the same rather than reducing the number of columns.
```{r}
sats_t1 <- sats_t1 %>%
pivot_wider(names_from = Subscale, values_from = mean_score) %>%
rename(SATS28_Affect_Time1_mean = Affect,
SATS28_CognitiveCompetence_Time1_mean = CognitiveCompetence,
SATS28_Value_Time1_mean = Value,
SATS28_Difficulty_Time1_mean = Difficulty)
```
::: {.callout-caution collapse="true" icon="false"}
## Show final `sats_t1` output
```{r}
head(sats_t1, n = 5)
```
:::
Again, this could have been written up as a single pipe.
::: {.callout-caution collapse="true" icon="false"}
## Single pipe of activity 4
```{r message=FALSE}
sats_t1 <- data_prp %>%
# Step 1
select(Code, SATS28_1_Affect_Time1:SATS28_28_Difficulty_Time1) %>%
# Step 2
pivot_longer(cols = -Code, names_to = "Items", values_to = "Response") %>%
# Step 3
separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_", convert = TRUE) %>%
# step 4
mutate(FW_RV = case_when(
Item_number %in% c(2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, 28) ~ "Reverse",
.default = "Forward"
),
Scores_corrected = case_when(
FW_RV == "Reverse" ~ 8-Response,
.default = Response
)) %>%
# step 5
group_by(Code, Subscale) %>%
summarise(mean_score = mean(Scores_corrected)) %>%
ungroup() %>%
# step 6
pivot_wider(names_from = Subscale, values_from = mean_score) %>%
rename(SATS28_Affect_Time1_mean = Affect,
SATS28_CognitiveCompetence_Time1_mean = CognitiveCompetence,
SATS28_Value_Time1_mean = Value,
SATS28_Difficulty_Time1_mean = Difficulty)
```
:::
## Activity 5 (Error Mode): Perceptions of supervisory support
#### The main goal is to compute the mean score for perceived supervisory support per participant. {.unnumbered}
Looking at the supervisory support data, you determine that
* individual item columns are `r mcq(c(answer = "numeric", x = "character"))`, and
* according to the codebook, there are `r mcq(c(x = "no", answer = "some"))` reverse-coded items in this questionnaire.
I have outlined my steps as follows:
* **Step 1**: Reverse-code the single column first because that's less hassle than having to do that with conditional statements (`Supervisor_15_R`). `mutate()` is my friend.
* **Step 2**: I want to filter out everyone who failed the attention check in `Supervisor_7`. I can do this with a Boolean expression within the `filter()` function. The correct response was "completely disagree" which is 1.
* **Step 3**: Select their id from time point 2 and all the columns that start with the word "super", apart from `Supervisor_7` and the original `Supervisor_15_R` column
* **Step 4**: pivot into long format so I can calculate the averages better
* **Step 5**: calculate the average scores per participant
I've started coding but there are some errors in my code. Help me find and fix all of them. Try to go through the code line by line and read the error messages.
```{r super_error, eval=FALSE}
super <- data_ppr %>%
mutate(Supervisor_15 = 9-supervisor_15_R) %>%
filter(Supervisor_7 = 1) %>%
select(Code, starts_with("Super"), -Supervisor_7, -Supervisor_15_R)
pivot_wider(cols = -Code, names_to = "Item", values_to = "Response") %>%
group_by(Time2_Code) %>%
summarise(Mean_Supervisor_Support = mean(Score_corrected, na.rm = TRUE)) %>%
ungroup()
```
::: {.callout-note collapse="true" icon="false"}
## How many mistakes am I supposed to find?
There are 8 mistakes in the code.
:::
::: {.callout-caution collapse="true" icon="false"}
## Reveal solution
Did you spot all 8 mistakes? Let's go through them line by line.
```{r super_correct}
super <- data_prp %>% # spelling mistake in data object
mutate(Supervisor_15 = 8-Supervisor_15_R) %>% # semantic error: 8 minus response for a 7-point scale and supervisor_15_R needs a capital S
filter(Supervisor_7 == 1) %>% # needs a Boolean expression == instead of =
select(Code, starts_with("Super"), -Supervisor_7, -Supervisor_15_R) %>% # no pipe at the end, the rest is actually legit
pivot_longer(cols = -Code, names_to = "Item", values_to = "Response") %>% # pivot_longer instead of pivot_wider
group_by(Code) %>% # Code rather than Time2_Code - the reduced dataset does not contain Time2_Code
summarise(Mean_Supervisor_Support = mean(Response, na.rm = TRUE)) %>% # Score_corrected doesn't exist; needs to be Response
ungroup()
```
* Note that the **semantic error** in line 2 will not give you an error message.
* Were you thrown off by the `starts_with("Super")` expression in line 4? `starts_with()` and `ends_with()` are great alternatives to selecting columns via `:` But, using `select(Code, Supervisor_1:Supervisor_6, Supervisor_8:Supervisor_14)` would have given us the same result. *[I admit, that one was perhaps a bit mean]*
:::
## Activity 6: Join everything together with `???_join()`
Time to join all the relevant data files into a single dataframe, which will be used in the next chapters on data visualization. There are four ways to join data: `inner_join()`, `left_join()`, `right_join()`, and `full_join()`. Each function behaves differently in terms of what information is retained from the two data objects. Here is a quick overview:
::: {.callout-note icon="false"}
## Info on mutating joins
You have 4 types of join functions you could make use of. Click on the panels to know more
::: panel-tabset
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
## `inner_join()`
`inner_join()` returns only the rows where the values in the column specified in the `by =` statement match in both tables.
![inner_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/inner-join.gif)
## `left_join()`
`left_join()` retains the complete first (left) table and adds values from the second (right) table that have matching values in the column specified in the `by =` statement. Rows in the left table with no match in the right table will have missing values (`NA`) in the new columns.
![left_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/left-join.gif)
## `right_join()`
`right_join()` retains the complete second (right) table and adds values from the first (left) table that have matching values in the column specified in the `by =` statement. Rows in the right table with no match in the left table will have missing values (`NA`) in the new columns.
![right_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/right-join.gif)
## `full_join()`
`full_join()` returns all rows and all columns from both tables. `NA` values fill unmatched rows.
![full_join(): gif by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/){target="_blank"}](images/full-join.gif)
:::
:::
From our original `data_prp`, we need to select demographics data and all summarised questionnaire data from time point 2. Next, we will join this with all other aggregated datasets from time point 1 which are currently stored in separate data objects in the `Global Environment`.
While you may be familiar with `inner_join()` from last year, for this task, we want to retain all data from all the data objects. Therefore, we will use `full_join()`. Keep in mind, you can only join two data objects at a time, so the upcoming code chunk will involve a fair bit of piping and joining.
Note: Since I (Gaby) like my columns arranged in a meaningful way, I will use `select()` at the end to order them better.
```{r eval=FALSE}
data_prp_final <- data_prp %>%
select(Code:Plan_prereg, Pre_reg_group:Time2_Understanding_OS) %>%
full_join(qrp_t1) %>%
full_join(understanding_t1) %>%
full_join(sats_t1) %>%
full_join(super) %>%
select(Code:Plan_prereg, Pre_reg_group, SATS28_Affect_Time1_mean, SATS28_CognitiveCompetence_Time1_mean, SATS28_Value_Time1_mean, SATS28_Difficulty_Time1_mean, QRPs_Acceptance_Time1_mean, Time1_Understanding_OS, Other_OS_behav_2:Time2_Understanding_OS, Mean_Supervisor_Support)
```
::: {.callout-important icon="false"}
## No `by` argument in the code above?
Note how I didn't include a `by` argument in the code above. If you leave `by =` out, R will join the 2 data objects by **ALL** columns that have the same name.
**Special case 1: matching column names but different values**
If you want more control, you should include the `by` argument; for example, if both data objects include a column `age` but data was recorded at 2 different time points. In that case, the information from both `age` columns should be retained and the `by` argument would not include `age`.
**Special case 2: different column names but matching values**
Another special case presents when both data objects contain identical information but the variable names don't match. Let's say, both data objects contain gender information, but in one data object the variable is named `gender` and in the other one `gender_label`. In that case, your `by` argument needs to be modified as: `by = join_by(gender == gender_label)`.
More info on joins can be found [https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/](https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/){target="_blank"}
:::
And this is basically the dataset we need for @sec-dataviz and @sec-dataviz2.
## Activity 7: Knit and export
Knit the `.Rmd` file to ensure everything runs as expected. Once it does, export the data object `data_prp_final` as a csv for use in the @sec-dataviz. Name it something meaningful, something like `data_prp_for_ch4.csv`.
::: {.callout-caution collapse="true" icon="false"}
## Solution
```{r eval=FALSE}
write_csv(data_prp_final, "data_prp_for_ch4.csv")
```
:::
## [Pair-coding]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}
```{r reading in data for me, echo=FALSE, message=FALSE}
library(tidyverse)
dog_data_raw <- read_csv("data/dog_data_raw.csv")
```
We will once again be working with data from Binfet et al. (2021), which focuses on the randomised controlled trials data involving therapy dog interventions. Today, our goal is to calculate the average `Loneliness` score for each participant measured at time point 1 (pre-intervention) using the raw data file `dog_data_raw`. Currently, the data looks like this:
```{r echo=FALSE}
loneliness_brief <- dog_data_raw %>%
select(RID, starts_with("L1"))
head(loneliness_brief, n = 5)
```
But we want the data to look like this:
```{r echo=FALSE}
loneliness_tidy <- dog_data_raw %>%
select(RID, starts_with("L1")) %>%
pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>%
mutate(Score_corrected = case_when(
Names %in% c("L1_1", "L1_5", "L1_6", "L1_9", "L1_10", "L1_15", "L1_16", "L1_19", "L1_20") ~ 5-Response,
.default = Response
)) %>%
group_by(RID) %>%
summarise(Loneliness_pre = mean(Score_corrected, na.rm = TRUE)) %>%
ungroup()
head(loneliness_tidy, n = 5)
```
This task is a bit more challenging compared to last week's lab activity, as the Loneliness scale includes some reverse-coded items.
### Task 1: Open the R project for the lab {.unnumbered}
### Task 2: Open your `.Rmd` file from last week or create a new `.Rmd` file {.unnumbered}
You could continue the `.Rmd` file you used last week, or create a new `.Rmd`. If you need some guidance, have a look at @sec-rmd.
### Task 3: Load in the library and read in the data {.unnumbered}
The data should already be in your project folder. If you want a fresh copy, you can download the data again here: [data_pair_coding](data/data_pair_coding.zip "download").
We are using the package `tidyverse` today, and the datafile we should read in is `dog_data_raw.csv`.
::: {.callout-note collapse="true" icon="false"}
## Hint
```{r eval=FALSE}
# loading tidyverse into the library
library(???)
# reading in `dog_data_raw.csv`
dog_data_raw <- read_csv("???")
```
:::
### Task 4: Calculating the mean for `Loneliness_pre` {.unnumbered}
* **Step 1**: Select all relevant columns, such as the participant ID and all 20 items of the `Loneliness` questionnaire completed by participants before the intervention. Store this data in an object called `data_loneliness`.
::: {.callout-note collapse="true" icon="false"}
## Hint
Look at the codebook. Try to figure out
* the variable name of the column in which the participant id is stored, and
* which items relate to the Loneliness scale at Stage "pre"
::: {.callout-note collapse="true" icon="false"}
## More concrete hint
* the participant id column is called `RID`
* The Loneliness items at pre-intervention stage start with `L1_`
:::
:::
* **Step 2**: Pivot the data from wide format to long format so we can reverse-score and calculate the average score more easily (in step 3)
::: {.callout-note collapse="true" icon="false"}
## Hint
`pivot_`
We also need 3 arguments in that function:
* the columns we want to select (e.g., all the loneliness items),
* the name of the column in which the current column headings will be stored (e.g., "Qs"), and
* the name of the column that should store all the values (e.g., "Responses").
::: {.callout-note collapse="true" icon="false"}
## More concrete hint
```{r eval=FALSE}
pivot_longer(cols = ???, names_to = "???", values_to = "???")
```
:::
:::
* **Step 3**: Reverse-scoring
Identify the items on the `Loneliness` scale that are reverse-coded, and then reverse-score them accordingly.
::: {.callout-note collapse="true" icon="false"}
## Hint
We need to figure out:
* which are the items of the loneliness scale we need to reverse-score
* what is the measuring scale of loneliness so we can determine the new values
* which function to use to create a new column that has the corrected scores in it
* which one of the `case_` functions will get us there
::: {.callout-note collapse="true" icon="false"}
## More concrete hint
* The items to be reverse-coded items can be found in the codebook: L1_1, L1_5, L1_6, L1_9, L1_10, L1_15, L1_16, L1_19, L1_20
* the loneliness scale ranges from 1 to 4, so we need to replace 1 with 4, 2 with 3, 3 with 2, and 4 with 1
* the function to create a new column `mutate()`
* it's a conditional statement rather than "just" replacing values, hence we need `case_when()`
```{r eval=FALSE}
mutate(Score_corrected = case_when(
??? ~ ???,
.default = ???
))
```
:::
:::
* **Step 4**: Calculate the average Loneliness score per participant. To match with the table above, we want to call this column `Loneliness_pre`
::: {.callout-note collapse="true" icon="false"}
## Hint
grouping and summarising
::: {.callout-note collapse="true" icon="false"}
## More concrete hint
```{r eval=FALSE}
group_by(???) %>%
summarise(Loneliness_pre = ???(???)) %>%
ungroup()
```
:::
:::
::: {.callout-caution collapse="true" icon="false"}
## Solution
```{r eval=FALSE}
# loading tidyverse into the library
library(tidyverse)
# reading in `dog_data_raw.csv`
dog_data_raw <- read_csv("dog_data_raw.csv")
# Task 4: Tidying
loneliness_tidy <- dog_data_raw %>%
# Step 1
select(RID, starts_with("L1")) %>% # select(RID, L1_1:L1_20) also works
# Step 2
pivot_longer(cols = -RID, names_to = "Qs", values_to = "Response") %>%
# Step 3
mutate(Score_corrected = case_when(
Qs %in% c("L1_1", "L1_5", "L1_6", "L1_9", "L1_10", "L1_15", "L1_16", "L1_19", "L1_20") ~ 5-Response,
.default = Response
)) %>%
# Step 4
group_by(RID) %>%
summarise(Loneliness_pre = mean(Score_corrected, na.rm = TRUE)) %>%
ungroup()
```
:::
## [Test your knowledge and challenge yourself]{style="color: #F39C12; text-transform: uppercase;"} {.unnumbered}
### Knowledge check {.unnumbered}
#### Question 1 {.unnumbered}
When using `mutate()`, which additional function could you use to recode an existing variable? `r mcq(c(x = "arrange()", x = "filter()", answer = "case_match()", x = "case_when()"))`
#### Question 2 {.unnumbered}
When using `mutate()`, which additional function could you use to create a new variable based on one or multiple conditional statements? `r mcq(c(x = "arrange()", x = "filter()", x = "case_match()", answer = "case_when()"))`
#### Question 3 {.unnumbered}
Which of the following functions would you use if you wanted to join two data sets by their shared identifier? `r mcq(c(answer = "inner_join()", x = "left_join()", x = "right_join()", x = "full_join()"))`
#### Question 4 {.unnumbered}
Your data object contains a column `Score` with numbers, but they have been read in incorrectly as a character datatype. Which of the following functions would *not* work for fixing this issue? `r mcq(c(x = "parse_number()", answer = "factor(Score)", x = "mutate(Score = as.numeric(Score))", x = "as.numeric()"))`
::: {.callout-caution collapse="true" icon="false"}
## Explain this answer
* `parse_number()` from the `readr` package extracts numeric values from strings, so this would work.
* `factor(Score)`: This would *not* work as expected because it converts the column into a factor, not a numeric datatype, leading to incorrect results if numeric operations are needed.
* `mutate(Score = as.numeric(Score))`: This would work too because `mutate()` can be used in combination with `as.numeric()` to create a new numeric column or override the existing character column.
* `as.numeric()`: This would also work to convert a character column to numeric. Without mutate, you could use it in a BaseR way, e.g., `data$Score <- as.numeric(data$Score)` (*shudder, BaseR!!! But effective*)
:::
### Challenge yourself {.unnumbered}
If you want to **challenge yourself** and further apply the skills from Chapter 3, you could wrangle the data from `dog_data_raw` for one of the other questionnaires. There are plenty of options to choose from:
::: {.callout-tip collapse="true" icon="false"}
## Difficulty level: easy
* recode column `Live_Pets` so the values read yes and no rather than 1 and 2
* recode `Year_of_Study` so they have the labels from the codebook rather than the numbers
* reverse-code the `Homesickness` scale for `_pre` and `_post`
* renaming the columns of the other one-item scales as `Stress_pre`, `Stress_post`, `Engagement_pre` and `Engagement_post`
Any of these tasks should be doable in one step. No need to select or pivot anything. You could just modify `dog_data_raw`.
::: {.callout-note collapse="true" icon="false"}
## Hints
* For the **recoding tasks**, you need to work out which function to use to recode one value as another - just plain replacing, no conditional statements
* The **reverse-coding** might sound daunting to do in one step, but it is only a single value that needs to be recoded. Take some inspiration from Activity 5 (error mode).
* For the **renaming tasks**, check how you would change column names without reducing the number of columns overall
:::
::: {.callout-caution collapse="true" icon="false"}
## Solution for **Challenge yourself - easy**
```{r eval=FALSE}
## Live_Pets
dog_data_raw <- dog_data_raw %>%
mutate(Live_Pets = case_match(Live_Pets,
1 ~ "yes",
2 ~ "no"))
```
```{r eval=FALSE}
## Year of Study
dog_data_raw <- dog_data_raw %>%
mutate(Year_of_Study = case_match(Year_of_Study,
1 ~ "First",
2 ~ "Second",
3 ~ "Third",
4 ~ "Fourth",
5 ~ "Fifth or above"))
```
```{r eval=FALSE}
## Reverse-coding of homesickness pre and post. It's a 5-point scale, hence you'd calculate 6-the original response column
dog_data_raw <- dog_data_raw %>%
mutate(Homesick_pre = 6-HO1_1,
Homesick_post = 6-HO2_1)
```
```{r eval=FALSE}
## Renaming of Stress and Engagement
dog_data_raw <- dog_data_raw %>%
rename(Stress_pre = S1_1, Stress_post = S2_1, Engagement_pre = HO1_2, Engagement_post = HO2_2)
```
:::
:::
::: {.callout-warning collapse="true" icon="false"}
## Difficulty level: medium
* reverse-code the Social connectedness scale (pre-intervention) and compute a mean score per participant
::: {.callout-note collapse="true" icon="false"}
## Hints
This task would take 4 steps to complete. These are the exact same steps we applied to `Loneliness_pre` in the lab activity. You would just need to figure out which items are related to the Social connectedness scale (pre-intervention) and which ones of those are reverse-coded. The codebook has all the answers.
:::
::: {.callout-caution collapse="true" icon="false"}
## Solution for **Challenge yourself - medium**
```{r eval=FALSE}
## SCS pre
scs_pre <- dog_data_raw %>%
select(RID, starts_with("SC1")) %>%
pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>%
mutate(Score_corrected = case_when(
Names %in% c("SC1_3", "SC1_6", "SC1_7", "SC1_9", "SC1_11", "SC1_13", "SC1_15", "SC1_17", "SC1_18", "SC1_20") ~ 7-Response,
.default = Response
)) %>%
group_by(RID) %>%
summarise(SCS_pre = mean(Score_corrected, na.rm = TRUE)) %>%
ungroup()
```
:::
:::
::: {.callout-caution collapse="true" icon="false"}
## Difficulty level: hard
* reverse-code the Loneliness scale (post-intervention) and compute a mean score per participant
* reverse-code the Social connectedness scale (post-intervention) and compute a mean score per participant
Both activities are similar to Activity 3 from the individual walkthrough and would take about 5 steps to complete. **Start by mapping out the steps**.
::: {.callout-note collapse="true" icon="false"}
## Hints
* **Step 1**: Select all relevant columns, such as participant ID and all the items that belong to the questionnaire that participants completed after the intervention
* **Step 2**: Pivot the data from wide format to long format so we can reverse-score and calculate the average score more easily
* **Step 3**: Recode the initial responses so that the new column has numbers instead of labels