-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_tidyverse.Rmd
667 lines (487 loc) · 21 KB
/
02_tidyverse.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
---
title: "Intro to R 02: Tidyverse"
author: "Kyla McConnell"
output: html_document
---
### Warm-up
1. Assign the following variables:
```{r}
my_name <- "Kyla"
my_age <- 28
old <- my_age > 50
```
2. What are the appropriate data types for these variables? Check if they are saved as the type you would expect.
```{r}
class(my_age)
class(my_name)
class(old)
```
3. Assign the friends variable a vector including some friends names. Before running the second line of this code, anticipate what you think it will produce. Were you right?
```{r}
friends <- c("Jennie", "Max")
nchar(friends)
```
4. Imagine you're running a lemonade stand with two friends. The array "total_sales" contains your daily sales amounts for every day in the past 5 days. Divide the totals by 3 and see how much each person takes home per day.
```{r}
total_sales <- c(26, 43, 32, 47, 12)
total_sales / 3
```
5. From your total_sales array, extract the total sales on day 3 using an index (Hint: [])
```{r}
total_sales[5]
```
### Installing packages
- From the console with `install.packages("-PACKAGENAME-")` or through the Packages pane
- Calling packages with `library()` near the top of the script
1. Try the following command:
```{r}
beep(2)
```
What error message do you get?
Now install the package "beepr" and call the package to this session with a library call somewhere above the beep command (line above is enough.) Try running the command again (make sure your speakers are on -- but not too loud :D)
2. Install and -load- the "cowsay" package and try the following command:
```{r}
library(cowsay)
say(what = "Good luck learning R!", by = "chicken")
```
# Welcome to the tidyverse
Welcome to the tidyverse! Tidyverse is a package -- really, a whole collection of packages -- that include a LOT of useful functions for all sorts of data analyzing, cleaning and "wrangling".
To start, make sure you have tidyverse installed. You can do this either through the panel at the lower left hand corner (Packages tab) or by typing into the Console `install.packages("tidyverse")`. Then, call the tidyverse with a `library()` call and let's get started!
```{r}
library(tidyverse)
```
Tidyverse loads multiple key packages:
- dplyr -> for all sorts of data transformation and wrangling
- ggplot2 -> the best plotting package in R (and ever??)
- readr & tibble -> for reading in files to tibble format (improvements over R base data.frames)
- stringr -> for text transformations (removing trailing whitespaces, etc.)
- magrittr -> for pipes, more on this below
It also loads purr (for vectorized programming), forcats (improvements to factors), readxl (for reading Excel documents), lubridate (for working with dates/times) and more.
### The tibble
We already know one command set from the tidyverse: `read_csv()` and `read_tsv()`. These are used to read in data, and they return a structure called the "tibble".
The tibble is the "tidyverse" version of the base-R alternative, the "data.frame".
In practice, "data.frame"s and "tibble"s are very similar, however:
- tibbles print a few rows as default
- tibbles are made to work with the tidyverse calls to come
- tibbles don't automatically read in text columns as factors (sounds like a disadvantage but is often an advantage, more on this later)
- tibbles doesn't restrict or change column names -- if they are invalid (have spaces, start with whitespace), they can still be called, you just have to surround them with backticks ``
That's probably enough info for now, but if you want to read more about tibbles as you get more comfortable with R, here is a chapter in R for Data Science:
https://r4ds.had.co.nz/tibbles.html
Load in the following dataset, which contains IKEA furniture items in Saudi Arabia and their prices (in Saudi Riyals)
```{r}
ikea <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv')
```
See here that you can also load in dataframes from links!
This dataset has some NAs so I'll quickly remove them with `drop_na` so they don't cause problems later:
```{r}
ikea <- drop_na(ikea)
```
Note: I do this because it's a dataset I got online and I'm not doing any sophisticated analysis on it below. If this were your own data, you should think about what the NAs represent and if you can replace them in another way. `drop_na()` will drop the WHOLE ROW if there is an NA in any of the columns.
### The pipe %>%
One of the most noticeable features of the tidyverse is the pipe %>% (keyboard shortcut: Ctr/Cmd + Shift + M)
The pipe takes the item before it and feeds it to the following command as the first argument. Since all tidyverse (and some non-tidyverse) functions take the dataframe as the first function, this can be used to string together multiple functions.
First, take a look at the dataset. You can do this with `head(ikea)` or try out using the pipe.
```{r}
head(ikea)
ikea %>%
head()
```
You see that this produces the exact same output as `head(ikea)`. Why would this be useful?
Compare the following lines of pseudocode, which would produce the same output:
A.
bop(
scoop(
hop(foo_foo, through = forest),
up = field_mice
),
on = head
)
B.
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mice) %>%
bop(on = head)
You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe!
Here are some more examples:
```{r}
# Equivalent to summary(ikea)
ikea %>%
summary()
# Equivalent to colnames(ikea), which returns all column names
colnames(ikea)
ikea %>%
colnames()
# Equivalent to nrow(ikea), which returns the number of rows in the df
nrow(ikea)
ikea %>%
nrow()
```
You can also call further arguments to a function, together with the pipe
```{r}
# Equivalent to tail(ikea, n=2)
ikea %>%
tail(n = 2)
# Equivalent to head(ikea, n=10)
ikea %>%
head(n=10)
```
And you can also stack commands by ending each row (except the last one) with a pipe:
```{r}
ikea %>%
colnames() %>% #extracts column names
toupper() #converts them to uppercase
ikea %>%
colnames() %>% #extracts column names
nchar() #counts the number of letters
```
### Select one column
We learned before that the tradition syntax for dealing with columns is dataframe$column.
A useful step in using pipes and tidyverse calls is the ability to *select* specific columns. That is, instead of writing `ikea$category` we can write:
```{r}
ikea %>%
select(category)
```
We can then use this column for further calculations, like piping it on to the summary call. This will provide the same result as `summary(ikea$price)`
```{r}
ikea %>%
select(price) %>%
summary()
ikea %>%
select(price) %>%
max()
ikea %>%
select(height) %>%
min()
```
We could, for example, look at the tallest item:
```{r}
ikea %>%
select(height) %>%
max()
```
Or the thinnest:
```{r}
ikea %>%
select(width) %>%
min()
```
### Select multiple columns
You can also use `select()` to take multiple columns.
```{r}
ikea %>%
select(name, price, category)
```
You can see that these columns are presented in the order you gave them to the select call, too:
```{r}
ikea %>%
select(category, price, name)
```
You can also use this to reorder columns in that you give the name of the column(s) you want first, then finish with `everything()` to have all other columns follow:
```{r}
ikea %>%
select(price, everything())
```
You can also pipe these multiple columns into a call, only if it is one that can work with multiple columns.
```{r}
ikea %>%
select(depth, width) %>%
max() #Returns the largest number in either of the selected columns
```
#### Try it out
1. Load in the spr_example dataset. You may have to update the path to match where you have saved the file -- if it's in the same folder, it should work with just the simple read-in below.
```{r}
spr <- read_csv("data/spr_example.csv")
```
2. Pipe the 'spr' dataset variable to `head()` and take a look at the first six lines. Try also `colnames()` What are the column names?
```{r}
spr %>%
head()
head(spr)
spr %>%
colnames()
```
3. Start with the 'spr' dataset, select the 'RT' column and pipe it to the `max()` command. What is the maximum response time?
```{r}
spr %>%
select(RT) %>%
min()
max(spr$RT)
mean(spr$RT)
```
4. Again select the 'RT' column and pipe it to the `range()` column. What is the range of response times?
```{r}
spr %>%
select(RT) %>%
range()
```
### Preview vs. saving
Above, we have used `select()` to look at the output in the output pane of R-Markdown. However, if you look at the ikea dataframe, for example in the Environment panel on the upper-right, the dataframe hasn't changed.
However, you can also save your changes by assigning your call back to the variable name.
For example, if we wanted to put the name column before all columns, we could call `select(name, everything())` where everything() automatically fills in the names of all other columns, so that we don't have to type them.
```{r}
ikea %>%
select(name, everything())
```
If we are happy with this change, we can save it by assigning the output back to the variable "ikea":
```{r}
ikea <- ikea %>%
select(name, everything())
```
There is no output here, but the ikea dataframe has been permanently updated (within the R session, not in your file system)
There is also a trick to show a preview and save it to the variable -- wrapping the whole call in parentheses. However, the best workflow is usually to try out your call without saving, check out the preview, then re-run the call but add the saving to the variable when you are happy with the output.
```{r}
(ikea <- ikea %>%
select(name, everything()))
```
If you make a mistake, you can always go back to where you've read in the code and start over. The arrow with a line under it in the code block of R-Markdown is really handy for this, because it won't run the current block -- thus resetting the slate for you to re-do your most recent code.
### Remove columns with select
You can also remove columns using select if you use the minus sign. For example, here, we have the first column ("X1") which is a sort of row numbering. If you don't want this column, you can drop it with select:
```{r}
ikea %>%
select(-X1)
```
You can also remove multiple columns at once by writing them in an array `c()`. Once your preview looks the way you want it --just make sure to save your results by committing it to a variable (over the old one is fine).
```{r}
ikea <- ikea %>%
select(-c(X1, old_price))
```
Remember that you also automatically drop any columns not called directly in a `select()` call.
### Rename columns
You can rename columns with the `rename()` function. The syntax is new_name = old_name.
```{r}
ikea %>%
rename(price_sar = price)
```
You can also rename multiple columns at once (no need for an array here):
```{r}
ikea <- ikea %>%
rename(price_sar = price,
description = short_description)
```
Notice above that I've saved output over the ikea dataframe to make the changes 'permanent'.
### Arrange items in a column
Let's say we want to sort the items by price quickly, to get an idea for the most and least expensive items. For this, we can use `arrange()`
By default, this shows the items from lowest to highest:
```{r}
ikea %>%
arrange(price_sar)
```
Or in alphabetical order if the column is a character type:
```{r}
ikea %>%
arrange(name)
```
But you can also arrange the prices from highest to lowest by wrapping the column name in `desc()`
```{r}
ikea %>%
arrange(desc(price_sar))
```
```{r}
ikea %>%
arrange(desc(name))
```
```{r}
ikea %>%
arrange(height, width, depth)
```
#### Try it out
1. From the spr dataset, drop the "item_type" column and save the results back to the variable.
```{r}
spr <- spr %>%
select(-item_type)
```
2. Rename the column "full_sentence" in your spr dataframe as "sentence" (make sure to save this too.)
```{r}
spr <- spr %>%
rename(sentence = full_sentence)
```
3. Arrange the RT column from longest to shortest response time. (Don't have to save this part, just take a look.)
```{r}
spr %>%
arrange(desc(RT))
```
### Add new columns with mutate
Let's try to make a column where the prices are in euro. We can call this column price_eur. When I checked, 1 Saudi Riyal was equal to 0.22 euro. So we can multiply the price_sar column by 0.22
```{r}
ikea %>%
mutate(threes = 3)
ikea <- ikea %>%
mutate(price_eur = price_sar * 0.22)
```
Now, there's a new column called price_eur (it's at the very end by default)
You can also save the new column with the same name, and this will update all the items in that column (see below, where I add 100 to every price, but note that I don't save the output)
```{r}
ikea %>%
mutate(price_sar = price_sar + 100)
```
You can even create many new columns at the same time. Here, dropping each new column onto a new line is considered new style (but not necessary!)
For example, if we wanted to represent dimensions as meters instead of cm:
```{r}
ikea %>%
mutate(depth_m = depth/100,
height_m = height/100,
width_m = width/100)
ikea %>%
mutate(depth_m = depth/100) %>%
mutate(height_m = height/100)
```
You can also do operations to character columns; for example,
```{r}
ikea %>%
mutate(name = tolower(name))
```
### Change data type in a column
- Tibbles read in all text columns as character -- if you want something to be a factor, you have to set this explicitly.
- Factors represent groups (participant, conditions, etc.)
- When R acknowledges something as a factor, it can see that there are repeated groupings, i.e. that c("US", "US", "UK") has two instances of the grouping "US" and one of the grouping "UK".
- Let's see what happens when we use call summary on a character column:
```{r}
class(ikea$category)
ikea %>%
select(category) %>%
summary()
```
It tells us just that there are 3694 character rows.
In reality, category is a grouping that repeats itself. So let's set it to a factor (and save it!):
```{r}
ikea <- ikea %>%
mutate(category = as.factor(category))
ikea$category <- as.factor(ikea$category)
class(ikea$category)
```
Now call summary on the category column:
```{r}
ikea %>%
select(category) %>%
summary()
```
Now it tells us how many items of each category are present in the dataframe! Factors also have other benefits / allow R to work intuitively with categories -- so converting to factor will enable R to work with this type of column as it should.
Which other columns in this dataframe should be factors?
```{r}
ikea <- ikea %>%
mutate(other_colors <- as.factor(other_colors),
designer <- as.factor(designer))
ikea$designer <- as.factor(designer)
summary(ikea$designer)
```
Things like a description, or the item names (if they don't repeat much), don't have to be factors but are more intuitive as character columns, since they don't represent groupings.
Check out the changes in the environment panel too.
Some calls you can use on factor column include:
```{r}
levels(ikea$category) #Shows which unique groupings exist
levels(ikea$other_colors)
```
There is also another syntax that does the same job for replacing columns with altered copies. You may like this better -- either works fine!
ikea$category <- as.factor(ikea.category)
If any other columns in your dataframe are read in wrong (for example, if you have a numeric column that looks like: "43", "18" and is being read as a character column) you can convert them with similar syntax:
`as.numeric()`, `as.character()` etc.
#### Try it out
1. Call summary on the "participant" column of the spr dataframe -- you can do this either the "tidyverse way" with `select()` or with the traditional `$` syntax. What does it show?
```{r}
spr_data <- read_csv("../data/spr_example.csv")
```
2. Call `class()` on the same column to get the data type.
3. Now, try changing the column to a factor using `as.factor()`. Remember that you'll have to save this over top of the old variable.
- Change a character column into a factor using `as.factor()`
- save over the original column
4. What do `class()` and `summary()` return now?
5. Are there any other columns in the spr dataframe that you think should be represented as factors?
### Filter based on a condition
You can also select all items that fit a certain condition. For example, you can use:
- equals to: ==
- not equal to: !=
- greater than: >
- greater than or equal to: >=
- less than: <
- less than or equal to: <=
- in (i.e. in an array): %in%
```{r}
ikea %>%
filter(price_sar > 8500)
```
You can also select items with a specific price:
```{r}
ikea %>%
filter(price_sar == 265)
```
Or you can use it to select all items in a given category. Notice here that category is a character column, so you have to use quotation marks to show you're matching a character string.
Look at the error below:
```{r}
ikea %>%
filter(category == Beds)
```
The correct syntax is: (because you're matching to a string)
```{r}
ikea %>%
filter(category == "Beds")
```
To use %in%, give an array of options (formatted in the correct way based on whether the column is a character or numeric):
```{r}
ikea %>%
filter(name %in% c("BRIMNES", "BILLY", "KALLAX"))
```
Note that filter is case-sensitive, so capitalization matters.
#### Try it out
1. Reading times lower than 150ms are pretty unusual and are sometimes removed from self-paced reading data. You decide that you want to remove these response times from your data. Filter out all RTs below 150ms from the spr dataset and save your result.
2. You want to take a look at the results from Par_C only: Filter these data points but don't save the result.
3. You're not interested in the practice items in the spr dataframe. Filter the dataset to include everything BUT the practice items and save your result.
### Summary tables with groupings
To look at summary statistics for specific groupings, we have to use a two- (more like three-) step process.
First, group by your grouping variable using `group_by()`
Then, summarize, which creates a column based on a transformation to another column, using `summarize()` or `summarise()`
Finally, ungroup (so that R forgets that this is a grouping and carries on as normal, with `ungroup()`
- usually this makes no difference to what you see but is an important failsafe
You can use multiple different operations in the summarize part, including:
- mean(col_name)
- median(col_name)
- max(col_name)
- min(col_name)
For example, to return the average price in each category:
```{r}
ikea %>%
group_by(category) %>%
summarize(mean(price_eur)) %>%
ungroup()
```
You can also give a name to your new summary column, and round the output:
```{r}
ikea %>%
group_by(category) %>%
summarize(avg_price = round(mean(price_eur))) %>%
ungroup()
```
For categorical columns, you can also count how many rows are in each category using `count()` instead of `summarize()`
```{r}
ikea %>%
group_by(category) %>%
count() %>%
ungroup()
```
Similarly, you can use distint() to return the unique items in the category, for example:
```{r}
ikea %>%
group_by(category) %>%
distinct(name)
```
You can call `distinct()` without an argument, which will make sure that all columns have only unique items, or you can call it on a specific column, which will return only that column and all the unique items in it.
#### Try it out
1. Starting with the spr dataset, group by participant and use summarize to find the max RT for each participant.
2. With the same dataset, group by condition and return the mean RT for each condition. This is already a first step towards statistical analysis of your data!
3. Summarize based on condition and count how many full sentences are in each condition. (Here you will realize that this example dataset doesn't make a ton of sense :D)
### Adv: Making new columns with ifelse conditions
Now for something fancy. You can also make new columns based on "if" conditions using the call `ifelse()`. The syntax of ifelse is: ifelse(this_is_true, this_happens, else_this_happens). For example:
```{r}
ikea %>%
mutate(price_categorical = ifelse(price_sar > 2000, "expensive", "not expensive")) %>%
select(name, price_sar, price_categorical)
```
You can also use this to condense the groups in a certain column and then save it over that column. For example, say we want to combine the categorys "Beds", "Wardrobes", and "Chests of drawers & drawer units" into one category, "Bedroom":
```{r}
ikea %>%
mutate(category = ifelse(category %in% c("Beds", "Wardrobes","Chests of drawers & drawer units"), "Bedroom", category))
```
Notice here that if we want to affect the entry only if the condition is TRUE, leaving it unchanged if it is false, we leave the column name as the else condition. This tells R to take the value that is currently in that column if it finds the condition to be FALSE.
If-else logic (especially in a mutate statement) can be a bit tricky, but it is also extremely useful, so it is worth taking the time to get used to.
### Reading
For reading on these topics, check out Chapter 5 of R for Data Science, found here: https://r4ds.had.co.nz/transform.html
Note that the author hasn't introduced the pipe yet in the chapter, so the commands are shown without it, but they work both ways!