-
Notifications
You must be signed in to change notification settings - Fork 42
/
09-data-table.qmd
306 lines (201 loc) · 8.94 KB
/
09-data-table.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
# data.table
The __data.table__ package is very powerful and the best package for dealing with large datasets in R. In this course we use the __tidyverse__ because it is easier for begginers. But if you plan to analyze large datasets in the future we highly recommend learning this package.
[This guide](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) is an excellent place to start.
[This blog post](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) is really useful for those that already know tidyverse.
Here we just show some examples.
## Manipulating data tables
`data.table` is a separate package that needs to be installed. Once installed, we then need to load it:
```{r, message=FALSE, warning=FALSE}
library(data.table)
```
We will provide example code showing the **data.table** approaches to **dplyr**'s `mutate`, `filter`, `select`, `group_by`, and `summarize` shown in the tidyverse chapter. As in that chapter, we will use the `murders` dataset:
```{r, echo=FALSE}
library(dslabs)
```
The first step when using **data.table** is to convert the data frame into a `data.table` object using the `as.data.table` function:
```{r}
murders_dt <- as.data.table(murders)
```
Without this initial step, most of the approaches shown below will not work.
### Selecting
Selecting with **data.table** is done in a similar way to subsetting matrices. While with **dplyr** we write:
```{r, eval=FALSE}
select(murders, state, region)
```
In **data.table**, we use notation similar to what is used with matrices:
```{r}
murders_dt[, c("state", "region")] |> head()
```
We can also use the `.()` **data.table** notation to alert R that variables inside the parenthesis are column names, not objects in the R environment. So the above can also be written like this:
```{r}
murders_dt[, .(state, region)] |> head()
```
### Adding a column or changing columns
We learned to use the **dplyr** `mutate` function with this example:
```{r, eval=FALSE}
murders <- mutate(murders, rate = total / population * 100000)
```
**data.table** uses an approach that avoids a new assignment (update by reference). This can help with large datasets that take up most of your computer's memory. The **data.table** :=\` function permits us to do this:
```{r, message=FALSE}
murders_dt[, rate := total / population * 100000]
```
This adds a new column, `rate`, to the table. Notice that, as in **dplyr**, we used `total` and `population` without quotes.
We can see that the new column is added:
```{r}
head(murders_dt)
```
To define new multiple columns, we can use the `:=` function with multiple arguments:
```{r, message=FALSE}
murders_dt[, ":="(rate = total / population * 100000, rank = rank(population))]
```
### Technical detail: reference versus copy
The **data.table** package is designed to avoid wasting memory. So if you make a copy of a table, like this:
```{r}
x <- data.table(a = 1)
y <- x
```
`y` is actually referencing `x`, it is not an new opject: it's just another name for `x`. Until you change `y`, a new object will not be made. However, the `:=` function changes *by reference* so if you change `x`, a new object is not made and `y` continues to be just another name for `x`:
```{r}
x[,a := 2]
y
```
You can also change `x` like this:
```{r}
y[,a := 1]
x
```
To avoid this, you can use the `copy` function which forces the creation of an actual copy:
```{r}
x <- data.table(a = 1)
y <- copy(x)
x[,a := 2]
y
```
Note that the function `as.data.table` creates a copy of the data frame being converted. However, if working with a large data frames it is helpful to avoid this, and you can do this by using `setDT`.
```{r}
x <- data.frame(a = 1)
setDT(x)
```
However, note that because no copy is being made, be aware that the following code does not create a new object:
```{r}
x <- data.frame(a = 1)
y <- setDT(x)
```
The objects `x` and `y` are referencing the same data table:
```{r}
x[,a := 2]
y
```
### Subsetting
With **dplyr**, we filtered like this:
```{r, eval=FALSE}
filter(murders, rate <= 0.7)
```
With **data.table**, we again use an approach similar to subsetting matrices, except **data.table** knows that `rate` refers to a column name and not an object in the R environment:
```{r}
murders_dt[rate <= 0.7]
```
Notice that we can combine the filter and select into one succint command. Here are the state names and rates for those with rates below 0.7.
```{r}
murders_dt[rate <= 0.7, .(state, rate)]
```
Compare to the **dplyr** approach:
```{r, eval=FALSE}
murders |> filter(rate <= 0.7) |> select(state, rate)
```
## Summarizing data
As an example, we will use the `heights` dataset:
```{r}
library(dplyr)
library(dslabs)
heights_dt <- as.data.table(heights)
```
In **data.table**, we can call functions inside `.()` and they will be applied to rows. So the equivalent of:
```{r}
s <- heights |>
summarize(average = mean(height), standard_deviation = sd(height))
```
in **dplyr** is the following in **data.table**:
```{r}
s <- heights_dt[, .(average = mean(height), standard_deviation = sd(height))]
```
Note that this permits a compact way of subsetting and then summarizing. Instead of:
```{r, eval=FALSE}
s <- heights |>
filter(sex == "Female") |>
summarize(average = mean(height), standard_deviation = sd(height))
```
we can write:
```{r}
s <- heights_dt[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]
```
### Multiple summaries
We previously defined the function:
```{r}
median_min_max <- function(x){
qs <- quantile(x, c(0.5, 0, 1))
data.frame(median = qs[1], minimum = qs[2], maximum = qs[3])
}
```
Similar to **dplyr**, we can call this function within `.()` to obtain the three number summary:
```{r}
heights_dt[, .(median_min_max(height))]
```
### Group then summarize
The `group_by` followed by `summarize` in **dplyr** is performed in one line in **data.table**. We simply add the `by` argument to split the data into groups based on the values in categorical variable:
```{r}
heights_dt[, .(average = mean(height), standard_deviation = sd(height)), by = sex]
```
## Sorting data frames
We can order rows using the same approach we use for filter. Here are the states ordered by murder rate:
```{r}
murders_dt[order(population)] |> head()
```
N To sort the table in descending order, we can order by the negative of `population` or use the `decreasing` argument:
```{r, eval=FALSE}
murders_dt[order(population, decreasing = TRUE)]
```
### Nested sorting
Similarly, we can perform nested ordering by including more than one variable in order
```{r, eval=FALSE}
murders_dt[order(region, rate)]
```
## Optional exercises (will not be included in the midterms)
(@) Load the **data.table** package and the `murders` dataset and convert it to `data.table` object:
```{r, eval=FALSE}
library(data.table)
library(dslabs)
murders_dt <- as.data.table(murders)
```
Remember you can add columns like this:
```{r, eval=FALSE}
murders_dt[, population_in_millions := population / 10^6]
```
Add a `murders` column named `rate` with the per 100,000 murder rate as in the example code above.
(@) Add a column `rank` containing the rank, from highest to lowest murder rate.
(@) If we want to only show the states and population sizes, we can use:
```{r, eval=FALSE}
murders_dt[, .(state, population)]
```
Show the state names and abbreviations in `murders`.
(@) You can show just the New York row like this:
```{r, eval=FALSE}
murders_dt[state == "New York"]
```
You can use other logical vectors to filter rows.
Show the top 5 states with the highest murder rates. After we add murder rate and rank, do not change the `murders` dataset, just show the result. Remember that you can filter based on the `rank` column.
(@) We can remove rows using the `!=` operator. For example, to remove Florida, we would do this:
```{r, eval=FALSE}
no_florida <- murders_dt[state != "Florida"]
```
Create a new data frame called `no_south` that removes states from the South region. How many states are in this category? You can use the function `nrow` for this.
(@) We can also use `%in%` to filter. You can therefore see the data from New York and Texas as follows:
```{r, eval=FALSE}
murders_dt[state %in% c("New York", "Texas")]
```
Create a new data frame called `murders_nw` with only the states from the Northeast and the West. How many states are in this category?
(@) Suppose you want to live in the Northeast or West **and** want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with `filter`. Here is an example in which we filter to keep only small states in the Northeast region.
```{r, eval=FALSE}
murders_dt[population < 5000000 & region == "Northeast"]
```
Make sure `murders` has been defined with `rate` and `rank` and still has all states. Create a table called `my_states` that contains rows for states satisfying both the conditions: they are in the Northeast or West and the murder rate is less than 1. Show only the state name, the rate, and the rank.