09-data-table.qmd

# data.table

The __data.table__ package is very powerful and the best package for dealing with large datasets in R. In this course we use the __tidyverse__ because it is easier for begginers. But if you plan to analyze large datasets in the future we highly recommend learning this package.

[This guide](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) is an excellent place to start.

[This blog post](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) is really useful for those that already know tidyverse.
Here we just show some examples.

## Manipulating data tables

`data.table` is a separate package that needs to be installed. Once installed, we then need to load it:

```{r, message=FALSE, warning=FALSE}
library(data.table)
```

We will provide example code showing the **data.table** approaches to **dplyr**'s `mutate`, `filter`, `select`, `group_by`, and `summarize` shown in the tidyverse chapter. As in that chapter, we will use the `murders` dataset:

```{r, echo=FALSE}
library(dslabs)
```

The first step when using **data.table** is to convert the data frame into a `data.table` object using the `as.data.table` function:

```{r}
murders_dt <- as.data.table(murders)
```

Without this initial step, most of the approaches shown below will not work.

### Selecting

Selecting with **data.table** is done in a similar way to subsetting matrices. While with **dplyr** we write:

```{r, eval=FALSE}
select(murders, state, region)
```

In **data.table**, we use notation similar to what is used with matrices:

```{r}
murders_dt[, c("state", "region")] |> head()
```

We can also use the `.()` **data.table** notation to alert R that variables inside the parenthesis are column names, not objects in the R environment. So the above can also be written like this:

```{r}
murders_dt[, .(state, region)] |> head()
```

### Adding a column or changing columns

We learned to use the **dplyr** `mutate` function with this example:

```{r, eval=FALSE}
murders <- mutate(murders, rate = total / population * 100000)
```

**data.table** uses an approach that avoids a new assignment (update by reference). This can help with large datasets that take up most of your computer's memory. The **data.table** :=\` function permits us to do this:

```{r, message=FALSE}
murders_dt[, rate := total / population * 100000]
```

This adds a new column, `rate`, to the table. Notice that, as in **dplyr**, we used `total` and `population` without quotes.

We can see that the new column is added:

```{r}
head(murders_dt)
```

To define new multiple columns, we can use the `:=` function with multiple arguments:

```{r, message=FALSE}
murders_dt[, ":="(rate = total / population * 100000, rank = rank(population))]
```

### Technical detail: reference versus copy

The **data.table** package is designed to avoid wasting memory. So if you make a copy of a table, like this:

```{r}
x <- data.table(a = 1)
y <- x
```

`y` is actually referencing `x`, it is not an new opject: it's just another name for `x`. Until you change `y`, a new object will not be made. However, the `:=` function changes *by reference* so if you change `x`, a new object is not made and `y` continues to be just another name for `x`:

```{r}
x[,a := 2]
y
```

You can also change `x` like this:

```{r}
y[,a := 1]
x
```

To avoid this, you can use the `copy` function which forces the creation of an actual copy:

```{r}
x <- data.table(a = 1)
y <- copy(x)
x[,a := 2]
y
```

Note that the function `as.data.table` creates a copy of the data frame being converted. However, if working with a large data frames it is helpful to avoid this, and you can do this by using `setDT`.

```{r}
x <- data.frame(a = 1)
setDT(x)
```

However, note that because no copy is being made, be aware that the following code does not create a new object:

```{r}
x <- data.frame(a = 1)
y <- setDT(x)
```

The objects `x` and `y` are referencing the same data table:

```{r}
x[,a := 2]
y
```


### Subsetting

With **dplyr**, we filtered like this:

```{r, eval=FALSE}
filter(murders, rate <= 0.7)
```

With **data.table**, we again use an approach similar to subsetting matrices, except **data.table** knows that `rate` refers to a column name and not an object in the R environment:

```{r}
murders_dt[rate <= 0.7]
```

Notice that we can combine the filter and select into one succint command. Here are the state names and rates for those with rates below 0.7.

```{r}
murders_dt[rate <= 0.7, .(state, rate)]
```

Compare to the **dplyr** approach:

```{r, eval=FALSE}
murders |> filter(rate <= 0.7) |> select(state, rate)
```


## Summarizing data

As an example, we will use the `heights` dataset:

```{r}
library(dplyr)
library(dslabs)
heights_dt <- as.data.table(heights)
```

In **data.table**, we can call functions inside `.()` and they will be applied to rows. So the equivalent of:

```{r}
s <- heights |> 
  summarize(average = mean(height), standard_deviation = sd(height))
```

in **dplyr** is the following in **data.table**:

```{r}
s <- heights_dt[, .(average = mean(height), standard_deviation = sd(height))]
```

Note that this permits a compact way of subsetting and then summarizing. Instead of:

```{r, eval=FALSE}
s <- heights |> 
  filter(sex == "Female") |>
  summarize(average = mean(height), standard_deviation = sd(height))
```

we can write:

```{r}
s <- heights_dt[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]
```

### Multiple summaries

We previously defined the function:

```{r}
median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], minimum = qs[2], maximum = qs[3])
}
```

Similar to **dplyr**, we can call this function within `.()` to obtain the three number summary:

```{r}
heights_dt[, .(median_min_max(height))]
```

### Group then summarize

The `group_by` followed by `summarize` in **dplyr** is performed in one line in **data.table**. We simply add the `by` argument to split the data into groups based on the values in categorical variable:

```{r}
heights_dt[, .(average = mean(height), standard_deviation = sd(height)), by = sex]
```

## Sorting data frames

We can order rows using the same approach we use for filter. Here are the states ordered by murder rate:

```{r}
murders_dt[order(population)] |> head()
```

N To sort the table in descending order, we can order by the negative of `population` or use the `decreasing` argument:

```{r, eval=FALSE}
murders_dt[order(population, decreasing = TRUE)] 
```

### Nested sorting

Similarly, we can perform nested ordering by including more than one variable in order

```{r, eval=FALSE}
murders_dt[order(region, rate)] 
```


## Optional exercises (will not be included in the midterms)

(@) Load the **data.table** package and the `murders` dataset and convert it to `data.table` object:

```{r, eval=FALSE}
library(data.table)
library(dslabs)
murders_dt <- as.data.table(murders)
```

Remember you can add columns like this:

```{r, eval=FALSE}
murders_dt[, population_in_millions := population / 10^6]
```

Add a `murders` column named `rate` with the per 100,000 murder rate as in the example code above.

(@) Add a column `rank` containing the rank, from highest to lowest murder rate.

(@) If we want to only show the states and population sizes, we can use:

```{r, eval=FALSE}
murders_dt[, .(state, population)] 
```

Show the state names and abbreviations in `murders`.

(@) You can show just the New York row like this:

```{r, eval=FALSE}
murders_dt[state == "New York"]
```

You can use other logical vectors to filter rows.

Show the top 5 states with the highest murder rates. After we add murder rate and rank, do not change the `murders` dataset, just show the result. Remember that you can filter based on the `rank` column.

(@) We can remove rows using the `!=` operator. For example, to remove Florida, we would do this:

```{r, eval=FALSE}
no_florida <- murders_dt[state != "Florida"]
```

Create a new data frame called `no_south` that removes states from the South region. How many states are in this category? You can use the function `nrow` for this.

(@) We can also use `%in%` to filter. You can therefore see the data from New York and Texas as follows:

```{r, eval=FALSE}
murders_dt[state %in% c("New York", "Texas")]
```

Create a new data frame called `murders_nw` with only the states from the Northeast and the West. How many states are in this category?

(@) Suppose you want to live in the Northeast or West **and** want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with `filter`. Here is an example in which we filter to keep only small states in the Northeast region.

```{r, eval=FALSE}
murders_dt[population < 5000000 & region == "Northeast"]
```

Make sure `murders` has been defined with `rate` and `rank` and still has all states. Create a table called `my_states` that contains rows for states satisfying both the conditions: they are in the Northeast or West and the murder rate is less than 1. Show only the state name, the rate, and the rank.