03_categorical_continuous.qmd

# Categorical features

Categorical data are a common occurrence in regression modeling. These are any data that consist of categories rather than numbers. For instance, in a forestry experiment, you may thin the trees in some plots and do a prescribed burn in others. "Prescribed burn" and "thinning" aren't numerical values, they are categories of treatment.

Sometimes categories have a natural order, like low, medium, high, which blurs
the distinction between categorical and numerical data. We'll also cover ways
to handle these.

```{r hidden-load}
#| echo: false
#| include: false

# import packages and get a small version of the penguins data
library(palmerpenguins)
library(ggplot2)
library(dplyr)

data(penguins)
```


## Factors

In R, categorical variables are called **factors**. Deep down in the machinery
of a regression model, factor effects are handled the same way as continuous
features. To fully appreciate this, you have to understand that factors are
coded differently than continuous features. To begin, note that linear
regression for factor variables is also a kind of scatterplot smoother. Let's
look at an example:

### Penguin body mass

Returning to the `palmerpenguins` dataset, the plot below shows the mass of penguins of three different species.

```{r penguins-spp-example}
#| warning: false
#| message: false
#| code-fold: true
ggplot(penguins) +
  aes(x=species, y=body_mass_g) +
  geom_point() +
  ggtitle("Body mass of penguins by species") +
  ylab("Mass (g)")
```

Look carefully at the x-axis and you'll see that the coordinates are species names, not numbers. A line drawn to fit the points would imply that there is an order to the species, and that there are some intermediate values between the species.

```{r penguin-factor-example-smoothed}
#| warning: false
#| message: false
#| code-fold: true
ggplot(penguins) +
  aes(x=species, y=body_mass_g) +
  geom_point() +
  geom_abline(intercept=1800, slope=1200, color="blue") +
  ggtitle("Body mass of penguins by species with linear fit") +
  ylab("Mass (g)")
```

Clearly, we need to treat the species as categories, rather than as coordinates along a continuum. Looking at the summary of a regression model gives a clue as to how that works.

```{r penguin-mass_model}
penguin_lm = lm(body_mass_g ~ species, data=penguins)
summary(penguin_lm)
```

Now let's see the mean masses of the three species.

```{r penguin-species-mass}
group_by(penguins, species) |>
  summarize(mass = mean(body_mass_g, na.rm=TRUE))
```

Notice that the fitted value of the `(Intercept)` is identical to the average mass of an Adelie penguin, and that there are two rows of species effects, instead of the one row that we saw for the continuous effects so far. The values of the `Estimate` column in those rows are the estimated effects for Chinstrap and Gentoo penguins - and they are equal to the difference between the average mass of those penguins and Adelie penguins.

### Design matrix

The reason that the results look like this can be made more clear if you look at the way R converts the categories to numbers. The `model.matrix()` is the function that R uses internally to prepare data for an `lm()`, but we can call it ourselves. Remember that linear regression works by multiplying each term by a coefficient, adding adding the results together. Here, we have three terms for the three species. Adelie has been automatically selected as the reference level because it appears first in the data.

```{r penguin-model-matrix}
tail(model.matrix(penguin_lm))
```

Here, we have three terms: `(Intercept)`, `speciesChinstrap`, and
`speciesGentoo`. So to calculate the mass that is estimated for row 12, we will
add `1 * (Intercept) + 1 * speciesGentoo`. Since the effect `(Intercept)` is
the average mass of the Adelie penguins, the effect `speciesGentoo` must be the
difference between the mean mass of Gentoo penguins and the mean mass of Adelie
penguins.  Why no term for `speciesAdelie`? It's because the fit would then
depend on how mass was apportioned between the penguin species and the
intercept. You could increase the intercept by 10 grams and reduce all three
species estimates by 10 grams and end up with the same model fit, despite
different effects. The computer has no way of deciding between those options,
because it is just minimizing the model error.

You don't have to set one factor level to be the reference for estimation, but that goes beyond the scope of this introductory workshop. You may need to change which level is the reference, and that is within our scope. R has a function `relevel()`, which takes the argument `ref=`, which specifies the reference level of a factor. Here is how it would work to set Gentoo as the reference level:

```{r relevel-penguins}
penguins$species = relevel(penguins$species, ref="Gentoo")
releveled_penguin_model = lm(body_mass_g ~ species, data=penguins)
summary(releveled_penguin_model)
```

With Gentoo as the reference level for species, the summary of model results tells us directly that the body mass of Adelie and Chinstrap penguins are both significantly less than that of Gentoo penguins. However we set the factor levels, the model predictions remain unchanged, which helps emphasize that the factor coding affects which interpretation(s) are emphasized in the summary but does not change the model.

```{r model-comparison}
# compare predictions between the models with different contrasts:
pred_df = data.frame(species=c("Adelie", "Chinstrap", "Gentoo"))
predict(penguin_lm, pred_df)
predict(releveled_penguin_model, pred_df)
```


## Combining continuous and categorical

Of course, continuous and categorical features don't have to be kept separate.
We'll return to our original example and consider whether the relationship
between a penguin's mass and its flipper length is different between the three
species. You combine categorical and continuous features by adding them
together in the model formula. 

```{r no_interation_model}
# create a model with both species and flipper length as features
combo_model = lm(body_mass_g ~ flipper_length_mm + species,
     data=penguins)
```

Here is a visualization of the model fit:
```{r plot-no-interactions}
#| warning: false
#| message: false
#| code-fold: true
penguins$nointeract = predict(combo_model, newdata=penguins)
ggplot(penguins) +
  aes(x=flipper_length_mm, y=body_mass_g, color=species) +
  geom_point() +
  xlab("Flipper length (mm)") +
  ylab("Penguin mass (g)") +
  geom_smooth(data=penguins,
                mapping=aes(x=flipper_length_mm, y=nointeract, color=species),
                method='lm',
                se=FALSE)
```

As you can see, adding a categorical factor to the model has resulted in a vertical offset between the regression lines for each species, and the three regression lines all have the same slope. You should therefore interpret the species effects as species-specific intercepts for the regression line. Now, let's generate the summary plots and the model summary.


```{r analyze-no-interactions}
#| warning: false
#| message: false
# plot the model diagnostics
layout(matrix(1:4, 2, 2))
plot(combo_model)

# check out the model summary
summary(combo_model)
```

The summary shows how big the differences are between the species-specific intercepts. The diagnostic plots still exhibit a "U" shape in the Fitted Vs. Residual plot, so we haven't yet found an ideal regression model for this data.

## Interactions

We can improve this model further by adding an interaction between the species and the flipper length. An interaction allows the regression lines to have different slopes for the different species, as seen here:

```{r plot-interactions}
#| warning: false
#| message: false
#| code-fold: true
ggplot(penguins) +
  aes(x=flipper_length_mm, y=body_mass_g, color=species) +
  geom_point() +
  xlab("Flipper length (mm)") +
  ylab("Penguin mass (g)") +
  geom_smooth(method='lm', se=FALSE)
```

There will still be different intercepts between species because I have retained the so-called "main effect" of species. An intercept is written in an R formula by placing a colon (`:`) between two variables.


```{r interaction_model}
# create a model interacting species and flipper length as features
interaction_model =
  lm(body_mass_g ~ flipper_length_mm + species + flipper_length_mm:species,
     data=penguins)
```

And now we can look at the diagnostics.

```{r analyze-interaction-model}
#| warning: false
#| message: false
# plot the model diagnostics
layout(matrix(1:4, 2, 2))
plot(interaction_model)

# check out the model summary
summary(interaction_model)
```

The "U" shape in the residuals is gone! Also, the significantly negative coefficients for the interactions of flipper length with species Adelie and Chinstrap tells us that body mass for these species increases less quickly with flipper length than for Gentoo penguins. You can see this same relationship in the steeper slope of the red line in the scatter plot with interactions.