
Chp 9 crossrefs and pipe update
mine-cetinkaya-rundel committed Sep 23, 2023
1 parent 2522dae commit 8d91da2
Showing 5 changed files with 375 additions and 323 deletions.
151 changes: 84 additions & 67 deletions 09-model-logistic.qmd
@@ -36,56 +36,59 @@ The [`resume`](http://openintrostat.github.io/openintro/reference/resume.html) d

The first names that were used and randomly assigned in this experiment were selected so that they would predominantly be recognized as belonging to Black or White individuals; other races were not considered in this study.
While no name would definitively be inferred as pertaining to a Black individual or to a White individual, the researchers conducted a survey to check for racial association of the names; names that did not pass this survey check were excluded from usage in the experiment.
You can find the full set of names that did pass the survey test and were ultimately used in the study in Table \@ref(tab:resume-names).
You can find the full set of names that did pass the survey test and were ultimately used in the study in @tbl-resume-names.
For example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted to be associated with a White male.

```{r}
#| label: resume-names
resume_names_full <- resume %>%
select(firstname, race, gender) %>%
distinct(firstname, .keep_all = TRUE) %>%
arrange(firstname) %>%
rownames_to_column() %>%
#| label: tbl-resume-names
#| tbl-cap: |
#| List of all 36 unique names along with the commonly inferred
#| race and sex associated with these names.
resume_names_full <- resume |>
select(firstname, race, gender) |>
distinct(firstname, .keep_all = TRUE) |>
arrange(firstname) |>
rownames_to_column() |>
mutate(
rowname = as.numeric(rowname),
column = cut(rowname, breaks = c(0, 12, 24, 36)),
race = str_to_title(race),
sex = if_else(gender == "f", "female", "male"),
column = as.numeric(column)
) %>%
select(-rowname, -gender) %>%
) |>
select(-rowname, -gender) |>
relocate(column)
resume_names_1 <- resume_names_full %>%
filter(column == 1) %>%
resume_names_1 <- resume_names_full |>
filter(column == 1) |>
select(-column)
resume_names_2 <- resume_names_full %>%
filter(column == 2) %>%
resume_names_2 <- resume_names_full |>
filter(column == 2) |>
select(-column)
resume_names_3 <- resume_names_full %>%
filter(column == 3) %>%
resume_names_3 <- resume_names_full |>
filter(column == 3) |>
select(-column)
resume_names_1 %>%
bind_cols(resume_names_2) %>%
bind_cols(resume_names_3) %>%
resume_names_1 |>
bind_cols(resume_names_2) |>
bind_cols(resume_names_3) |>
kbl(linesep = "", booktabs = TRUE, align = "lllllllll",
col.names = c("first_name", "race", "sex",
"first_name", "race", "sex",
"first_name", "race", "sex"),
caption = "List of all 36 unique names along with the commonly inferred race and sex associated with these names.") %>%
"first_name", "race", "sex")) |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"), full_width = FALSE) %>%
column_spec(4, border_left = T) %>%
latex_options = c("striped", "hold_position"), full_width = FALSE) |>
column_spec(4, border_left = T) |>
column_spec(7, border_left = T)
```
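The column-splitting above relies on `cut()` to bin row numbers into three groups of twelve, which are then bound side by side. A minimal sketch of the same trick (not part of the commit) on six rows:

```r
# Sketch of the cut() trick used above: bin positions 1..6 into
# three columns by breaking at 0, 2, 4, 6, then convert the
# resulting factor to its integer codes.
x <- 1:6
col <- as.numeric(cut(x, breaks = c(0, 2, 4, 6)))
col
#> [1] 1 1 2 2 3 3
```

With breaks at 0, 12, 24, 36 the same idea assigns the 36 names to the three display columns of the table.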

```{r}
#| label: resume-data-prep
resume <- resume %>%
rename(sex = gender) %>%
resume <- resume |>
rename(sex = gender) |>
mutate(
sex = if_else(sex == "m", "man", "woman"),
sex = fct_relevel(sex, "woman", "man"),
@@ -95,17 +98,23 @@ resume <- resume %>%
military = as.factor(military),
has_email_address = as.factor(has_email_address),
race = if_else(race == "black", "Black", "White")
) %>%
) |>
select(received_callback, job_city, college_degree, years_experience,
honors, military, has_email_address, race, sex)
```

The response variable of interest is whether there was a callback from the employer for the applicant, and there were 8 attributes that were randomly assigned that we'll consider, with special interest in the race and sex variables.
Race and sex are protected classes in the United States, meaning they are not legally permitted factors for hiring or employment decisions.
The full set of attributes considered is provided in Table \@ref(tab:resume-variables).
The full set of attributes considered is provided in @tbl-resume-variables.

```{r}
#| label: resume-variables
#| label: tbl-resume-variables
#| tbl-cap: |
#| Descriptions of nine variables from the `resume` dataset. Many
#| of the variables are indicator variables, meaning they take
#| the value 1 if the specified characteristic is present and
#| 0 otherwise.
resume_variables <- tribble(
~variable, ~description,
"received_callback", "Specifies whether the employer called the applicant following submission of the application for the job.",
@@ -119,11 +128,11 @@ resume_variables <- tribble(
"sex", "Sex of the applicant (limited to only man and woman in this study), implied by the first name listed on the resume."
)
resume_variables %>%
kbl(linesep = "", booktabs = TRUE, caption = caption_helper("Descriptions of nine variables from the `resume` dataset. Many of the variables are indicator variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.")) %>%
resume_variables |>
kbl(linesep = "", booktabs = TRUE) |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"), full_width = TRUE) %>%
column_spec(1, monospace = TRUE) %>%
latex_options = c("striped", "hold_position"), full_width = TRUE) |>
column_spec(1, monospace = TRUE) |>
column_spec(2, width = "30em")
```

@@ -171,7 +180,7 @@ $$
logit(p_i) = \log_{e}\left( \frac{p_i}{1-p_i} \right)
$$
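As a quick numeric sketch (not part of the commit), the logit and its inverse are available in base R as `qlogis()` and `plogis()`:

```r
# logit and inverse logit in base R:
# qlogis(p) = log(p / (1 - p)); plogis(x) = exp(x) / (1 + exp(x))
p <- 0.25
x <- qlogis(p)                      # log(0.25 / 0.75), about -1.0986
stopifnot(all.equal(x, log(p / (1 - p))))
stopifnot(all.equal(plogis(x), p))  # round trip back to 0.25
```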

The logit transformation is shown in Figure \@ref(fig:logit-transformation).
The logit transformation is shown in @fig-logit-transformation.
Below, we rewrite the equation relating $Y_i$ to its predictors using the logit transformation of $p_i$:

```{r}
@@ -184,9 +193,10 @@ $$
$$

```{r}
#| label: logit-transformation
#| fig.cap: Values of $p_i$ against values of $logit(p_i)$.
#| out.width: 100%
#| label: fig-logit-transformation
#| fig-cap: Values of $p_i$ against values of $logit(p_i)$.
#| out-width: 100%
logit_df_line <- tibble(
p = seq(0.0001, 0.9999, 0.0002),
lp = log(p / (1 - p))
@@ -195,7 +205,7 @@ logit_df_line <- tibble(
logit_df_point <- tibble(
lp = -5:6,
p = round(exp(lp) / (exp(lp) + 1), 3)
) %>%
) |>
mutate(
label = glue::glue("({lp}, {p})"),
label_pos = case_when(
@@ -216,13 +226,13 @@ ggplot(logit_df_line, aes(x = lp, y = p)) +
) +
geom_point(data = logit_df_point, shape = "circle open",
size = 3, color = IMSCOL["red", "full"], stroke = 2) +
geom_text(data = logit_df_point %>% filter(label_pos == "above"),
geom_text(data = logit_df_point |> filter(label_pos == "above"),
aes(label = label), vjust = -2) +
geom_text(data = logit_df_point %>% filter(label_pos == "below"),
geom_text(data = logit_df_point |> filter(label_pos == "below"),
aes(label = label), vjust = 2.5) +
geom_text(data = logit_df_point %>% filter(label_pos == "right"),
geom_text(data = logit_df_point |> filter(label_pos == "right"),
aes(label = label), hjust = -0.25) +
geom_text(data = logit_df_point %>% filter(label_pos == "left"),
geom_text(data = logit_df_point |> filter(label_pos == "left"),
aes(label = label), hjust = 1.25) +
labs(
x = expression(logit(p[i])),
@@ -263,31 +273,34 @@ b. What would the probability be if the resume did list some honors?
------------------------------------------------------------------------

a. If a randomly chosen resume from those sent out is considered, and it does not list honors, then `honors` takes the value of 0 and the right side of the model equation equals -2.4998. Solving for $p_i$: $\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076$. Just as we labeled a fitted value of $y_i$ with a "hat" in single-variable and multiple regression, we do the same for this probability: $\hat{p}_i = 0.076{}$.
b. If the resume had listed some honors, then the right side of the model equation is $-2.4998 + 0.8668 \times 1 = -1.6330$, which corresponds to a probability $\hat{p}_i = 0.163$. Notice that we could examine -2.4998 and -1.6330 in Figure \@ref(fig:logit-transformation) to estimate the probability before formally calculating the value.
b. If the resume had listed some honors, then the right side of the model equation is $-2.4998 + 0.8668 \times 1 = -1.6330$, which corresponds to a probability $\hat{p}_i = 0.163$. Notice that we could examine -2.4998 and -1.6330 in @fig-logit-transformation to estimate the probability before formally calculating the value.
:::
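The two probabilities in this worked example can be checked with `plogis()`, the inverse logit in base R (a sketch, not part of the commit, assuming the fitted coefficients -2.4998 and 0.8668 shown above):

```r
# Inverse logit of the linear predictor for honors = 0 and honors = 1.
b0 <- -2.4998   # intercept from the honors-only model
b1 <-  0.8668   # coefficient on honors
round(plogis(b0), 3)        # no honors listed -> 0.076
round(plogis(b0 + b1), 3)   # honors listed    -> 0.163
```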

While knowing whether a resume listed honors provides some signal when predicting whether the employer would call, we would like to account for many different variables at once to understand how each of the different resume characteristics affected the chance of a callback.

## Logistic model with many variables

We used statistical software to fit the logistic regression model with all 8 predictors described in Table \@ref(tab:resume-variables).
Like multiple regression, the result may be presented in a summary table, which is shown in Table \@ref(tab:resume-full-fit).
We used statistical software to fit the logistic regression model with all 8 predictors described in @tbl-resume-variables.
Like multiple regression, the result may be presented in a summary table, which is shown in @tbl-resume-full-fit.

```{r}
#| label: resume-full-fit
#| label: tbl-resume-full-fit
resume_full_fit <- logistic_reg() %>%
set_engine("glm") %>%
#| tbl-cap: |
#| Summary table for the full logistic regression model for the
#| resume callback example.
resume_full_fit <- logistic_reg() |>
set_engine("glm") |>
fit(received_callback ~ job_city + college_degree + years_experience + honors + military + has_email_address + race + sex, data = resume, family = "binomial")
resume_full_fit %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
resume_full_fit |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(linesep = "", booktabs = TRUE,
caption = "Summary table for the full logistic regression model for the resume callback example.",
digits = 2, align = "lrrrrr") %>%
digits = 2, align = "lrrrrr") |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "15em", monospace = TRUE) %>%
latex_options = c("striped", "hold_position")) |>
column_spec(1, width = "15em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```

@@ -301,28 +314,32 @@ When using AIC for model selection, models with a lower AIC value are considered
It is important to note that AIC provides information about the quality of a model relative to other models, but does not provide information about the overall quality of a model.

We will look for models with a lower AIC using a backward elimination strategy.
After using this criteria, the variable `college_degree` is eliminated, giving the smaller model summarized in Table \@ref(tab:resume-fit), which is what we'll rely on for the remainder of this section.
After using this criterion, the variable `college_degree` is eliminated, giving the smaller model summarized in @tbl-resume-fit, which is what we'll rely on for the remainder of this section.
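Backward elimination by AIC can be sketched with base R's `step()`. This is not the code from the commit; it uses the raw `resume` data from the openintro package, where the sex column is still named `gender` (it is renamed to `sex` in the data-prep chunk above).

```r
library(openintro)  # provides the resume data used in this chapter

# Full model with all 8 randomly assigned attributes; the raw data
# still calls the sex column `gender`.
full <- glm(
  received_callback ~ job_city + college_degree + years_experience +
    honors + military + has_email_address + race + gender,
  data = resume, family = binomial
)

# step() removes one predictor at a time as long as AIC improves;
# per the text, it should drop college_degree here.
reduced <- step(full, direction = "backward", trace = FALSE)

AIC(reduced) <= AIC(full)  # step() never returns a model worse by AIC
```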

```{r}
#| include: false
terms_chp_09 <- c(terms_chp_09, "Akaike information criterion")
```

```{r}
#| label: resume-fit
resume_fit <- logistic_reg() %>%
set_engine("glm") %>%
#| label: tbl-resume-fit
#| tbl-cap: |
#| Summary table for the logistic regression model for the resume
#| callback example, where variable selection has been performed
#| using AIC.
resume_fit <- logistic_reg() |>
set_engine("glm") |>
fit(received_callback ~ job_city + years_experience + honors + military + has_email_address + race + sex, data = resume, family = "binomial")
resume_fit %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
resume_fit |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(linesep = "", booktabs = TRUE,
caption = "Summary table for the logistic regression model for the resume callback example, where variable selection has been performed using AIC.",
digits = 2, align = "lrrrrr") %>%
digits = 2, align = "lrrrrr") |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position")) %>%
column_spec(1, width = "17em", monospace = TRUE) %>%
latex_options = c("striped", "hold_position")) |>
column_spec(1, width = "17em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```

@@ -337,12 +354,12 @@ This positive coefficient reflects a positive gain in callback rate for resumes
The model results suggest that prospective employers favor resumes where the first name is typically interpreted to be White.
:::

The coefficient of $\texttt{race}_{\texttt{White}}$ in the full model in Table \@ref(tab:resume-full-fit), is nearly identical to the model shown in Table \@ref(tab:resume-fit).
The coefficient of $\texttt{race}_{\texttt{White}}$ in the full model in @tbl-resume-full-fit is nearly identical to the one in the model shown in @tbl-resume-fit.
The predictors in this experiment were thoughtfully laid out so that the coefficient estimates would typically not be much influenced by which other predictors were in the model, which aligned with the motivation of the study to tease out which effects were important to getting a callback.
In most observational data, it's common for point estimates to change a little, and sometimes a lot, depending on which other variables are included in the model.
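One supplementary way to read the $\texttt{race}_{\texttt{White}}$ coefficient (a sketch, not part of the commit): exponentiating a logistic regression coefficient converts it from the log odds scale to an odds ratio.

```r
# Coefficient on race_White from the AIC-selected model above.
b_race <- 0.4429
exp(b_race)  # about 1.56: holding the other predictors fixed,
             # resumes with White-sounding names have roughly 1.56
             # times the odds of a callback
```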

::: {.workedexample data-latex=""}
Use the model summarized in Table \@ref(tab:resume-fit) to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male.
Use the model summarized in @tbl-resume-fit to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years of experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male.

------------------------------------------------------------------------

@@ -351,8 +368,8 @@ We can start by writing out the equation using the coefficients from the model:
$$
\begin{aligned}
&\log_e \left(\frac{p}{1 - p}\right) \\
&= - 2.7162 - 0.4364 \times \texttt{job_city}_{\texttt{Chicago}} \\
& \quad \quad + 0.0206 \times \texttt{years_experience} \\
&= - 2.7162 - 0.4364 \times \texttt{job\_city}_{\texttt{Chicago}} \\
& \quad \quad + 0.0206 \times \texttt{years\_experience} \\
& \quad \quad + 0.7634 \times \texttt{honors} - 0.3443 \times \texttt{military} + 0.2221 \times \texttt{email} \\
& \quad \quad + 0.4429 \times \texttt{race}_{\texttt{White}} - 0.1959 \times \texttt{sex}_{\texttt{man}}
\end{aligned}
$$
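Evaluating that right-hand side numerically (a sketch assuming the coefficients shown above; the remainder of the worked example carries out the same arithmetic):

```r
# Linear predictor for: Chicago job, 14 years of experience, no honors,
# no military experience, email listed, White-sounding male name.
eta <- -2.7162 - 0.4364 * 1 + 0.0206 * 14 +
        0.7634 * 0 - 0.3443 * 0 + 0.2221 * 1 +
        0.4429 * 1 - 0.1959 * 1
eta          # -2.3951
plogis(eta)  # inverse logit: a callback probability of about 0.084
```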
4 changes: 2 additions & 2 deletions _freeze/09-model-logistic/execute-results/html.json

Large diffs are not rendered by default.

Binary file modified _freeze/09-model-logistic/figure-html/unnamed-chunk-14-1.png
