# Applications: Model {#sec-model-application}
```{r}
#| include: false
source("_common.R")
library(ggpubr)
```
## Case study: Houses for sale {#sec-model-case-study}
Take a walk around your neighborhood and you'll probably see a few houses for sale, and you might be able to look up their prices online.
You'll note that house prices are somewhat arbitrary -- the homeowners get to decide the listing price, and many criteria factor into this decision, e.g., what comparable houses ("comps" in real estate speak) sell for, how quickly they need to sell the house, etc.
In this case study we'll formalize the process of determining the listing price of a house by using data on current home sales.
In November of 2020, information on `r nrow(duke_forest)` houses in the Duke Forest neighborhood of Durham, NC was scraped from [Zillow](https://www.zillow.com).
The homes were all recently sold at the time of data collection, and the goal of the project was to build a model for predicting the sale price based on a particular home's characteristics.
The first four homes are shown in @tbl-duke-data-frame, and descriptions of each variable are shown in @tbl-duke-variables.
::: {.data data-latex=""}
The [`duke_forest`](http://openintrostat.github.io/openintro/reference/duke_forest.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
```{r}
#| label: tbl-duke-data-frame
#| tbl-cap: Top four rows of `duke_forest`.
#| tbl-pos: H
duke_forest |>
select(price, bed, bath, area, year_built, cooling, lot) |>
slice_head(n = 4) |>
kbl(
linesep = "", booktabs = TRUE,
row.names = FALSE, format.args = list(big.mark = ",")
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
)
```
```{r}
#| label: tbl-duke-variables
#| tbl-cap: Variables and their descriptions for the `duke_forest` dataset.
#| tbl-pos: H
duke_forest_var_def <- tribble(
~variable, ~description,
"price", "Sale price, in USD",
"bed", "Number of bedrooms",
"bath", "Number of bathrooms",
"area", "Area of home, in square feet",
"year_built", "Year the home was built",
"cooling", "Cooling system: central or other (other is baseline)",
"lot", "Area of the entire property, in acres"
)
duke_forest_var_def |>
kbl(linesep = "", booktabs = TRUE,
col.names = c("Variable", "Description")) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"),
full_width = FALSE
) |>
column_spec(1, width = "15em", monospace = TRUE) |>
column_spec(2, width = "25em")
```
### Correlating with `price`
As mentioned, the goal of the data collection was to build a model for the sale price of homes.
While using multiple predictor variables is likely preferable to using only one variable, we start by learning about the variables themselves and their relationship to price.
@fig-single-scatter shows scatterplots describing price as a function of each of the predictor variables.
All of the variables seem to be positively associated with price (higher values of the variable are matched with higher price values).
```{r}
#| label: fig-single-scatter
#| out-width: 100%
#| fig-asp: 0.78
#| fig-cap: |
#| Scatterplots describing six different predictor variables'
#| relationship with the price of a home.
#| fig-alt: |
#| Six scatterplots where the observations are different homes.
#| All plots have sale price on the y axis. The x axes are number of
#| bedrooms, number of bathrooms, square feet, year built, cooling type
#| and area of property. All variables are positively correlated with
#| price of the house. Square footage of the home is the most highly correlated.
pr_bed <- ggplot(duke_forest, aes(x = bed, y = price)) +
geom_point(alpha = 0.8) +
labs(
x = "Number of bedrooms",
y = "Sale price (USD)"
) +
stat_cor(aes(label = paste("r", after_stat(r), sep = "~`=`~"))) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
pr_bath <- ggplot(duke_forest, aes(x = bath, y = price)) +
geom_point(alpha = 0.8) +
labs(
x = "Number of bathrooms",
y = "Sale price (USD)"
) +
stat_cor(aes(label = paste("r", after_stat(r), sep = "~`=`~"))) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
pr_area <- ggplot(duke_forest, aes(x = area, y = price)) +
geom_point(alpha = 0.8) +
labs(
x = "Area of home (in square feet)",
y = "Sale price (USD)"
) +
stat_cor(aes(label = paste("r", after_stat(r), sep = "~`=`~"))) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
pr_year <- ggplot(duke_forest, aes(x = year_built, y = price)) +
geom_point(alpha = 0.8) +
labs(
x = "Year built",
y = "Sale price (USD)"
) +
stat_cor(aes(label = paste("r", after_stat(r), sep = "~`=`~"))) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
pr_cool <- ggplot(duke_forest, aes(x = cooling, y = price)) +
geom_point(alpha = 0.8) +
labs(
x = "Cooling type",
y = "Sale price (USD)"
) +
stat_cor(aes(label = paste("r", after_stat(r), sep = "~`=`~"))) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
pr_lot <- ggplot(duke_forest, aes(x = lot, y = price)) +
geom_point(alpha = 0.8) +
labs(
x = "Area of property (in acres)",
y = "Sale price (USD)"
) +
stat_cor(aes(label = paste("r", after_stat(r), sep = "~`=`~"))) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
pr_bed + pr_bath + pr_area + pr_year + pr_cool + pr_lot +
plot_layout(ncol = 2)
```
::: {.guidedpractice data-latex=""}
In @fig-single-scatter there does not appear to be a correlation value calculated for the predictor variable, `cooling`.
Why not?
Can the variable still be used in the linear model?[^10-model-applications-1]
:::
[^10-model-applications-1]: The correlation coefficient can only be calculated to describe the relationship between two numerical variables.
The predictor variable `cooling` is categorical, not numerical.
It *can*, however, be used in the linear model as a binary indicator variable coded, for example, with a `1` for central and `0` for other.
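To make the indicator coding concrete, here is a minimal sketch in base R. The five `cooling` values are hypothetical and chosen only for illustration; they are not rows from `duke_forest`. The sketch shows the 0/1 coding done by hand alongside the design matrix that `model.matrix()` builds from a factor whose baseline level is `other`.

```r
# Hypothetical cooling values for five homes (illustration only,
# not taken from duke_forest).
cooling <- factor(
  c("other", "central", "central", "other", "central"),
  levels = c("other", "central") # the first level acts as the baseline
)

# Manual indicator coding: 1 for central, 0 for other.
cooling_central <- as.integer(cooling == "central")
cooling_central

# The design matrix a linear model would build from the factor:
# an intercept column plus a 0/1 column named "coolingcentral".
design <- model.matrix(~ cooling)
design
```

The `coolingcentral` column matches the manual coding, which is why a two-level categorical predictor can enter a linear model even though a correlation coefficient is not defined for it.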
::: {.workedexample data-latex=""}
In @fig-single-scatter which variable seems to be most informative for predicting house price?
Provide two reasons for your answer.
------------------------------------------------------------------------
The `area` of the home is the variable which is most highly correlated with `price`.
Additionally, the scatterplot for `price` vs. `area` seems to show a strong linear relationship between the two variables.
Note that the correlation coefficient and the scatterplot linearity will often give the same conclusion.
However, recall that the correlation coefficient is very sensitive to outliers, so it is always wise to look at the scatterplot even when the variables are highly correlated.
:::
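The sensitivity of the correlation coefficient to outliers can be seen in a small sketch. The values below are made up for illustration and have nothing to do with the housing data: ten points follow a clear upward trend, and then a single unusual point is added.

```r
# Made-up data with a strong positive linear trend (illustration only).
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
r_without <- cor(x, y)

# Add one point that falls far below the trend.
x_out <- c(x, 11)
y_out <- c(y, -30)
r_with <- cor(x_out, y_out)

# Compare the two correlation coefficients.
c(without_outlier = r_without, with_outlier = r_with)
```

A single point is enough to flip the sign of the correlation here, which a scatterplot would reveal immediately.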
\clearpage
### Modeling `price` with `area`
A linear model was fit to predict `price` from `area`.
The resulting model information is given in @tbl-price-slr.
```{r}
#| label: tbl-price-slr
#| tbl-cap: Summary of least squares fit for price on area.
#| tbl-pos: H
m_small <- duke_forest |>
lm(price ~ area, data = _)
m_small_r_sq_adj <- glance(m_small)$adj.r.squared |> round(4)
m_small_df_residual <- glance(m_small)$df.residual |> round(4)
m_small_w_rsq <- m_small |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
add_row(term = glue("Adjusted R-sq = {m_small_r_sq_adj}")) |>
add_row(term = glue("df = {m_small_df_residual}"))
m_small_w_rsq |>
kbl(linesep = "", booktabs = TRUE,
digits = c(0,0,0,2,4), align = "lrrrr", format.args = list(big.mark = ",")) |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "20em") |>
column_spec(1, monospace = ifelse(as.numeric(rownames(m_small_w_rsq)) < 3, TRUE, FALSE)) |>
column_spec(2:5, width = "5em") |>
pack_rows("", 3,4) |>
add_indent(3:4) |>
row_spec(3:4, italic = TRUE)
```
::: {.guidedpractice data-latex=""}
Interpret the value of $b_1$ = 159 in the context of the problem.[^10-model-applications-2]
:::
[^10-model-applications-2]: For each additional square foot of house, we would expect such houses to cost, on average, \$159 more.
::: {.guidedpractice data-latex=""}
Using the output in @tbl-price-slr, write out the model for predicting `price` from `area`.[^10-model-applications-3]
:::
[^10-model-applications-3]: $\widehat{\texttt{price}} = 116,652 + 159 \times \texttt{area}$
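The fitted line can also be written as a small R function, which makes the slope interpretation easy to check. This is a sketch that uses the rounded coefficients reported above; the function name `predict_price_slr()` and the 2,000 square foot home are hypothetical choices for illustration.

```r
# Rounded coefficients from the simple linear regression of price on area.
b0 <- 116652 # intercept, in USD
b1 <- 159    # slope, in USD per square foot

# Predicted sale price for a home with the given area (in square feet).
predict_price_slr <- function(area) {
  b0 + b1 * area
}

# Predicted price for a hypothetical 2,000 square foot home.
predict_price_slr(2000)

# One additional square foot changes the prediction by exactly the slope.
predict_price_slr(2001) - predict_price_slr(2000)
```

Because the coefficients are rounded, predictions from this function will differ slightly from those of the fitted model object.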
The residuals from the linear model can be used to assess whether a linear model is appropriate.
@fig-price-resid-slr plots the residuals $e_i = y_i - \hat{y}_i$ on the $y$-axis and the fitted (or predicted) values $\hat{y}_i$ on the $x$-axis.
```{r}
#| label: fig-price-resid-slr
#| fig-cap: |
#| Residuals versus predicted values for the model predicting
#| sale price from area of home.
#| fig-alt: |
#| Residual scatterplot showing predicted sale price on the x axis and
#| residual on the y axis from the model on square feet only. There is a fan
#| shape showing a possible deviation from the equal variance condition.
#| fig-width: 7
duke_forest |>
lm(price ~ area, data = _) |>
augment() |>
ggplot(aes(x = .fitted, y = .resid)) +
geom_point(size = 2, alpha = 0.8) +
labs(
x = "Predicted values of sale price (in USD)",
y = "Residuals"
) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
```
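The residual definition $e_i = y_i - \hat{y}_i$ can be verified numerically. The sketch below uses R's built-in `cars` data as a stand-in so that it runs without any extra packages; it is not the model fit in this chapter, and the same steps would apply to any model fit with `lm()`.

```r
# Fit a simple linear regression on the built-in cars data
# (a stand-in for the housing data, used only for illustration).
fit <- lm(dist ~ speed, data = cars)

y <- cars$dist       # observed values
y_hat <- fitted(fit) # fitted (predicted) values
e <- y - y_hat       # residuals computed from the definition

# The manual residuals agree with those stored in the model object.
all.equal(unname(e), unname(resid(fit)))

# With an intercept in the model, least squares residuals sum to
# (numerically) zero, so they scatter above and below the zero line.
sum(e)
```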
\clearpage
::: {.guidedpractice data-latex=""}
What aspect(s) of the residual plot indicate that a linear model is appropriate?
What aspect(s) of the residual plot seem concerning when fitting a linear model?[^10-model-applications-4]
:::
[^10-model-applications-4]: The residual plot shows that the relationship between `area` and `price` of a home is indeed linear.
However, the residuals are quite large for expensive homes.
The large residuals indicate potential outliers or increasing variability, either of which could warrant more involved modeling techniques than are presented in this chapter.
### Modeling `price` with multiple variables
It seems as though the predictions of home price might be more accurate if more than one predictor variable were used in the linear model.
@tbl-price-mlr displays the output from a linear model of `price` regressed on `area`, `bed`, `bath`, `year_built`, `cooling`, and `lot`.
```{r}
#| label: tbl-price-mlr
#| tbl-cap: |
#| Summary of least squares fit for price on multiple predictor
#| variables.
#| tbl-pos: H
m_full <- duke_forest |>
lm(price ~ area + bed + bath + year_built + cooling + lot, data = _)
m_full_r_sq_adj <- glance(m_full)$adj.r.squared |> round(4)
m_full_df_residual <- glance(m_full)$df.residual |> round(4)
m_full_w_rsq <- m_full |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
add_row(term = glue("Adjusted R-sq = {m_full_r_sq_adj}")) |>
add_row(term = glue("df = {m_full_df_residual}"))
m_full_w_rsq |>
kbl(linesep = "", booktabs = TRUE,
digits = c(0,0,0,2,4), align = "lrrrr", format.args = list(big.mark = ",")) |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "20em") |>
column_spec(1, monospace = ifelse(as.numeric(rownames(m_full_w_rsq)) < 8, TRUE, FALSE)) |>
column_spec(2:5, width = "5em") |>
pack_rows("", 8, 9) |>
add_indent(8:9) |>
row_spec(8:9, italic = TRUE)
```
::: {.workedexample data-latex=""}
Using @tbl-price-mlr, write out the linear model of price on the six predictor variables.
------------------------------------------------------------------------
$$
\begin{aligned}
\widehat{\texttt{price}} = -2,910,715 &+ 102 \times \texttt{area} \\
&- 13,692 \times \texttt{bed} \\
&+ 41,076 \times \texttt{bath} \\
&+ 1,459 \times \texttt{year\_built} \\
&+ 84,065 \times \texttt{cooling}_{\texttt{central}} \\
&+ 356,141 \times \texttt{lot}
\end{aligned}
$$
:::
::: {.guidedpractice data-latex=""}
The value of the estimated coefficient on $\texttt{cooling}_{\texttt{central}}$ is $b_5 = 84,065.$ Interpret the value of $b_5$ in the context of the problem.[^10-model-applications-5]
:::
[^10-model-applications-5]: The coefficient indicates that if all the other variables are kept constant, homes with central air conditioning cost \$84,065 more, on average.
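The phrase "if all the other variables are kept constant" can be made concrete by comparing two predictions that differ only in the cooling indicator. The sketch below uses the rounded coefficients from the six-predictor model; the function name `predict_price_full()` and the characteristics of the example home are hypothetical.

```r
# Predicted price from the six-predictor model, using the rounded
# coefficients reported above. cooling_central is 1 for central, 0 for other.
predict_price_full <- function(area, bed, bath, year_built,
                               cooling_central, lot) {
  -2910715 + 102 * area - 13692 * bed + 41076 * bath +
    1459 * year_built + 84065 * cooling_central + 356141 * lot
}

# Two hypothetical homes that are identical except for their cooling system.
with_central <- predict_price_full(
  area = 2000, bed = 3, bath = 2, year_built = 1970,
  cooling_central = 1, lot = 0.5
)
without_central <- predict_price_full(
  area = 2000, bed = 3, bath = 2, year_built = 1970,
  cooling_central = 0, lot = 0.5
)

# The difference in predictions equals the coefficient on the indicator.
with_central - without_central
```

Whatever values are chosen for the other five predictors, the difference between the two predictions stays the same, because those terms cancel.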
A friend suggests that maybe you do not need all six variables to have a good model for `price`.
You consider taking a variable out, but you aren't sure which one to remove.
```{r}
#| label: backward-step-1
m_area_r_sq_adj <- update(m_full, . ~ . - area, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_bed_r_sq_adj <- update(m_full, . ~ . - bed, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_bath_r_sq_adj <- update(m_full, . ~ . - bath, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_year_built_r_sq_adj <- update(m_full, . ~ . - year_built, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_cooling_r_sq_adj <- update(m_full, . ~ . - cooling, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_lot_r_sq_adj <- update(m_full, . ~ . - lot, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
```
::: {.workedexample data-latex=""}
Results corresponding to the full model for the housing data are shown in @tbl-price-mlr.
How should we proceed under the backward elimination strategy?
------------------------------------------------------------------------
Our baseline adjusted $R^2$ from the full model is `r m_full_r_sq_adj`, and we need to determine whether dropping a predictor will improve the adjusted $R^2$.
To check, we fit models that each drop a different predictor, and we record the adjusted $R^2$:
- Excluding `area`: `r m_area_r_sq_adj`
- Excluding `bed`: `r m_bed_r_sq_adj`
- Excluding `bath`: `r m_bath_r_sq_adj`
- Excluding `year_built`: `r m_year_built_r_sq_adj`
- Excluding `cooling`: `r m_cooling_r_sq_adj`
- Excluding `lot`: `r m_lot_r_sq_adj`
The model without `bed` has the highest adjusted $R^2$ of `r m_bed_r_sq_adj`, higher than the adjusted $R^2$ for the full model.
Because eliminating `bed` leads to a model with a higher adjusted $R^2$ than the full model, we drop `bed` from the model.
It might seem counter-intuitive to exclude number of bedrooms from the model.
After all, we would expect homes with more bedrooms to cost more, and we can see a clear relationship between number of bedrooms and sale price in @fig-single-scatter.
However, note that `area` is still in the model, and it's quite likely that the area of the home and the number of bedrooms are highly associated.
Therefore, the model already has information on "how much space is available in the house" with the inclusion of `area`.
Since we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors.
Our baseline adjusted $R^2$ is now `r m_bed_r_sq_adj`.
We fit another set of new models, which consider eliminating each of the remaining predictors in addition to `bed`:
```{r}
#| label: backward-step-2
m_full_no_bed <- duke_forest |>
lm(price ~ area + bath + year_built + cooling + lot, data = _)
m_area_r_sq_adj <- update(m_full_no_bed, . ~ . - area, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_bath_r_sq_adj <- update(m_full_no_bed, . ~ . - bath, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_year_built_r_sq_adj <- update(m_full_no_bed, . ~ . - year_built, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_cooling_r_sq_adj <- update(m_full_no_bed, . ~ . - cooling, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
m_lot_r_sq_adj <- update(m_full_no_bed, . ~ . - lot, data = duke_forest) |> glance() |> pull(adj.r.squared) |> round(4)
```
- Excluding `bed` and `area`: `r m_area_r_sq_adj`
- Excluding `bed` and `bath`: `r m_bath_r_sq_adj`
- Excluding `bed` and `year_built`: `r m_year_built_r_sq_adj`
- Excluding `bed` and `cooling`: `r m_cooling_r_sq_adj`
- Excluding `bed` and `lot`: `r m_lot_r_sq_adj`
None of these models leads to an improvement in adjusted $R^2$, so we do not eliminate any of the remaining predictors.
:::
That is, after backward elimination, we are left with the model that keeps all predictors except `bed`, which we can summarize using the coefficients from @tbl-price-full-except-bed.
```{r}
#| label: tbl-price-full-except-bed
#| tbl-cap: |
#| Summary of least squares fit for price on multiple predictor
#| variables, excluding number of bedrooms.
#| tbl-pos: H
m_full_no_bed_r_sq_adj <- glance(m_full_no_bed)$adj.r.squared |> round(4)
m_full_no_bed_df_residual <- glance(m_full_no_bed)$df.residual |> round(4)
m_full_no_bed_w_rsq <- m_full_no_bed |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
add_row(term = glue("Adjusted R-sq = {m_full_no_bed_r_sq_adj}")) |>
add_row(term = glue("df = {m_full_no_bed_df_residual}"))
m_full_no_bed_w_rsq |>
kbl(linesep = "", booktabs = TRUE,
digits = c(0,0,0,2,4), align = "lrrrr", format.args = list(big.mark = ",")) |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")) |>
column_spec(1, width = "20em") |>
column_spec(1, monospace = ifelse(as.numeric(rownames(m_full_no_bed_w_rsq)) < 7, TRUE, FALSE)) |>
column_spec(2:5, width = "5em") |>
pack_rows("", 7,8) |>
add_indent(7:8) |>
row_spec(7:8, italic = TRUE)
```
\clearpage
Then, the linear model for predicting sale price based on this model is as follows:
$$
\begin{aligned}
\widehat{\texttt{price}} = &-2,952,641 + 99 \times \texttt{area} + 36,228 \times \texttt{bath} + 1,466 \times \texttt{year\_built} \\
&+ 83,856 \times \texttt{cooling}_{\texttt{central}} + 357,119 \times \texttt{lot}
\end{aligned}
$$
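The backward elimination strategy used to arrive at this model can be sketched as a short loop. The following is a minimal illustration in base R, not the code used to produce the tables in this chapter. The helper names `adj_r2()` and `backward_eliminate()` are hypothetical, and the built-in `mtcars` data stand in for `duke_forest` so that the block runs without extra packages.

```r
# Adjusted R-squared for a linear model of `response` on `predictors`.
adj_r2 <- function(response, predictors, data) {
  rhs <- paste(predictors, collapse = " + ")
  fit <- lm(as.formula(paste(response, "~", rhs)), data = data)
  summary(fit)$adj.r.squared
}

# Backward elimination: repeatedly drop the single predictor whose removal
# gives the largest adjusted R-squared, as long as that is an improvement.
backward_eliminate <- function(response, predictors, data) {
  current <- predictors
  best <- adj_r2(response, current, data)
  while (length(current) > 1) {
    # Adjusted R-squared of each model with one predictor removed.
    candidates <- sapply(current, function(p) {
      adj_r2(response, setdiff(current, p), data)
    })
    if (max(candidates) <= best) break # no removal improves the model
    current <- setdiff(current, names(which.max(candidates)))
    best <- max(candidates)
  }
  list(predictors = current, adj_r_squared = best)
}

# Stand-in example with the built-in mtcars data.
start <- c("wt", "hp", "qsec", "drat")
result <- backward_eliminate("mpg", start, mtcars)
result
```

With the **openintro** package loaded, calling `backward_eliminate()` with `"price"`, the six predictors, and `duke_forest` should retrace the two steps of the worked example, although we have not shown that output here.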
::: {.workedexample data-latex=""}
The residual plot for the model with all of the predictor variables except `bed` is given in @fig-price-resid-mlr-nobed.
How do the residuals in @fig-price-resid-mlr-nobed compare to the residuals in @fig-price-resid-slr?
------------------------------------------------------------------------
The residuals, for the most part, are randomly scattered around 0.
However, there is one extreme outlier with a residual of -\$750,000, a house whose actual sale price is a lot lower than its predicted price.
Also, we observe again that the residuals are quite large for expensive homes.
:::
```{r}
#| label: fig-price-resid-mlr-nobed
#| fig-cap: |
#| Residuals versus predicted values for the model predicting
#| sale price from all predictors except for number of bedrooms.
#| fig-alt: |
#| Residual scatterplot showing predicted sale price on the x axis
#| and residual on the y axis from the model on all variables except number
#| of bedrooms. There is a fan shape showing
#| a possible deviation from the equal variance condition.
#| fig-width: 7
#| fig-asp: 0.5
m_full_no_bed |>
augment() |>
ggplot(aes(x = .fitted, y = .resid)) +
geom_point(size = 2, alpha = 0.8) +
labs(
x = "Predicted values of house price (in USD)",
y = "Residuals"
) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K"))
```
```{r}
# Note: the results of this chunk are not used, because the values in the
# two guided practices below are now hard-coded. The coefficients shown in
# the text are rounded, so predictions computed from them differed too much
# from the predicted value returned by predict().
new_house <- tibble(
area = 1803,
bath = 2.5,
lot = 0.145,
year_built = 1941,
cooling = "central"
)
new_house_pred <- round(predict(m_full_no_bed, newdata = new_house), 0)
new_house_obs <- 804133
new_house_resid <- 804133 - new_house_pred
```
::: {.guidedpractice data-latex=""}
Consider a house with 1,803 square feet, 2.5 bathrooms, 0.145 acres, built in 1941, that has central air conditioning.
What is the predicted price of the home?[^10-model-applications-6]
:::
[^10-model-applications-6]: $\widehat{\texttt{price}} = -2,952,641 + 99 \times 1803 + 36,228 \times 2.5 + 1,466 \times 1941 + 83,856 \times 1 + 357,119 \times 0.145 = \$297,570.$
::: {.guidedpractice data-latex=""}
If you later learned that the house (with a predicted price of \$297,570) had recently sold for \$804,133, would you think the model was terrible?
What if you learned that the house was in California?[^10-model-applications-7]
:::
[^10-model-applications-7]: A residual of \$506,563 is reasonably big.
Note that the large residuals (except a few homes) in @fig-price-resid-mlr-nobed are closer to \$250,000 (about half as big).
After we learn that the house is in California, we realize that the model shouldn't be applied to the new home at all!
The original data are from Durham, NC, and models based on the Durham, NC data should be used only to explore patterns in prices for homes in Durham, NC.
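The arithmetic in the two guided practices above can be checked with a few lines of R. This sketch uses the rounded coefficients from the model without `bed`, so its result can differ from what `predict()` would return for the fitted model object; the function name `predict_price()` is a hypothetical choice.

```r
# Predicted price from the model without bed, using rounded coefficients.
# cooling_central is 1 for central air conditioning and 0 otherwise.
predict_price <- function(area, bath, year_built, cooling_central, lot) {
  -2952641 + 99 * area + 36228 * bath + 1466 * year_built +
    83856 * cooling_central + 357119 * lot
}

# The house described in the guided practice.
predicted <- predict_price(
  area = 1803, bath = 2.5, year_built = 1941,
  cooling_central = 1, lot = 0.145
)
round(predicted)

# Residual: observed sale price minus predicted price.
observed <- 804133
residual <- observed - round(predicted)
residual
```

The code reproduces the numbers but cannot tell us whether the prediction is meaningful: nothing in the function checks that the house is in Durham, NC, which is why the modeler must consider whether a new observation resembles the data used to fit the model.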
\clearpage
## Interactive R tutorials {#sec-model-tutorials}
Navigate the concepts you've learned in this part in R using the following self-paced tutorials.
All you need is your browser to get started!
::: {.alltutorials data-latex=""}
[Tutorial 3: Regression modeling](https://openintrostat.github.io/ims-tutorials/03-model/)
::: {.content-hidden unless-format="pdf"}
<https://openintrostat.github.io/ims-tutorials/03-model>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 1: Visualizing two variables](https://openintro.shinyapps.io/ims-03-model-01/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-01>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 2: Correlation](https://openintro.shinyapps.io/ims-03-model-02/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-02>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 3: Simple linear regression](https://openintro.shinyapps.io/ims-03-model-03/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-03>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 4: Interpreting regression models](https://openintro.shinyapps.io/ims-03-model-04/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-04>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 5: Model fit](https://openintro.shinyapps.io/ims-03-model-05/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-05>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 6: Parallel slopes](https://openintro.shinyapps.io/ims-03-model-06/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-06>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 7: Evaluating and extending parallel slopes model](https://openintro.shinyapps.io/ims-03-model-07/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-07>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 8: Multiple regression](https://openintro.shinyapps.io/ims-03-model-08/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-08>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 9: Logistic regression](https://openintro.shinyapps.io/ims-03-model-09/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-09>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 3 - Lesson 10: Case study: Italian restaurants in NYC](https://openintro.shinyapps.io/ims-03-model-10/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-03-model-10>
:::
:::
::: {.content-hidden unless-format="pdf"}
You can also access the full list of tutorials supporting this book at <https://openintrostat.github.io/ims-tutorials>.
:::
::: {.content-visible when-format="html"}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
:::
## R labs {#sec-model-labs}
Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.
::: {.singlelab data-latex=""}
[Introduction to linear regression - Human Freedom Index](https://www.openintro.org/go?id=ims-r-lab-model)
::: {.content-hidden unless-format="pdf"}
<https://www.openintro.org/go?id=ims-r-lab-model>
:::
:::
::: {.content-hidden unless-format="pdf"}
You can also access the full list of labs supporting this book at <https://www.openintro.org/go?id=ims-r-labs>.
:::
::: {.content-visible when-format="html"}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
:::