---
title: "In-Class Lab 4"
author: "ECON 4223 (Prof. Tyler Ransom, U of Oklahoma)"
date: "February 1, 2022"
output:
  html_document:
    df_print: paged
  pdf_document: default
  word_document: default
bibliography: biblio.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = 'hide', fig.keep = 'none')
```
The purpose of this in-class lab is to further practice your regression skills. To get credit, upload your .R script to the appropriate place on Canvas.
## For starters
Open up a new R script (named `ICL4_XYZ.R`, where `XYZ` are your initials) and add the usual "preamble" to the top:
```{r message=FALSE, warning=FALSE, paged.print=FALSE}
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(modelsummary)
```
For this lab, let's use data on house prices. This is located in the `hprice1` data set in the `wooldridge` package. Each observation is a house.
```{r}
df <- as_tibble(hprice1)
```
Check out what's in `df` by typing
```{r}
glimpse(df)
```
Or for some summary statistics:
```{r}
datasummary_skim(df,histogram=FALSE)
```
## Multiple Regression
Let's estimate the following regression model:
\[
price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u
\]
where $price$ is the house price in thousands of dollars.
The code to do so is:
```{r}
est1 <- lm(price ~ sqrft + bdrms, data=df)
modelsummary(est1)
```
You should get a coefficient of `0.128` on `sqrft` and `15.2` on `bdrms`. Interpret these coefficients. (You can type the interpretation as a comment in your .R script.) Do these numbers seem reasonable?
You should get $R^2 = 0.632$. Based on that number, do you think this is a good model of house prices?
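If you'd rather extract $R^2$ programmatically than read it off the `modelsummary` table, the `broom` package loaded in the preamble provides `glance()`, which returns model fit statistics as a one-row tibble:
```{r}
glance(est1)$r.squared
```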
Check that the average of the residuals is zero:
```{r}
mean(est1$residuals)
```
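A zero residual mean is one of several algebraic properties of OLS. The residuals are also uncorrelated with each regressor by construction; as an optional check (not required for the lab):
```{r}
# both correlations should be zero up to numerical precision
cor(est1$residuals, df$sqrft)
cor(est1$residuals, df$bdrms)
```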
## Adding in non-linearities
The previous regression model had an estimated intercept of `-19.3`, meaning that a home with no bedrooms and no square footage would be expected to have a sale price of -$19,300.
To ensure that our model always predicts a positive sale price, let's instead use $log(price)$ as the dependent variable. Let's also add quadratic terms for `sqrft` and `bdrms` to allow those to exhibit diminishing marginal returns.
First, let's use `mutate()` to add these new variables:
```{r}
df <- df %>% mutate(logprice = log(price), sqrftSq = sqrft^2, bdrmSq = bdrms^2)
```
Now run the new model:
```{r}
est2 <- lm(logprice ~ sqrft + sqrftSq + bdrms + bdrmSq, data=df)
modelsummary(est2)
# or, for more decimals:
modelsummary(est2, fmt = 10)
```
The new coefficients have much smaller magnitudes. Explain why that might be.
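One way to think about this is that in the log-quadratic model the effect of `sqrft` is no longer a single number. A sketch of the implied marginal effect at the average house size, in log points (approximate proportional change in price), is:
```{r}
b <- coef(est2)
# d log(price) / d sqrft = beta_sqrft + 2 * beta_sqrftSq * sqrft,
# evaluated at the sample mean of sqrft
b["sqrft"] + 2 * b["sqrftSq"] * mean(df$sqrft)
```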
The new $R^2 = 0.595$ which is less than $0.632$ from before. Does that mean this model is worse?
## Using the Frisch-Waugh Theorem to obtain partial effects
Let's experiment with the Frisch-Waugh Theorem, which says:
\[
\hat{\beta}_{1} = \frac{\sum_{i=1}^{N} \hat{r}_{i1}y_{i}}{\sum_{i=1}^{N} \hat{r}_{i1}^2}
\]
where $\hat{r}_{i1}$ is the residual from a regression of $x_{1}$ on $x_{2},\ldots,x_{k}$.
Let's do this for the model we just ran. First, regress `sqrft` on the other $X$'s and store the residuals as a new column in `df`.
```{r}
est <- lm(sqrft ~ sqrftSq + bdrms + bdrmSq, data=df)
df <- df %>% mutate(sqrft.resid = est$residuals)
```
Now, if we run a simple regression of `logprice` on `sqrft.resid` we should get the same coefficient as that of `sqrft` in the original regression (=`3.74e-4`).
```{r}
est <- lm(logprice ~ sqrft.resid, data=df)
modelsummary(est)
```
## Frisch-Waugh by hand
We can also compute the Frisch-Waugh formula by hand:
```{r}
beta1 <- sum(df$sqrft.resid*df$logprice)/sum(df$sqrft.resid^2)
print(beta1)
```
This indeed gives us the coefficient we expected.
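If you want R to confirm the match instead of eyeballing it, one option is to compare the by-hand value against the coefficient from the full regression:
```{r}
# beta1 from the by-hand formula should equal the sqrft coefficient in est2
all.equal(unname(beta1), unname(coef(est2)["sqrft"]))
```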
# References