-
Notifications
You must be signed in to change notification settings - Fork 1
/
module_3_nb.Rmd
203 lines (144 loc) · 4.65 KB
/
module_3_nb.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
title: 'Module 3: Introductory Analytics'
output:
html_document:
df_print: paged
css: other.css
html_notebook:
highlight: pygments
theme: sandstone
css: other.css
editor_options:
chunk_output_type: inline
---
```{r init, echo=FALSE}
# these options are primary useful to the creation of the html document
knitr::opts_chunk$set(
echo=T,
eval = F,
message = F,
warning = F,
comment = NA,
R.options=list(width=120),
cache.rebuild=F,
cache=F,
fig.align='center',
fig.asp = .7,
dev = 'svg',
dev.args=list(bg = 'transparent')
)
```
```{r catchup}
library(tidyverse)
demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
model_variables = read.csv('data/model_variables_anonymized.csv')
```
## Introduction
## Regression
```{r model1}
model_1 = lm(award_total_amount ~ libuser,
data = model_variables)
coef(model_1)
summary(model_1)
```
Note there are only two fitted values, one for library users and one for the non-users.
```{r fitted}
fitted(model_1)[1:10] # first 10 predictions on current data set
```
Use `modelr` to add them to your data set.
```{r modelr_demo}
library(modelr) # need to install?
model_variables %>%
add_predictions(model_1, var = 'Prediction') %>%
group_by(libuser) %>%
summarise(result = mean(Prediction))
```
The `broom` library also has functionality for many common models.
```{r broom_demo}
library(broom) # need to install?
augment(model_1)
```
Use ggeffects for a quick visual!
```{r plot_regression}
library(ggeffects) # need to install?
ggpredict(model_1) %>%
plot()
```
We can assess performance:
- within a model (e.g. Adjusted $R^2$)
- by comparing models (e.g. anova test)
- with prediction on new data (e.g. RMSE)
```{r reg_model_comparison}
model_2 = lm(award_total_amount ~ libuser + gender,
data = model_variables)
anova(model_1, model_2)
```
Other approaches (all must be done on same data!):
- Adjusted R^2
```{r adjr2}
adjr1 = summary(model_1)$adj
adjr2 = summary(model_2)$adj
round(c(adjr1, adjr2), 2)
```
**AIC** is uninterpretable on its own, but reflects the *likelihood* of the data given the model and complexity. When you have multiple AIC values, the lower value is the better model.
```{r aic}
AIC(model_1, model_2)
```
## Classification
```{r model_logreg}
model_logreg = glm(award_total_amount > 5e6 ~ age,
data = model_variables,
family = binomial)
summary(model_logreg)
```
The caret package has a nice function to summarize classification performance via a variety of metrics.
It needs labeled factors to use though. Often you would have this anyway.
```{r caret}
library(caret) # need to install?
predictions = predict(model_logreg) > 0 # same as probability > .5
predictions = factor(predictions, labels = c('low', 'high'))
target = model_variables$award_total_amount > 5e6
target = factor(target, labels = c('low', 'high'))
confusionMatrix(predictions, target)
```
## Python examples
While Python has statistical modeling capabilities, it's nowhere near the level of R. If you just want to stay in the Python world, you can do basic models without issue. However, it suffers from consistency of method, documentation, stability across updates etc. You might be better off simply calling R with something like `rpy2` or a similar approach.
### Init
```{python py_init, engine.path= '/Users/micl/anaconda3/bin/python'}
# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np
import statsmodels
model_variables = pd.read_csv('data/model_variables_anonymized.csv')
```
### Statsmodels
`Statsmodels` is Python's attempt to do statistical modeling. It essentially follows R's formula style.
```{python statsmodels}
import statsmodels.api as sm
import statsmodels.formula.api as smf
# specify the model, then fit
model_1 = smf.ols('award_total_amount ~ libuser', data=model_variables)
results_1 = model_1.fit()
# Inspect the results
print(results_1.summary())
```
```{python regmodel_2}
model_2 = smf.ols('award_total_amount ~ libuser + gender', data=model_variables)
results_2 = model_2.fit()
# Inspect the results
print(results_2.summary())
```
```{python model_compare}
print(sm.stats.anova_lm(results_1, results_2))
```
### Classification
```{python logistic}
model_variables['award_high'] = 1*(model_variables['award_total_amount'] > 5e6) # 1* makes it a numeric
logit_mod = smf.logit('award_high ~ libuser + gender', data = model_variables)
logit_result = logit_mod.fit()
print(logit_result.summary())
```
```{python confusion_matrix}
logit_result.pred_table() / model_variables.shape[0]
```