-
Notifications
You must be signed in to change notification settings - Fork 1
/
demo.Rmd
180 lines (129 loc) · 6.14 KB
/
demo.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
title: "LLAP Demo"
description: |
How does it all come together?
author:
- name: Michael Clark
url: https://m-clark.github.io
affiliation: CSCAR
affiliation_url: https://cscar.research.umich.edu/
date: "`r Sys.Date()`"
output:
distill::distill_article:
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# install.packages(c("distill", 'huxtable', 'devtools', 'kableExtra', 'ggeffects'))
#
# devtools::install_github('m-clark/visibly')
# devtools::install_github('m-clark/tidyext')
knitr::opts_chunk$set(
echo = F,
eval = T,
message = F,
warning = F,
comment = NA,
R.options = list(width = 120),
cache.rebuild = F,
cache = F,
fig.align = 'center',
fig.asp = .7,
dev = 'svg',
dev.args = list(bg = 'transparent')
)
library(tidyverse); library(broom); library(kableExtra); library(visibly)
kable_df <- function(..., digits=3) {
kable(..., digits=digits) %>%
kable_styling(full_width = F)
}
rnd = function(x, digits = 3) arm::fround(x, digits = digits)
```
```{r init}
library(tidyverse)
library(mgcv)
```
## Introduction
A key idea for 'putting it all together' is *reproducibility*. The more you think about doing something in a way that can be done again and produce the same result, the better you'll be off. A good way to do this is to forgo the standard R script and just start writing.
## Getting Started
Let's model some data. This data is similar to what you've seen.
```{r load_data}
load('demo.RData')
```
Generally we want to describe the data first. I'll use a custom function to create a nice summary table for the variables.
```{r init-describe}
init = tidyext::describe_all(model_variables_demo)
```
First we'll start with a table for the numeric data.
```{r describe-num}
kable_df(init$`Numeric Variables`, digits = 1, caption = 'Numeric Variables')
```
That was easy! Here are some nice tables for categorical variables.
```{r describe-cat}
libcat = init$`Categorical Variables` %>% filter(Variable == 'libuser')
kable_df(libcat %>% select(-Variable), digits = 1, caption = 'Library Users')
```
```{r gendercat}
gendercat = init$`Categorical Variables` %>% filter(Variable == 'gender')
kable_df(gendercat %>% select(-Variable), digits = 1, caption = 'Gender')
```
And I haven't had to spend any time formatting. Plus, if the data changes, I won't take any more effort to reproduce those tables. And I can reference them easily, like for Table \@ref(tab:describe-cat).
Why not a quick visualization? I'll add some color but use a palette that is both colorblind safe and okay if printed in black and white. I'll add a caption via the chunk options.
```{r visage, fig.cap='Age-Award Relationship'}
ggplot(model_variables_demo) +
geom_point(aes(x = age, y = award_total_amount, color = gender), alpha = .15) +
scale_color_viridis_d(begin = .25, end = .75, option = 'C') +
theme_minimal()
```
<aside>Here's a density plot for age. A margin is a good place to put related info that isn't critical to the main content.
```{r aside-demo}
ggplot(model_variables_demo) +
geom_density(aes(x = age))
```
</aside>
Again, no cut and paste, the figures will be reflective of the data no matter what the data is.
Let's put one out to the margin, just because we can. If someone views this doc on a small screen, it'll just slide to the middle.
Let's get on to modeling.
## Modeling
A good approach in modeling is to start with a small but viable model and expanding from there. You should always have more than one idea to test, plus some exploration, because your great ideas might not cover all the ground the data has to offer.
We can start with a standard regression model. It's highly interpretable, but may be too simple for our needs. As before, I can use various tools to create a practically publish-ready table without doing much.
```{r lm}
model_lm = lm(award_total_amount ~ libuser + gender + age,
data = model_variables_demo)
broom::tidy(model_lm) %>%
kable_df(caption = 'Model Results')
```
Or like this. We'll add interactions to compare.
```{r lm2, results='asis'}
model_lm_inter = lm(award_total_amount ~ (libuser + gender)*age,
data = model_variables_demo)
texreg::htmlreg(list(model_lm, model_lm_inter),
single.row = T,
digits = 2,
indentation = ' ',
doctype = F,
caption = 'Model Comparison',
caption.above = T)
```
Now we'll run a model with some more interesting ('wiggly'!) relationships. Interpretation becomes different, but not necessarily more difficult. It will just reflect the complexity of the data more appropriately.
```{r gam}
library(mgcv)
model_gam = gam(award_total_amount ~ libuser + gender + s(age, by = libuser),
data = model_variables_demo)
```
As an example we can visualize the age effect, which was allowed to be curvilinear and interact with library use. We can see that the relationship of age to award is more nuanced for library users.
```{r gam_vis}
library(ggeffects)
gam_preds = ggpredict(model_gam, terms = c('age', 'libuser'))
plot(gam_preds) +
labs(y = 'Award', x = 'Age', title = 'Relationship of age and library use to award amount') +
theme(axis.title.y = element_text(angle = 0, vjust = .90),
title = element_text(size = 8, color = 'gray25'))
```
We can choose the best model via AIC, and I can determine which model is best, and make the text reflect that in an automatic fashion.
```{r bestmod}
bestmod = which.min(sapply(list(model_lm, model_lm_inter, model_gam), AIC))
```
The best model is the `r c('first', 'second', 'third')[bestmod]` one. And even that sentence is reproducible to some extent. A goal would be to have any number, or other text based on data to be code-driven, rather than cut and pasted or entered raw.
## Summary
Data processing will typically take up most of your time. Modeling likewise should be done carefully. Tools are available to make things easier for you, and lots of people make their efforts publicly available for others to use. With a few key packages/modules, a little programming experience, and some luck, you can go very far with your data exploration!