---
title: "Term Project"
output:
  html_document:
    toc: yes
    toc_depth: 2
    toc_float:
      collapsed: yes
---
<style>
h1{font-weight: 400;}
</style>
```{r, message=FALSE, echo=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(moderndive)
library(patchwork)
set.seed(76)
```
Everything in this course builds up to the term group project, where there is
only one learning goal: *Engage in the data/science research pipeline in as
faithful a manner as possible while maintaining a level suitable for novices.*
<center>
![](static/images/pipeline.png){ width=600px }
</center>
***
# 4. Resubmission - Fri 12/21 5pm {#resubmission}
1. Revise your work based on delivered feedback.
1. Complete remaining sections.
1. Complete "Inference for multiple regression" and "Conclusion" sections.
    + While you only need to present the results of one model in this term project, in this section briefly mention why you chose one model over another.
    + Perform a residual analysis.
    + **Added on 12/10**: You do not need to perform any simulations of sampling/bootstrap/null distributions. You only need to interpret the p-value and confidence interval columns of your regression table.
    + **Added on 12/10**: Use R Markdown footnotes for citations. For example, adding `^[Here is an example footnote.]` will add an automatically numbered footnote as seen here^[Here is an example footnote] and here^[Here is another example footnote]. Please use [MLA citation format](https://owl.purdue.edu/owl/research_and_citation/mla_style/mla_style_introduction.html){target="_blank"}.
1. Group leader: Once the resubmission is complete
    + Knit `Term_Project.Rmd` one final time.
    + Republish to Rpubs.com
    + **Added on 12/13** Post `Term_Project.Rmd` and all other necessary files on Moodle.
1. **After your group has resubmitted the project** complete this Google Forms [survey](https://docs.google.com/forms/d/e/1FAIpQLSeEmiXTB6qIszAC5r3gLGmDuLQEYvrgRW-bRMezx6te7_7jpQ/viewform){target="_blank"}. 5% of your term project grade is based on completion of this survey.
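For the inference and residual analysis steps listed above, here is a minimal sketch of what the code might look like. It assumes the Seattle `house_prices` data from the `moderndive` package and uses an interaction model chosen purely for illustration; your own data, variables, and model will differ.

```r
library(ggplot2)
library(moderndive)

# Example data shipped with moderndive: Seattle house prices.
# The log10_ column names follow the tips elsewhere on this page.
house_prices$log10_price <- log10(house_prices$price)
house_prices$log10_size <- log10(house_prices$sqft_living)

# Illustrative model only: swap in your own outcome and explanatory variables
price_model <- lm(log10_price ~ log10_size * condition, data = house_prices)

# Regression table: interpret the p_value, lower_ci, and upper_ci columns
regression_table <- get_regression_table(price_model)
regression_table

# One row per observation, including fitted values and residuals
regression_points <- get_regression_points(price_model)

# Residuals should be centered at 0 and roughly bell-shaped
ggplot(regression_points, aes(x = residual)) +
  geom_histogram(bins = 30) +
  labs(x = "Residual", title = "Histogram of residuals")

# Residuals should show no systematic pattern against the fitted values
ggplot(regression_points, aes(x = log10_price_hat, y = residual)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, col = "blue") +
  labs(x = "Fitted log10 price", y = "Residual", title = "Residuals vs fitted values")
```

If either plot shows strong skew or a clear pattern, say so in your write-up as a limitation of the analysis.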
## Evaluation criteria {#evaluationcriteria}
You will be evaluated on the following criteria, which emphasize not only the data, statistics, and modeling, but also the **communication**, an often neglected criterion:
1. The honor code
    + Your project must adhere to the Smith College Academic Honor Code
      Statement. In particular all external sources must be cited in your
      submissions.
1. The report
    + Is the grammar correct and are there no misspellings? (Click the ABC
      spell-check button to the left of the "Knit" button)
    + Is the writing crisp and concise or is it unnecessarily verbose and wordy? Is the writing clear or is it sloppy?
    + Did you make use of Markdown formatting tools to format the document (bold, italics, url links, etc)?
      See RStudio Menu Bar -> Help -> Markdown Quick Reference for all formatting options.
    + Is the final project document clean and easy to read?
1. The science and the data
    + Is the data's context/source clearly discussed/given? Recall: *Numbers are numbers, but data
      has context.*
    + Are all limitations and issues with the data addressed?
    + Is the research question of interest clearly stated?
    + Are the plots/tables polished? Titles, axes labels, legends?
    + Are the plots/tables truly informative or are they included merely for their own sake?
1. The statistics and the analysis
    + Are all statistical/modeling analyses and interpretations valid?
    + Are limitations of the analysis (if any) clearly stated?
    + Are the non-statistical interpretations accessible to an audience not well-versed in statistics?
1. The code
    + Is your code legible and understandable to someone not in your group? Could someone else look at the code in your `.Rmd` file, understand it, and use it for themselves? In other words, is the research easily [reproducible](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970){target="_blank"}?
    + Is your code cleanly formatted? Are you using indentations, spaces, and line breaks effectively?
    + Some examples of good coding style can be found [here](http://style.tidyverse.org/){target="_blank"}
1. The rest
    + Did you demonstrate effort and engagement during the semester-long process?
## Example past projects {#past_examples}
* [Sweet Home Alabama: Voter Support for Trump and Moore Across Racially-Divided Counties](http://rpubs.com/mbhandari20/374964){target="_blank"}
* [The World of Dark Chocolate](http://rpubs.com/amemily/383723){target="_blank"}
* [The Average Yards of an Above-Average Quarterback: Examining Tom Brady’s average yards and scoring](http://rpubs.com/cmacgillivray19/TB12draft1){target="_blank"}
* [Instagram Followers](http://rpubs.com/dahanjosh/InitialSubmission){target="_blank"}
***
# DONE: 1. Groups - Fri 9/21 5pm {#groups}
**To do:**
1. Form groups of 2-3 students and pick a group name.
1. Designate a group leader who will:
    a) Slack message your group name in a Direct Message that includes
        + All group members
        + Myself
    a) Complete the following [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSehYx8pNGxS6P7KF8y2f-A2RgnhbvKmDW77TijoeOanpV2DHQ/viewform){target="_blank"}
**Notes**:
* If you need a group to join please Slack me.
* All group members are expected to contribute and a system will be put in
  place to hold all group members accountable for their work.
# DONE: 2. Proposal - Fri 10/19 5pm {#proposal}
**To do:**
1. Background reading on data: Read ModernDive Chapters 4.1 - 4.3
1. Find a dataset
1. Submit your group proposal
## Find a dataset
1. Requirements:
    + The data should be stored in a *single* Excel spreadsheet or CSV file. Read [ModernDive 4.3](https://moderndive.netlify.com/4-tidy.html#csv){target="_blank"} on how to import a spreadsheet into R.
    + The data should be in "tidy" data format, which is defined in [ModernDive 4.1](https://moderndive.netlify.com/4-tidy.html#what-is-tidy-data){target="_blank"}. If you need help converting a dataset to tidy format, visit the Spinelli tutoring center (Sunday-Thursday 7-9pm) for help or ask me!
    + Columns/Variables. Your dataset should have the following variables that will be the focus of your analysis. Read ModernDive Chapter 6 to the end of Section 6.1 for what these terms mean.
        1. One numerical variable to be used as your *outcome variable*.
        1. One categorical *explanatory/predictor* variable with no more than 5 levels.
        1. One numerical *explanatory/predictor* variable.
        1. Any *identification variables* (read [ModernDive 4.2.2](https://moderndive.netlify.com/4-tidy.html#identification-vs-measurement){target="_blank"} for a distinction of identification vs measurement variables)
    + Rows/Observations: At least 50 observations.
1. Possible data sources
    1. Consult the Spinelli Quantitative Learning Center [Data Counselor Raul Zelada Aprili](https://www.smith.edu/qlc/tutoring.html?colDataCnslr=open#PanelDataCnslr){target="_blank"}
    1. [Google Dataset Search](https://toolbox.google.com/datasetsearch){target="_blank"}
    1. [data.world](https://data.world/){target="_blank"}
    1. [Kaggle](https://www.kaggle.com/datasets){target="_blank"}
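The column and row requirements above can be checked with a few lines of R before you commit to a dataset. The sketch below is only an illustration: `check_dataset()` is a hypothetical helper written for this example (it is not part of any package), and R's built-in `iris` data stands in for your own CSV file. It uses base R's `read.csv()`; ModernDive 4.3 covers other ways to import a spreadsheet.

```r
# Hypothetical helper (not from any package): checks a data frame against
# the dataset requirements and returns a named logical vector.
check_dataset <- function(df, outcome, categorical, numerical) {
  c(
    outcome_is_numerical = is.numeric(df[[outcome]]),
    categorical_has_at_most_5_levels =
      length(unique(na.omit(df[[categorical]]))) <= 5,
    explanatory_is_numerical = is.numeric(df[[numerical]]),
    at_least_50_rows = nrow(df) >= 50
  )
}

# Stand-in for your own file: write the built-in iris data to a temporary
# CSV file, then read it back in the way you would read your own data.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)
flowers <- read.csv(csv_path)

# Name the three variables that will be the focus of the analysis
checks <- check_dataset(
  flowers,
  outcome = "Sepal.Length",
  categorical = "Species",
  numerical = "Petal.Length"
)
print(checks)
```

Any `FALSE` entry points to the requirement that your dataset does not yet meet.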
## Submission Format
Follow the <a href="static/term_project/project_proposal.R" download>`project_proposal.R`</a> template file and submit this on [Moodle](https://moodle.smith.edu/course/view.php?id=30498){target="_blank"} by Friday 10/19 5pm. In this template file, I've included an example based on the **exploratory data analysis** of the Seattle House Prices data in ModernDive [Chapter 12.1.1](https://moderndive.netlify.com/12-thinking-with-data.html#seattle-house-prices){target="_blank"}.
## Where is this heading?
For Phase 3 of the project, "Initial submission", due Friday 11/9, you'll be making a figure like Figure 12.6 in ModernDive Chapter 12:
```{r, echo=FALSE, eval=TRUE}
library(tidyverse)
library(moderndive)
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )
ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") +
  geom_smooth(method = "lm", se = FALSE)
```
<!--
## Setup
* Login to RStudio Server -> Top right -> Click on "Project" -> "Shared with me" -> Your group name. This should open a new [*RStudio Server Shared Project*](https://support.rstudio.com/hc/en-us/articles/211659737-Sharing-Projects-in-RStudio-Server-Pro) that all group members have joint access to where you can perform collaborative editing like Google Docs. Albert and your section's TA will have access as well.
* Group leader: Create a new `scratchpad.R`
* Set up collaborative editing: Click on the colored initials of your teammates on the top right of the screen and click "Follow cursor". Play around with this.
* Group leader: Upload the <a href="static/term_project/Term_Project_Proposal.Rmd"
download>`Term_Project_Proposal.Rmd`</a> R Markdown template file to the Shared Project so that all group members have access.
* To return to your personal folder with your problem sets: RStudio Server -> Top right -> Click on "Project" -> "Close project"
## Submit your group proposal
Once your proposal is ready, the group leader will:
1. Knit `Term_Project_Proposal.Rmd` one final time.
1. Publish to Rpubs.com by clicking the blue "Publish" button on top right of the HTML document. Copy the URL of the resulting webpage.
1. Complete this [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSf_MFKFv65DviSyk7EYuPfPAqq_ZI3nHrXw_LuUZLia8KJtgQ/viewform){target="_blank"}. No need to submit any files on DropBox, as the TAs and I can log in to your Shared Projects and look at your work there.
-->
***
# DONE: 3. Initial submission - Fri 11/9 9pm
**To do**:
1. **Changed on 11/2** Due time changed from 5pm to 6pm.
1. Download the <a href="static/term_project/Term_Project.Rmd"
download>`Term_Project.Rmd`</a> R Markdown template file. Recall the [past examples](#past_examples) you saw previously.
1. Read the [evaluation criteria](#evaluationcriteria) and then complete the following sections:
    1. Introduction
    2. EDA
    3. Multiple regression
    6. Citations. Be sure to replace the Rpubs link with a link to a published Rpubs webpage of your term project.
1. Group leader: Submit this on [Moodle](https://moodle.smith.edu/course/view.php?id=30498){target="_blank"}.
## Tips {#initialtips}
### 1. log10 transformations
If you have skewed explanatory and/or outcome variables, `log10()`-transform them and use the transformed variables in your regression, rather than only displaying them with transformed axes. See below:
```{r, message=FALSE, warning=FALSE, eval=FALSE, echo=TRUE, fig.height=8/2}
library(ggplot2)
library(dplyr)
library(moderndive)
# log10() transform the skewed variables
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )
# Plot price with re-scaled axes:
ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  scale_x_log10() +
  labs(x = "House price (log10-scale)", title = "Seattle house prices")
# Plot log10-transformed price with regular axes:
ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10(House price)", title = "Seattle house prices")
```
```{r, message=FALSE, warning=FALSE, eval=TRUE, echo=FALSE, fig.height=8/2, cache=TRUE}
# log10() transformations
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )
p1 <- ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  scale_x_log10() +
  labs(x = "House price (log10-scale)", title = "Seattle house prices (log10-scale)")
p2 <- ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10(House price)", title = "Seattle house prices (log10 transformed)")
p1 + p2
```
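Because the transformed variables go into the regression itself, fitted values and predictions come out on the log10 scale and need to be converted back before you report them in dollars. A minimal sketch, assuming a simple regression of `log10_price` on `log10_size` and a hypothetical 2000 square foot house chosen only for illustration:

```r
library(moderndive)

# log10() transform the skewed variables, as in the tip above
house_prices$log10_price <- log10(house_prices$price)
house_prices$log10_size <- log10(house_prices$sqft_living)

# Regress the transformed outcome on the transformed explanatory variable
log_model <- lm(log10_price ~ log10_size, data = house_prices)

# Hypothetical new observation: a house with 2000 square feet of living space
new_house <- data.frame(log10_size = log10(2000))

# The prediction is in log10 dollars, not dollars
predicted_log10_price <- predict(log_model, newdata = new_house)

# Undo the log10() transformation to get back to dollars
predicted_price <- 10^predicted_log10_price
predicted_price
```

The same back-transformation applies when you interpret any fitted value from a model with a `log10()`-transformed outcome.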
### 2. Model selection
Which model should I use? Parallel slopes or interaction model?
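This page does not prescribe a single rule for making this choice. One possible approach, sketched below on the Seattle `house_prices` data from `moderndive`, is to fit both models, look at whether the per-group slopes visibly differ, and check whether the extra slope terms of the interaction model buy a noticeably better fit. Comparing adjusted R-squared is an assumption of this sketch, not a course requirement.

```r
library(ggplot2)
library(moderndive)

house_prices$log10_price <- log10(house_prices$price)
house_prices$log10_size <- log10(house_prices$sqft_living)

# Interaction model: each condition level gets its own intercept and slope
interaction_model <- lm(log10_price ~ log10_size * condition, data = house_prices)

# Parallel slopes model: each level gets its own intercept, but one shared slope
parallel_model <- lm(log10_price ~ log10_size + condition, data = house_prices)

# Visual check: geom_smooth() fits a separate line per condition level,
# which corresponds to the interaction model. Do the slopes clearly differ?
ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "log10 size", y = "log10 price", title = "One fitted line per condition")

# Numerical check: adjusted R-squared penalizes the extra slope terms
adjusted_r_squared <- c(
  interaction = summary(interaction_model)$adj.r.squared,
  parallel_slopes = summary(parallel_model)$adj.r.squared
)
adjusted_r_squared
```

If the lines look close to parallel and the two values are nearly identical, the simpler parallel slopes model is a defensible choice. Whichever you pick, state your reasoning briefly in the report.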
### 3. Useful tips for R projects
Jenny wrote up a document of useful tips for R projects for another class. Give it a quick scan!
* [HTML document](static/project_tips/project_tips.html){target="_blank"}
* <a href="static/project_tips/project_tips.Rmd" download>`project_tips.Rmd`</a> R Markdown source code
Become an R Ninja!
<center>
![](static/project_tips/data_ninja1.png){ width=200px }
</center>
<!--
TODO: Some of the groups don’t know whether or not to include interaction
Come up with consistent approach for how to do this
What to include
-->
<!--
## 3.a) Feedback from previous projects.
1. Jarring hyperlink/code output:
+ In EDA, show a preview of first 6 rows of data using `head()`; this is better than using `glimpse()` as computer code output is jarring to read.
+ Correlations code output is ok, since they don’t take up a lot of space
+ Raw hyperlinks in the body of text or citation section should be converted to text links. See RStudio Menu Bar -> Help -> Markdown Quick Reference -> Links.
1. EDA
+ Make sure you are explicit about what your observational unit is. In other words:
a) what each row in your dataset represents
a) what each point in your scatterplots represents
+ Minimize redundancy: Many had a colored scatterplot in Section 2 EDA, then the same plot with fitted lines in Section 3. Put only the latter in Section 2.
+ Main visualization: toy around with using facets. Pick what you think is best and own it.
1. Formatting. Once finished writing your project:
+ Remove any instructions text. Ex: "Discuss the research question you will be addressing with your multiple regression model."
+ Fix typos using the ABC-check button in R Markdown panel (next to magnifying glass button).
1. Statistical comments
+ Look at histogram of your numerical variables, in particular the outcome variable. Are they really right skewed? Is a log base 10 (or log base e) transformation worth considering (like in DataCamp or the Seattle House prices example in ModernDive Chapter 12)?
+ In all your conclusions, watch out for making statements that too strongly suggest causation (unless you are sure the data was collected in a way that allows for this). Recall that we should set the tone more conservatively and suggest only associations.
1. Non-statistical interpretation:
+ Meant for a non-technical audience
+ Think of this as the overall executive summary, the take-home message meant for a broad audience.
1. Interpretations. Say your categorical explanatory variable has $k$ levels:
+ Instead of interpreting all intercept/slope coefficients individually, write out fitted regression line equations for $\widehat{y}$ for all $k$ possible levels. Example: if using x = `gender` from evals, write out one equation for male and another for female.
+ ~~Then of your $k$ fitted regression line equations, interpret just the slope coefficient for 2-3 of the resulting equations~~ **Clarifications**: Then of all the rows in your regression table, for 2-3 of them interpret the coefficient estimate and its inferential significance. Try to pick a slope or interaction effect coefficient, since these speak to the relationship between your outcome and explanatory variable. Use your judgement as to which to choose (most interesting, most relevant).
-->