term_project.Rmd

---
title: "Term Project"
output:
  html_document:
    toc: yes
    toc_depth: 2
    toc_float:
      collapsed: yes
---

<style>
h1{font-weight: 400;}
</style>

```{r, message=FALSE, echo=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(moderndive)
library(patchwork)
set.seed(76)
```

Everything in this course builds up to the term group project, where there is
only one learning goal: *Engage in the data/science research pipeline in as
faithful a manner as possible while maintaining a level suitable for novices.*

<center>
![](static/images/pipeline.png){ width=600px }
</center>


***


# 4. Resubmission - Fri 12/21 5pm {#resubmission}

1. Revise your work based on delivered feedback. 
1. Complete remaining sections.
1. Complete "Inference for multiple regression" and "Conclusion" sections.
    + While you only need to present the results of one model in this term project, in this section make a brief mention why you chose one model over another.
    + Perform a residual analysis.
    + **Added on 12/10**: You do not need to perform any simulations of sampling/bootstrap/null distributions. You only need interpret the p-value and confidence interval columns of your regression table. 
    + **Added on 12/10**: Use R Markdown footnotes for citations. For example, adding `^[Here is an example footnote.]` will add an automatically numbered footnote as seen here^[Here is an example footnote] and here^[Here is another example footnote]. Please use [MLA citation format](https://owl.purdue.edu/owl/research_and_citation/mla_style/mla_style_introduction.html){target="_blank"}.
1. Group leader: Once the resubmission is complete
    + Knit `Term_Project.Rmd` one final time.
    + Republish to Rpubs.com
    + **Added on 12/13** Post `Term_Project.Rmd` and all other necessary files on Moddle.
1. **After your group has resubmitted the project** complete this Google Forms [survey](https://docs.google.com/forms/d/e/1FAIpQLSeEmiXTB6qIszAC5r3gLGmDuLQEYvrgRW-bRMezx6te7_7jpQ/viewform){target="_blank"}. 5% of your term project grade is based on completion of this survey. 


## Evaluation criteria {#evaluationcriteria}

You will be evaluated on the following criteria, which not only emphasizes the data, statistics, and modeling, but also the **communication**, an often neglected criteria:

1. The honor code
    + Your project must adhere to the Smith College Academic Honor Code
    Statement. In particular all external sources must be cited in your
    submissions.
1. The report
    + Is the grammar correct and are there no misspellings? (Click the ABC
    spell-check button to the left of the "Knit" button)
    + Is the writing crisp and concise or is it unnecessarily verbose and wordy? Is the writing clear or is it sloppy?
    + Did you make use of Markdown formatting tools to format the document (bold, italics, url links, etc)?  
    See RStudio Menu Bar -> Help -> Markdown Quick Reference for all formatting options.
    + Is the final project document clean and easy to read?
1. The science and the data
    + Is the data's context/source clearly discussed/given? Recall: *Numbers are numbers, but data
    has context.*
    + Are all limitations and issues with the data addressed?
    + Is the research question of interest clearly stated?   
    + Are the plots/tables polished? Titles, axes labels, legends?
    + Are the plots/tables truly informative or are they included merely for their own sake?
1. The statistics and the analysis
    + Are all statistical/modeling and analyses interpretations valid?
    + Are limitations of the analysis (if any) clearly stated?
    + Are the non-statistical interpretations accessible to an audience not well-versed in statistics?
1. The code
    + Is your code legible and understandable to someone not in your group? Could someone else look at the code in your `.Rmd` file, understand it, and use it for themselves? In other words, is the research easily [reproducible](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970){target="_blank"}?
    + Is your code cleanly formatted? Are you using indentations, spaces, and line breaks effectively?
    + Some examples of good coding style can be found [here](http://style.tidyverse.org/){target="_blank"}
1. The rest
    + Did you demonstrate effort and engagement during the semester long process?


## Example past projects {#past_examples}

* [Sweet Home Alabama: Voter Support for Trump and Moore Across Racially-Divided Counties](http://rpubs.com/mbhandari20/374964){target="_blank"}
* [The World of Dark Chocolate](http://rpubs.com/amemily/383723){target="_blank"}
* [The Average Yards of an Above-Average Quarterback: Examining Tom Brady’s average yards and scoring](http://rpubs.com/cmacgillivray19/TB12draft1){target="_blank"}
* [Instagram Followers](http://rpubs.com/dahanjosh/InitialSubmission){target="_blank"}


***


# DONE: 1. Groups - Fri 9/21 5pm {#groups}

**To do:**

1. Form groups of 2-3 students and pick a group name.
1. Designate a group leader who will:
    a) Slack message your group name in a Direct Message that includes
        + All group members
        + Myself
    a) Complete the following [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSehYx8pNGxS6P7KF8y2f-A2RgnhbvKmDW77TijoeOanpV2DHQ/viewform){target="_blank"}


**Notes**:

* If you need a group to join please Slack me.
* All groups members are expected to contribute and a system will be put in
place to hold all group members accountable for their work.

# DONE: 2. Proposal - Fri 10/19 5pm {#proposal}

**To do:**

1. Background reading on data: Read ModernDive Chapters 4.1 - 4.3
1. Find a dataset
1. Submit your group proposal


## Find a dataset

1. Requirements:
    + The data should be stored in a *single* Excel spreadsheet or CSV file. Read [ModernDive 4.3](https://moderndive.netlify.com/4-tidy.html#csv){target="_blank"} on how to import a spreadsheet into R.
    + The data should be in "tidy" data format, which is defined in [ModernDive 4.1](https://moderndive.netlify.com/4-tidy.html#what-is-tidy-data){target="_blank"}. If you need help converting a dataset to tidy format, visit the Spinelli tutoring (Sunday-Thursday 7-9pm) center for help or ask me!
    + Columns/Variables. You dataset should have the following variables that will be the focus of your analysis. Read ModernDive Chapter 6 to the end of Section 6.1 for what these terms mean.
        1. One numerical variable to be used as your *outcome variable*. 
        1. One categorical *explanatory/predictor* variable with no more than 5 levels.
        1. One numerical *explanatory/predictor* variable.
        1. Any *identification variables* (read [ModernDive 4.2.2](https://moderndive.netlify.com/4-tidy.html#identification-vs-measurement){target="_blank"} for a distinction of identification vs measurement variables)
    + Rows/Observations: At least 50 observations.
1. Possible data sources
    1. Consult the Spinelli Quantitative Learning Center [Data Counselor Raul Zelada     Aprili](https://www.smith.edu/qlc/tutoring.html?colDataCnslr=open#PanelDataCnslr){target="_blank"}
    1. [Google Dataset Search](https://toolbox.google.com/datasetsearch){target="_blank"}
    1. [data.world](https://data.world/){target="_blank"}
    1. [Kaggle](https://www.kaggle.com/datasets){target="_blank"}


## Submission Format

Follow the <a href="static/term_project/project_proposal.R" download>`project_proposal.R`</a> template file and submit this on [Moodle](https://moodle.smith.edu/course/view.php?id=30498){target="_blank"} by Friday 10/19 5pm. In this template file, I've included an example based on the **exploratory data analysis** of the Seattle House Prices data in ModernDive [Chapter 12.1.1](https://moderndive.netlify.com/12-thinking-with-data.html#seattle-house-prices){target="_blank"}.


## Where is this heading?

For the Phase 3 of the project "Initial Submission", due Friday 11/9, you'll be making a figure like Figure 12.6 in ModernDive Chapter 12

```{r, echo=FALSE, eval=TRUE}
library(tidyverse)
library(moderndive)
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
    )
ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") +
  geom_smooth(method = "lm", se = FALSE)
```


<!-- 


## Setup

* Login to RStudio Server -> Top right -> Click on "Project" -> "Shared with me" -> Your group name. This should open a new [*RStudio Server Shared Project*](https://support.rstudio.com/hc/en-us/articles/211659737-Sharing-Projects-in-RStudio-Server-Pro) that all group members have joint access to where you can perform collaborative editing like Google Docs. Albert and your section's TA will have access as well.
* Group leader: Create a new `scratchpad.R`
* Setup colloborative editing: Click on the colored initials of your teammates on the top right of the screen and click "Follow cursor". Play around with this.
* Group leader: Upload the <a href="static/term_project/Term_Project_Proposal.Rmd"
download>`Term_Project_Proposal.Rmd`</a> R Markdown template file file to the Shared Project so that all group members have access. 
* To return to your personal folder with your problem sets: RStudio Server -> Top right -> Click on "Project" -> "Close project"


## Submit your group proposal

Once your proposal is ready, the group leader will:

1. Knit `Term_Project_Proposal.Rmd` one final time.
1. Publish to Rpubs.com by clicking the blue "Publish" button on top right of the HTML document. Copy the URL of the resulting webpage.
1. Complete this [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSf_MFKFv65DviSyk7EYuPfPAqq_ZI3nHrXw_LuUZLia8KJtgQ/viewform){target="_blank"}. No need to submit any files on DropBox, as the TA's and I can login into your Shared Projects and look at your work there.

-->


***


# DONE: 3. Initial submission - Fri 11/9 9pm

**To do**:

1. **Changed on 11/2** Due time changed from 5pm to 6pm. 
1. Download the <a href="static/term_project/Term_Project.Rmd"
download>`Term_Project.Rmd`</a> R Markdown template file. Recall the [past examples](#past_examples) you saw previously.
1. Read the evaluation criteria [below](#evaluationcriteria) and then complete the following sections:
    1. Introduction
    2. EDA
    3. Multiple regression
    6. Citations. Be sure to replace the Rpubs link with a link to a published Rpubs webpage of your term project.
1. Group leader: Submit this on [Moodle](https://moodle.smith.edu/course/view.php?id=30498){target="_blank"}.


## Tips {#initialtips}

### 1. log10 transformations

If you have skewed explanatory and/or outcome variables, you should be `log10()`-transforming them and using the transformed variables in your regression and not just visually displaying them with transformed axes. See below:

```{r, message=FALSE, warning=FALSE, eval=FALSE, echo=TRUE, fig.height=8/2}
library(ggplot2)
library(dplyr)
library(moderndive)

# log10() transform the skewed variables
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )

# Plot price with re-scaled axes:
ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  scale_x_log10() + 
  labs(x = "House price (log10-scale)", title = "Seattle house prices") 

# Plot log10-transformed price with regular axes:
ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10(House price)", title = "Seattle house prices") 
```

```{r, message=FALSE, warning=FALSE, eval=TRUE, echo=FALSE, fig.height=8/2, cache=TRUE}
# log10() transformations
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )

p1 <- ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  scale_x_log10() + 
  labs(x = "House price (log10-scale)", title = "Seattle house prices (log10-scale)") 

p2 <- ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10(House price)", title = "Seattle house prices (log10 transformed)") 

p1 + p2
```


### 2. Model selection

Which model should I use? Parallel slopes or interaction model? 


### 3. Useful tips for R projects

Jenny wrote up a document of useful tips for R projects for another class. Give it a quick scan for lots of useful tips!

* [HTML document](static/project_tips/project_tips.html){target="_blank"}
* <a href="static/project_tips/project_tips.Rmd" download>`project_tips.Rmd`</a> R Markdown source code

Become an R Ninja!

<center>
![](static/project_tips/data_ninja1.png){ width=200px }
</center>

<!--
TODO: Some of the groups don’t know whether or not to include interaction
Come up with consistent approach for how to do this
What to include
-->


<!--
## 3.a) Feedback from previous projects.


1. Jarring hyperlink/code output:
    + In EDA, show a preview of first 6 rows of data using `head()`; this is better than using `glimpse()` as computer code output is jarring to read.
    + Correlations code output is ok, since they don’t take up a lot of space
    + Raw hyperlinks in the body of text or citation section should be converted to text links. See RStudio Menu Bar -> Help -> Markdown Quick Reference -> Links.
1. EDA
    + Make sure you are explicit about what your observational unit is. In other words:
        a) what each row in your dataset represents
        a) what each point in your scatterplots represent
    + Minimize redundancy: Many had a colored scatterplot in Section 2 EDA, then the same plot with fitted lines in Section 3. Put only the latter in Section 2.
    + Main visualization: toy around with using facets. Pick what you think is best and own it.
1. Formatting. Once finished writing your project:
    + Remove any instructions text. Ex: "Discuss the research question you will be addressing with your multiple regression model."
    + Fix typos using the ABC-check button in R Markdown panel (next to magnifying glass buttom).
1. Statistical comments
    + Look at histogram of your numerical variables, in particular the outcome variable. Are they really right skewed? Is a log base 10 (or log base e) transformation worth considering (like in DataCamp or the Seattle House prices example in ModernDive Chapter 12)
    + In all your conclusions, watch out for making statements that too strongly suggest causation (unless you are sure the data was collected in a way that allows for this). Recall that we should set tone to only more conservatively suggest associations. 
1. Non-statistical interpretation:
    + Meant for a non-technical audience
    + Think of this as the overall executive summary, the take-home message meant for a broad audience.
1. Interpretations. Say your categorical explanatory variable has $k$ levels:
    + Instead of intrepreting all intercept/slope coefficients individually, write out fitted regression line equations for $\widehat{y}$ for all $k$ possible levels. Example: if using x = `gender` from evals, write out one equation for male and another for female.
    + ~~Then of your $k$ fitted regression line equations, interpret just the slope coefficient for 2-3 of the resulting equations~~ **Clarifications**: Then of all the rows in your regression table, for 2-3 of them interpret the coefficient estimate and its inferential significance. Try to pick a slope or interaction effect coefficient, since these speak to the relationship between your outcome and explanatory variable. Use your judgement as to which to choose (most interesting, most relevant).

-->