---
title: "Module 3 <br><br> Introductory Analytics"
author: ""
# date: "`r Sys.Date()`"
output: 
  xaringan::moon_reader:
    lib_dir: libs
    css: [xaringan-themer.css, other.css]
    nature:
      highlightStyle: tomorrow
      highlightLines: true
      countIncrementalSlides: false
---

<style type="text/css">
.remark-slide-content {
    font-size: 30px;
    padding: 1em 4em 1em 4em;
}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo=T, 
  eval = F,
  message = F, 
  warning = F, 
  comment = NA,
  R.options=list(width=120), 
  cache.rebuild=F, 
  cache=T,
  fig.align='center', 
  fig.asp = .7,
  dev = 'svg', 
  dev.args=list(bg = 'transparent')
)

library(tidyverse); library(broom); library(kableExtra); library(visibly)

kable_df <- function(..., digits=3) {
  kable(..., digits=digits) %>% 
    kable_styling(full_width = F)
}

rnd = function(x, digits = 3) arm::fround(x, digits = digits)

demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
model_variables = read.csv('data/model_variables_anonymized.csv')
```


## Introduction

Regression
- Basic model for continuous (or approximately continuous) target variables


Classification
- Models for categorical outcomes

---


## Regression Model

Basic model: Our output/target/response $y$ is a function of predictors/covariates/features $x$

<br>

$$y = b_0 + b_1 \cdot x_1 + b_2 \cdot x_2 + \ldots + b_p \cdot x_p + e$$
<br>
Our expected value $\hat{y}$ is based on the model estimates $\hat{b}$
<br>
$$\hat{y} = \hat{b}_0 + \hat{b}_1 \cdot x_1 + \hat{b}_2 \cdot x_2 + \ldots + \hat{b}_p \cdot x_p$$
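
---

## Regression Model

As a rough sketch of how $\hat{y}$ follows from the estimates, the code below fits a model to simulated data and rebuilds the fitted values term by term. The data, the names `sim_data` and `sim_fit`, and the true coefficients are made up for illustration; they are not the course data.

```{r yhat_by_hand}
# Simulated data (hypothetical, for illustration only)
set.seed(1)
sim_data = data.frame(x1 = rnorm(50), x2 = rnorm(50))
sim_data$y = 1 + 2 * sim_data$x1 - 0.5 * sim_data$x2 + rnorm(50)

sim_fit = lm(y ~ x1 + x2, data = sim_data)
b_hat = coef(sim_fit)  # named vector: (Intercept), x1, x2

# y-hat computed term by term from the estimates
y_hat = b_hat['(Intercept)'] +
  b_hat['x1'] * sim_data$x1 +
  b_hat['x2'] * sim_data$x2

# Compare with the model's own fitted values
all.equal(unname(y_hat), unname(fitted(sim_fit)))
```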

---

## Interpretation

The coefficients/weights $b$ tell us something about the nature of the relationship of some $x$ to $y$

In a standard linear regression model, they tell us how much $y$ is expected to change when $x$ increases by 1, holding the other predictors constant
- for categorical variables, this is the difference between a given level and the reference level

$e$ represents the things we don't understand that still influence $y$ (residual error)
- We try to minimize the error in prediction
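
---

## Interpretation

A minimal sketch of the categorical case, assuming a made-up two-level predictor `group` (hypothetical, not a course variable): the coefficient for a level is the difference between that level's mean and the reference level's mean.

```{r coef_interpretation}
# Simulated two-group data (hypothetical, for illustration only)
set.seed(2)
sim_groups = data.frame(group = rep(c('no', 'yes'), each = 50))
sim_groups$y = ifelse(sim_groups$group == 'yes', 12, 10) + rnorm(100)

# 'no' is the reference level (first alphabetically)
sim_group_fit = lm(y ~ group, data = sim_groups)
coef(sim_group_fit)  # (Intercept) = mean of 'no'; groupyes = 'yes' minus 'no'

# The same difference computed directly from the group means
group_means = tapply(sim_groups$y, sim_groups$group, mean)
group_means['yes'] - group_means['no']
```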

---

## A Simple Linear Regression

```{r simple_lm, echo=FALSE, eval=TRUE}
set.seed(123)
x = runif(100)
y = 2 + .3*x + rnorm(100, sd=.1)
# coef(lm(y~x))

ggplot(data.frame(x,y), aes(x,y)) +
  geom_point() +
  geom_smooth(method = 'lm', se = F) +
  annotate(geom ='text', x= .75, y = 1.9, label=  "paste(\"yhat = 2 + .3*x   \", italic(R) ^ 2, \" = .43\")", parse=T) 
```

---

## Example

The basic R formula goes `y ~ x + z`

So if we want to see the award amount predicted by whether someone is a library user or not, we'd do the following.


```{r model1}
model_1 = lm(award_total_amount ~ libuser, 
             data = model_variables)
```

---

## Example

With the model object we use functions like:
- <span class="func">summary</span>: summarize the model
- <span class="func">predict</span>: predict with new observations
- <span class="func">coef</span>: extract coefficients
- <span class="func">fitted</span>: examine predictions on current data
- <span class="func">residuals</span>: examine errors on current data
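
---

## Example

The functions listed above can be tried on any fitted model. The sketch below uses simulated data and the made-up names `sim_df` and `sim_model`, so it runs without the course files.

```{r model_functions}
# Simulated data (hypothetical, for illustration only)
set.seed(3)
sim_df = data.frame(x = runif(100))
sim_df$y = 2 + 0.3 * sim_df$x + rnorm(100, sd = 0.1)
sim_model = lm(y ~ x, data = sim_df)

summary(sim_model)          # coefficient table, R-squared, etc.
coef(sim_model)             # estimated intercept and slope
head(fitted(sim_model))     # predictions for the data used to fit
head(residuals(sim_model))  # observed minus fitted

# Predictions for new values of x
sim_new = data.frame(x = c(0.25, 0.75))
sim_pred = predict(sim_model, newdata = sim_new)
sim_pred
```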

---

## Example

```{r model1_repeat, eval=TRUE}
model_1 = lm(award_total_amount ~ libuser, 
             data = model_variables)
summary(model_1)
```


---


## Example

<div style='font-size:60%'>

The following shows that library users have a higher expected award amount than non-users (an association, not necessarily a causal effect).

The standard error (in parentheses) gives us some idea of the uncertainty in the coefficient.

The p-value suggests this is a statistically significant difference.
<br>

</div>

<span style="height: 150%"></span>

```{r model1_show, echo=FALSE, eval=TRUE, results='asis'}
model_1 = lm(award_total_amount ~ libuser, 
             data = model_variables)
texreg::htmlreg(model_1, 
                digits = 2, 
                doctype = F, 
                caption = '', 
                single.row = T, center = T)
```


---

## Performance

We can assess performance:
- within a model (e.g. Adjusted $R^2$)
- by comparing models (e.g. anova test)
- with prediction on new data (e.g. RMSE)

```{r reg_model_comparison}
model_2 = lm(award_total_amount ~ libuser + gender, 
             data = model_variables)

anova(model_1, model_2)
```
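
---

## Performance

A sketch of the three kinds of assessment on simulated data. The 150/50 train/test split, the helper `rmse`, and all `sim_` names are assumptions for illustration only.

```{r performance_sketch}
# Simulated data (hypothetical, for illustration only)
set.seed(4)
sim_perf = data.frame(x = rnorm(200), z = rnorm(200))
sim_perf$y = 1 + 0.5 * sim_perf$x + 0.25 * sim_perf$z + rnorm(200)

sim_train = sim_perf[1:150, ]
sim_test  = sim_perf[151:200, ]

fit_small = lm(y ~ x, data = sim_train)
fit_large = lm(y ~ x + z, data = sim_train)

# Within a model: adjusted R-squared
summary(fit_large)$adj.r.squared

# Comparing nested models: F test
sim_anova = anova(fit_small, fit_large)
sim_anova

# New data: root mean squared error on the held-out rows
rmse = function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata = newdata))^2))
}
rmse(fit_small, sim_test)
rmse(fit_large, sim_test)
```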


---

## Performance

The following tells us that adding gender significantly improves the model

- Adding gender significantly reduced the sum of the squared errors (residual)

```{r reg_model_comparison2, echo=FALSE, eval=TRUE}
model_2 = lm(award_total_amount ~ libuser + gender, 
             data = model_variables)

anova(model_1, model_2)
```


---

## Classification

Everything so far assumes $y$ is normally distributed around the model's prediction

$$y \sim  \mathcal{N}(X\beta, \sigma^2)$$

In many cases, this is not the best choice.

<span class="emph">Logistic regression</span> is the most common alternative (of many)
- Binary outcome
- Allows us to classify observations into groups

---

## Interpretation

The conceptual model is the same:
- $y$ is still a function of the covariates $x$
- except $y$ is now a binary indicator of class membership
- the distribution assumed is binomial
- the variance is predetermined by the binomial distribution

Raw coefficients are interpreted the same way...

... but they describe change on the log odds (logit) scale, not in the observed 0-1 outcome

The log odds form an underlying continuum between the two classes, which is then converted to the probability of being in one of the classes
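
---

## Interpretation

Coefficients from a logistic regression with the default logit link are on the log odds scale. The sketch below, on simulated data with made-up names (`sim_logit`, `sim_logit_fit`), shows one way to move between log odds, odds ratios, and probabilities.

```{r logit_scale}
# Simulated binary outcome (hypothetical, for illustration only)
set.seed(5)
sim_logit = data.frame(x = rnorm(500))
sim_logit$y = rbinom(500, size = 1, prob = plogis(-1 + 0.8 * sim_logit$x))

sim_logit_fit = glm(y ~ x, data = sim_logit, family = binomial)

coef(sim_logit_fit)       # change in log odds per unit of x
exp(coef(sim_logit_fit))  # the same, expressed as odds ratios

# Log odds at x = 0 and x = 1, converted to probabilities
sim_x = data.frame(x = c(0, 1))
log_odds = predict(sim_logit_fit, newdata = sim_x, type = 'link')
plogis(log_odds)

# predict() can do the conversion itself
sim_prob = predict(sim_logit_fit, newdata = sim_x, type = 'response')
sim_prob
```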

---

## Example

```{r model_logreg}
model_logreg = glm(award_total_amount > 5e6 ~ age, 
                   data = model_variables,
                   family = binomial)
```


```{r model_logreg_show, echo=FALSE, eval=TRUE, results='asis'}
model_logreg = glm(award_total_amount_high ~ age, 
                   data = model_variables %>% mutate(award_total_amount_high = factor(award_total_amount > 5e6, labels = c('low', 'high'))),
                   family = binomial)

texreg::htmlreg(model_logreg, 
                digits = 2, 
                doctype = F, 
                caption = '', 
                single.row = T, center = T)
```

---

## Interpretation

```{r logreg_plot, echo=FALSE, eval=TRUE}
init = ggeffects::ggpredict(model_logreg, terms='age')

ggplot(init, aes(x, predicted)) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) + 
  labs(y = 'Probability \naward is "high"', x = 'age') + 
  theme(axis.title.y = element_text(angle = 0, size=10, hjust = 0))
```


---

## Performance

With classification we are often interested in accuracy.

If we classify any observation with a probability > .5 as being 'high', how well do we match our target variable?

```{r accuracy}
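# Classify as 'high' (TRUE) when the predicted probability exceeds .5
model_prediction = predict(model_logreg, type = 'response') > .5

# Observed outcome for the rows used in fitting (rows with missing values are dropped)
target = model_logreg$y == 1

mean(model_prediction == target)     # accuracy
max(mean(target), 1 - mean(target))  # baseline: always guess the most common class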
```


---

## Performance

There are many other metrics besides accuracy

Depending on the situation you may prefer others

- <span class="emph">Sensitivity</span> (true positive rate)
- <span class="emph">Specificity</span> (true negative rate)
- <span class="emph">Positive Predictive Value</span>
- <span class="emph">Negative Predictive Value</span>
- <span class="emph">Area Under the ROC Curve</span> (AUC)
- Many others
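
---

## Performance

Several of these metrics come straight from the counts in a confusion matrix. A small sketch with made-up observed and predicted classes (ten hypothetical cases, not the course data):

```{r classification_metrics}
# Hypothetical observed and predicted classes (TRUE = positive)
sim_observed  = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
sim_predicted = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

tp = sum(sim_predicted & sim_observed)    # true positives
fp = sum(sim_predicted & !sim_observed)   # false positives
tn = sum(!sim_predicted & !sim_observed)  # true negatives
fn = sum(!sim_predicted & sim_observed)   # false negatives

sim_metrics = c(
  accuracy    = (tp + tn) / (tp + fp + tn + fn),
  sensitivity = tp / (tp + fn),  # true positive rate
  specificity = tn / (tn + fp),  # true negative rate
  ppv         = tp / (tp + fp),  # positive predictive value
  npv         = tn / (tn + fn)   # negative predictive value
)
sim_metrics
```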


---


## Other common techniques

Including interactions
- variable effects may depend on other predictors

Generalized Linear Models
- include the regression models we've seen
- Count models
- Other distributions

Beyond the GLM family of distributions
- Beta regression for data between 0 and 1
- Zero-inflated/hurdle models
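
---

## Other common techniques

A brief sketch of two of these, an interaction and a count model, using base R on simulated data. All names (`sim_other`, `fit_interaction`, `fit_poisson`) and the true coefficients are made up for illustration.

```{r other_models_sketch}
# Simulated data (hypothetical, for illustration only)
set.seed(6)
sim_other = data.frame(x = rnorm(300), z = rbinom(300, size = 1, prob = 0.5))

# Interaction: the true slope of x is 0.5 when z = 0 and 1.5 when z = 1
sim_other$y = 1 + 0.5 * sim_other$x + 0.3 * sim_other$z +
  1 * sim_other$x * sim_other$z + rnorm(300)
fit_interaction = lm(y ~ x * z, data = sim_other)  # x * z expands to x + z + x:z
coef(fit_interaction)

# Count model: Poisson regression with a log link
sim_other$count = rpois(300, lambda = exp(0.2 + 0.4 * sim_other$x))
fit_poisson = glm(count ~ x, data = sim_other, family = poisson)
exp(coef(fit_poisson))  # multiplicative change in the expected count per unit of x
```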

---

## Other common techniques

Clustered and longitudinal data
- Mixed models

Spatial

Time Series

Survival/Event history/Time to event

Dimension reduction & Measurement models
- PCA
- Cluster analysis
- Factor analysis/SEM


---


## Tips & Tricks

Garbage in, garbage out
- Data must be worthwhile
- Spend the necessary time there

Start simply, then build up complexity
- Don't start with your final model
- Modeling is similar to debugging

Better programming means more reproducible work
- Better data science
- More efficient data exploration

---

## Tips & Tricks

Don't fish: p-values are not that important

Create lists and apply functions to them
- Allows for more rapid exploration
- Speed through parallelization if needed

Rely heavily on visualization 
- Helps with your own interpretation
- Engages the audience more than a table of numbers
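
---

## Tips & Tricks

One way to apply the "lists and functions" tip is to keep candidate formulas in a list and fit them all at once. A sketch with base R on simulated data; the names `sim_many`, `sim_formulas`, and `sim_models` are hypothetical.

```{r model_list}
# Simulated data (hypothetical, for illustration only)
set.seed(7)
sim_many = data.frame(x = rnorm(100), z = rnorm(100))
sim_many$y = 1 + 0.5 * sim_many$x + rnorm(100)

# A named list of candidate model formulas
sim_formulas = list(
  x_only = y ~ x,
  z_only = y ~ z,
  both   = y ~ x + z
)

# Fit every model, then pull the same summary from each
sim_models = lapply(sim_formulas, lm, data = sim_many)
sim_adj_r2 = sapply(sim_models, function(m) summary(m)$adj.r.squared)
sim_adj_r2
sapply(sim_models, AIC)
```

`purrr::map()` would be the tidyverse equivalent of `lapply()` here.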