workshop.Rmd

---
title: "R Workshop"
author: "Cole Beck, Shawn Garbett"
date: "5/09/2022"
output:
  html_document: default
  rmdformats::readthedown:
    thumbnails: false
    lightbox: true
    gallery: true
    highlight: tango
    use_bookdown: true
    toc_depth: 3
    fig_caption: true
---

# Foundations

*Does everyone have a working laptop?*

This course is designed to bring one up to speed on the tools necessary to follow all the examples in Frank Harrell's ``Regression Modeling Strategies''. It assumes a rudimentary knowledge of R or at least having some experience with coding.

If there are any questions or areas you'd like covered in more detail please do not hesitate to speak up.

RStudio has multiple panes. One can customize what's open at any point. In general the four quadrants are:

* Upper Left: Open Files
* Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
* Lower Left: Console. The interactive command to the running R interpreter.
* Lower Right: Directory browser, Plots, Packages, Help. 

If one accidentally closes one, it's easily opened again from the `View` menu. 

Note Menu items have underlines for certain characters. These are shortcuts for `Alt-`*letter* that one can access quickly from the keyboard.

## Creating scripts

R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font. 

From RStudio, select `File -> New File` and there is a large number of options. The important ones for this class are:

* R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click `source` in the upper right of the file and it will execute in the console below.
* R Notebook. An older version that allows for a preview of the document. The preview is rebuilt on *every* save.
* R Markdown. Same as R Notebook, but requires a manual `Knit` to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.

A common question is "Which file type should I choose?" It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. For writing code that will import data or perform data manipulation, I prefer R scripts. For teaching with interactive examples, I prefer using R notebook. I like to use R markdown when creating output that will be shared (either as HTML or PDF).

## Creating projects

A project brings together several scripts as a project. One can create a project using `File -> New Project` and select an existing directory or create a blank one as desired. Existing projects can be opened with `File -> Open Project`.

## Session & Working Directory

There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document's environment. The `Run -> Restart R and Run All Chunks` restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments. 

It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command `sessionInfo()` provides a nice overview of this that is good for checking package versions loaded.

```{r sessioninfo}
sessionInfo()
```

The current working directory is available via `getwd()` and settable via `setwd("directory")`. All file loads and saves are relative to the current working directory.

## Environment

R maintains "environment frames" which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.

```{r environment}
x <- 3
f <- function()
{ # One can view the parenthesis as capturing a frame
  x <- 4   # Local variable inner frame
  cat("This is the inner 'x'", x, "\n") 
}
f()
cat("This is the outer 'x'", x, "\n")
```

## Packages

Install the packages `Hmisc, rms, knitr, rmarkdown`, and `plotly` through RStudio.

Packages contain R code and may often require the installation of other packages.  When I install `Hmisc`, I see that its dependency `acepack` is automatically installed:

```
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb

trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb

* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
```

### Install and updating via CRAN

Install through CRAN repositories

```{r, eval=FALSE}
# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown', 'plotly'))
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)
```

### Install and update via Github

Github is a code sharing platform. Packages are typically in a *rawer* state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the `remotes` package.

```{r, eval=FALSE}
install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')
```

### Loading Packages

Note that `library` is a bit of misnomer as R uses packages, not libraries.  From a technical standpoint, it's nice to recognize the distinction.  You may see `require` used in its place.  If the package is not installed, `library` will trigger an error while `require` will return FALSE.  Once loaded the functions created in the package are available to your R session.

```{r, eval=FALSE}
library(Hmisc)
require(Hmisc)
```

## Help

From the R console one can get help about any function by appending it with a `?`. 

```{r, eval=FALSE}
?lm
```

The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.

## Rmarkdown

Earlier versions of this course focused on LaTeX document generation. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Given that the focus of this course is on Regression Modeling strategies and not document generation, the current focus is on using Rmarkdown and showing some interactive plots via HTML. Full support and possibilities are shown at [https://rmarkdown.rstudio.com/](https://rmarkdown.rstudio.com/). 

The cheat sheet is available at [https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf), or through Rstudio's menu bar: `Help -> Cheetsheets -> R Markdown Cheet Sheet`.

[Quarto][quarto] is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative, though it must be installed in addition to R and RStudio.

## Best Practices

A blog post [RMarkdown Driven Development][reiderer] by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.

One takeaway is to avoid the general pitfall of putting large amounts of data processing *inside* ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts *outside* the report. The report should load the results of this process.

Another great reference is Dr. Harrell's [analysis project workflow][rflow]. His content mirrors much of this workshop and will reinforce its material. It can also be used as a template for quarto as its source code is available (by clicking the "</> Code" link).

## Caching

Chunks in Rmarkdown can be cached. I.e., the results of the last computation are saved and it's not rerun until a manual override by clicking on that section is done. It is recommended that one *not* do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes. 

A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It's very similar to caching, but is not prone to later confusion as it's a bit more direct that the report is loading the processed data file. In short do large data processing tasks *outside* of reports.

## Harrell's RMS

See chapters 1 and 9 of [Biostatistics for Biomedical Research][bbr].

## Setup

Most Rmarkdown documents have a setup section that loads required libraries and sets up some options. The following snippet is the basic setup for this course. It loads the `Hmisc` package. Sets `knitr` to use markdown by default. Sets the graphics option to use `plotly`. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable `mu`.

```{r setup,results='hide'}
require(Hmisc)
knitrSet(lang='markdown', h=4.5, fig.path='png/', fig.align='left')
# knitrSet redirects all messages to messages.txt
options(grType='plotly') # for certain graphics functions
mu <- markupSpecs$html   # markupSpecs is in Hmisc
```

## Data

The `getHdata` function is used to fetch a dataset from the Vanderbilt Biostatistics [DataSets][datasets] web site.  `upData` is used to

- create a new variable from an old one
- add labels to 2 variables
- add units to the new variable
- remove the old variable
- automatically move units of measurements from parenthetical expressions in labels to separate `units` attributes used by `Hmisc` and `rms` functions for table making and graphics

`contents` is used to print a data dictionary, run through an `html` method for nicer output.

```{r metadata,results='asis'}
getHdata(pbc)
pbc <- upData(pbc,
              fu.yrs = fu.days / 365.25,
              labels = c(fu.yrs = 'Follow-up Time',
                         status = 'Death or Liver Transplantation'),
              units = c(fu.yrs = 'year'),
              drop  = 'fu.days',
              moveUnits=TRUE, html=TRUE)
html(contents(pbc), maxlevels=10, levelType='table')
```

## Exercise 1

Do exercise 1. Pull the example exercise.Rmd from github as your starting template.

## Descriptive Statistics Without Stratification

```{r describe,results='asis'}
# did have results='asis' above
d <- describe(pbc)
html(d, size=80, scroll=TRUE)

p <- plot(d)
# "p" has two named elements that we can access with "$"
p$Categorical # or, p[['Categorical']]
p$Continuous
```

## Stratified Descriptive Statistics

Produce stratified quantiles, means/SD, and proportions by treatment group.  Plot the results before rendering as an advanced HTML table:

- categorical variables: a single dot chart
- continuous variables: a series of extended box plots

```{r summaryM}
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders +
              alk.phos + sgot + chol ~ drug, data=pbc,
							overall=FALSE, test=TRUE)
```

# Plotting

```{r summaryM4plot}
plot(s, which='categorical')
plot(s, which='continuous', vars=1:4)
plot(s, which='continuous', vars=5:7)
```

```{r summaryMhtml, results='asis'}
html(s, caption='Baseline characteristics by randomized treatment',
     exclude1=TRUE, npct='both', digits=3,
     prmsd=TRUE, brmsd=TRUE, msdsize=mu$smaller2)
```

## Spike Histogram

Show data for continuous variable stratified by treatment.

```{r histbox,results='asis'}
p <- with(pbc, histboxp(x=sgot, group=drug, sd=TRUE))
p
```

## Data Visualization

The very low birthweight data set contains data on 671 infants born with a birth weight of under 1600 grams.  We'll plot gestational age by birthweight using three graphics systems: base graphics, ggplot, and plotly.

```{r}
getHdata(vlbw)
# remove missing values
vlbw <- vlbw[complete.cases(vlbw[,c('sex','dead','gest','bwt')]),]
```

## Base Graphics Example

Empty canvas - add each element into your plot.

```{r, fig.width = 6}
# "grps" is a list with each combination of sex/dead
grps <- split(vlbw[,c('gest','bwt')], vlbw[,c('sex','dead')])
# default "plot" function; set up parameters but with no lines/points because type 'n'
plot(c(22,40), c(400,1600), type='n', xlab='Gestational Age', ylab='Birth Weight (grams)', axes=FALSE)
# axis 1 is below plot
axis(1, at=c(22,28,34,40), labels=c(22,28,34,40))
# axis 2 is left of plot
axis(2, at=seq(400,1600,by=400), labels=seq(400,1600,by=400))
## four different methods to add points to plot area
# data.frame with two columns: 1st column is associated with x variable
points(grps[['female.0']], col='black', pch=1)
# two arguments: first vector is associated with x variable
points(grps[['female.1']][,'gest'], grps[['female.1']][,'bwt'], col='gray', pch=0)
# like above, but jitter adds noise so points have less overlap
points(jitter(grps[['male.0']][,'gest'], 2), grps[['male.0']][,'bwt'], col='black', pch=2)
# LHS ~ RHS: with tilde, the left hand side is associated with y variable
points(grps[['male.1']][,'bwt'] ~ jitter(grps[['male.1']][,'gest'], 2), col='gray', pch=3)
legend(x=37, y=1000, legend=c('F:alive','F:dead','M:alive','M:dead'), col=c('black','gray','black','gray'), pch=c(1,0,2,3))
```

## ggplot2 Example

Given a data set, choose the aesthetic mapping and geometry layer.

```{r}
# `as.factor(dead)` makes categorical variable rather than continuous
p <- ggplot(data=vlbw) + aes(x=gest, y=bwt, color=sex, shape=as.factor(dead)) + geom_point()
p
```

```{r}
p <- p + geom_jitter() + scale_x_continuous() + scale_y_continuous()
p
```

## Plotly

Add interactive graphics, which is trivial when also using `ggplot`.

```{r}
require(plotly)
ggplotly(p)
```

```{r}
# plotly uses pipe operator (%>%) from magrittr package
# in this context it works much like "+" does with ggplot
plot_ly(type="box") %>%
  add_boxplot(y = grps[['female.0']][,'bwt'], name='F:0') %>%
  add_boxplot(y = grps[['female.1']][,'bwt'], name='F:1') %>%
  add_boxplot(y = grps[['male.0']][,'bwt'], name='M:0') %>%
  add_boxplot(y = grps[['male.1']][,'bwt'], name='M:1')
```

# Models 

## Cardiovascular risk factor data

```{r}
getHdata(diabetes)
html(contents(diabetes), levelType='table')
```

## Formulas

The tilde is used to create a model formula, which consists of a left-hand side and right-hand side.  Many R functions utilize formulas, such as plotting functions and model-fitting functions.  The left-hand side consists of the response variable, while the right-hand side may contain several terms.  You may see the following operators within a formula.

* y ~ a + b, `+` indicates to include both a and b as terms
* y ~ a:b, `:` indicates the interaction of a and b
* y ~ a*b, equivalent to y ~ a+b+a:b
* y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
* y ~ a + I(b+c), `I` indicates to use `+` in the arithmetic sense

See ?formula for more examples.

## Model Fitting

The most simple model-fitting function is `lm`, which is used to fit linear models.  It's primary argument is a formula.  Using the diabetes data set, we can fit waist size by weight.

```{r}
m <- lm(waist ~ weight, data=diabetes)
```

This creates an `lm` object, and several functions can be used on model objects.  The internal structure of a model object is a list - its elements may be accessed just like a list.

* coef
* fitted
* predict
* residuals
* vcov
* summary
* anova

```{r}
names(m)
# can access named elements with "$"
m$coefficients
coef(m)
head(fitted(m))
# defaut "predict" gives same results
# sum(abs(fitted(m) - predict(m)) > 1e-8)
predict(m, data.frame(weight=c(150, 200, 250)))
head(residuals(m))
vcov(m) # variance-covariance matrix
# "summary" produces additional information about model
summary(m)
coef(summary(m))
anova(m)
```

## Visualization

Create a scatterplot of weight and waist size.

```{r}
# `qplot` can create default ggplot output
p <- qplot(weight, waist, data=diabetes)
p + geom_smooth(method="lm")
# the above uses the implied formula y~x
# p + geom_smooth(method="lm", formula = y ~ x)
```

Remove missing values and fit with LOESS curve.

```{r}
diab <- diabetes[complete.cases(diabetes[,c('waist','weight')]),]
p <- qplot(weight, waist, data=diab)
p + geom_smooth(method="loess")
```

## Data Manipulation

The Dominican Republic Hypertension dataset contains data on gender, age, and systolic and diastolic blood pressure from several villages in the DR.

```{r}
getHdata(DominicanHTN)
html(describe(DominicanHTN))
```

## Adding variables or transformations

The data.frame bracket syntax is data.frame[row_selection, column_selection].

```{r}
# create "mean arterial pressure"
DominicanHTN[,'map'] <- (DominicanHTN[,'sbp'] + DominicanHTN[,'dbp'] * 2) / 3
# use `as.numeric` to convert TRUE/FALSE into 1/0
DominicanHTN[,'male'] <- as.numeric(DominicanHTN[,'gender'] == 'Male')
DominicanHTN[1:5,]
```

In this example, I collapse a variable into smaller bins - which should generally be avoided.

```{r}
qage <- quantile(DominicanHTN[,'age'])
qage[1] <- qage[1]-1
qage
DominicanHTN[,'ageGrp'] <- cut(DominicanHTN[,'age'], breaks=qage)
# could easily specify other breakpoints
# levels(cut(DominicanHTN[,'age'], breaks = c(0, 18, 100)))
# levels(cut(DominicanHTN[,'age'], breaks = seq(10, 100, by = 10)))
```

## Filter

```{r}
nrow(DominicanHTN)
nrow(DominicanHTN[DominicanHTN[,'gender'] == "Male",])
which(DominicanHTN[,'village'] %in% c('Carmona','San Antonio'))
cojMen <- DominicanHTN[DominicanHTN[,'village'] == 'Cojobal' & DominicanHTN[,'gender'] == "Male",]
cojMen
```

## Sort

```{r}
order(cojMen[,'age'], decreasing=TRUE) # first value is element number with highest age
# use row numbers to re-order a data.frame
cojMen[order(cojMen[,'age'], decreasing=TRUE),]
```

## Aggregate and counts

```{r}
table(DominicanHTN[,'gender'])
```

```{r}
addmargins(with(DominicanHTN, table(village, gender)))
```

```{r}
with(DominicanHTN, tapply(age, village, mean))
```

```{r}
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
```

```{r}
aggregate(age ~ village + gender, DominicanHTN, median)
# easy to change statistical measure, or function
# aggregate(age ~ village + gender, DominicanHTN, max)
```

## Exercise 2 & 3

Do exercise 2 & 3.

## Rosner's lead data

The `lead` exposure dataset was collected to study the well-being of children who lived near a lead smelting plant.  The following analysis considers the lead exposure levels in 1972 and 1973, as well as age and the finger-wrist tapping score `maxfwt`.

```{r contents,results='asis'}
getHdata(lead)
lead <- upData(lead,
       keep = c('ld72','ld73','age','maxfwt'),
       labels = c(age = 'Age'),
       units = c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))
html(contents(lead), maxlevels=10, levelType='table')
```

```{r}
html(describe(lead))
```

# Introduction to RMS (Ordinary least squares regression)

The `rms` package includes the `ols` fitting function.  Use `datadist` to compute summaries of the distributional characteristics of the predictors - or simply give it an entire data frame.  `datadist` must be re-run if you add a new predictor or recode an old one.

Note: The `ols` function is *not* covered in the main course. Generalized least squares is. Ordinary least squares does not handle data that has correlation between residuals. It is covered here to give a simple overview of the structure of models in RMS. 

```{r}
require(rms)
dd <- datadist(lead)
options(datadist='dd')
f <- ols(maxfwt ~ age + ld72 + ld73, data=lead)
f
```

`rms` provides many methods to work with the `ols` fit object.

* coef
* fitted
* predict
* resid
* anova
* summary
* Predict
* ggplot
* Function
* nomogram

## Coefficient Estimates

```{r}
coef(f)
```

## Define R function representing fitted model

The default values for function arguments are medians.

```{r}
g <- Function(f)
g
g(age=10, ld72=21, ld73=c(21, 47))
```

## Predicted values with standard errors

```{r}
predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)
```

## Residuals

```{r}
r <- resid(f)
par(mfrow=c(2,2))
plot(fitted(f), r); abline(h=0)
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
# q-q plot to check normality
qqnorm(r)
qqline(r)
```

## Plotting partial effects

`Predict` and `plotp` make one plot for each predictor.  Predictors not shown in plot are set to constants (continuous:median, categorical:mode).  0.95 pointwise confidence limits for $\hat{E}(y|x)$ are shown; use `conf.int=FALSE` to suppress CLs.

```{r}
plotp(Predict(f))
```

Specify which predictors are plotted.

```{r}
plotp(Predict(f, age))
```

```{r}
plotp(Predict(f, age=3:15))
```

```{r}
plotp(Predict(f, age=seq(3,16,length=150)))
```

Obtain confidence limits for $\hat{y}$.

```{r}
plotp(Predict(f, age, conf.type='individual'))
```

Combine plots.

```{r}
p1 <- Predict(f, age, conf.int=0.99, conf.type='individual')
p2 <- Predict(f, age, conf.int=0.99, conf.type='mean')
p <- rbind(Individual=p1, Mean=p2)
ggplot(p)
```

```{r}
plotp(Predict(f, ld73, age=3))
```

```{r}
ggplot(Predict(f, ld73, age=c(3,9)))
```

3-d surface for two continuous predictors against $\hat{y}$.

```{r, fig.width=6}
bplot(Predict(f, ld72, ld73))
```

## Nomograms

```{r, fig.height=6, fig.width=6}
plot(nomogram(f))
```

## Point Estimates for Partial Effects

```{r}
summary(f)
```

Adjust `age` to 5, which has no effect as the model does not include any interactions.

```{r}
summary(f, age=5)
```

Effect of changing `ld73` from 20 to 40.

```{r}
summary(f, ld73=c(20,40))
```

If a predictor has a linear effect, a one-unit change can be used to get the confidence interval of its slope.

```{r}
summary(f, age=5:6)
```

There is also a `plot` method for `summary` results.

```{r}
plot(summary(f))
```

## Getting Predicted Values

Give `predict` a fit object and `data.frame`.

```{r}
predict(f, data.frame(age=3, ld72=21, ld73=21))
```

```{r}
predict(f, data.frame(age=c(3,10), ld72=21, ld73=c(21,47)))
```

```{r}
newdat <- expand.grid(age=c(4,8), ld72=c(21, 47), ld73=c(21, 47))
newdat
```

```{r}
predict(f, newdat)
```

Include confidence levels.

```{r}
predict(f, newdat, conf.int=0.95)
predict(f, newdat, conf.int=0.95, conf.type='individual')
```

Brute-force predicted values.

```{r}
b <- coef(f)
b[1] + b[2]*3 + b[3]*21 + b[4]*21
```

Predicted values with `Function`.

```{r}
# g <- Function(f)
g(age = c(3,8), ld72 = 21, ld73 = 21)
g(age = 3)
```

## ANOVA

Use `anova` to get all total effects and individual partial effects.

```{r}
an <- anova(f)
an
```

Include corresponding variable names.

```{r}
print(an, 'names')
print(an, 'subscripts')
print(an, 'dots')
```

Combined partial effects.

```{r}
anova(f, ld72, ld73)
```

## Exercise 4

Time for exercise 4.

# Cleaning and Importing Data

## Text files

* read.table - work with data in table format
* read.csv - CSV files
* read.delim - tab-delimited files
* scan - more general function for reading in data

## Binary representation of R objects

Compressed data and loads faster.

* RData
* `feather` package

## Database connections

Allows data queries.

* `RODBC` package
* `RSQLite` package

## Other statistical packages

* `Hmisc` package: `sas.get`, `sasxport.get`
* `haven` package: `read_sas`, `read_sav`, `read_dta`
* `foreign` package
* Stat/Transfer - not free nor open source

## Data Cleaning

An R `data.frame` consists of a collection of values.  In a well-structured data set, each value is associated with a variable (column) and observation (row).  Data sets often need to be manipulated to become well-structured.

Hadley Wickham's [Tidy Data][tidy] provides an excellent overview on how to clean messy data sets.

## Column header contains values

```{r}
# politics and religion
senateReligion <- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
                          Democrats=c(20,15,8,1,1,3),
                          Republicans=c(38,9,0,5,0,0))
senateReligion
```

```{r}
senRel <- cbind(senateReligion[,'religion'], stack(senateReligion[,c('Democrats','Republicans')]))
names(senRel) <- c('religion','count','party')
senRel
```

## Multiple variables in a column

```{r}
pets <- data.frame(county=c('Davidson','Rutherford','Cannon'),
                   male.dog=c(50,150,70), female.dog=c(60,150,70),
                   male.cat=c(30,100,50), female.cat=c(30,70,40),
                   male.horse=c(6,30,20), female.horse=c(6,28,19))
pets
```

```{r}
pets2 <- cbind(pets[,'county'], stack(pets[,-1]))
names(pets2) <- c('county','count','ind')
pets2
```

Watch out for factor variables.

```{r}
tryCatch(strsplit(pets2[,'ind'], '\\.'), error=function(e){ e })
```

First `for` loop, allows iterating over a sequence.

```{r}
sexAnimal <- strsplit(as.character(pets2[,'ind']), '\\.')
pets2[,c('sex','animal')] <- NA
for(i in seq(nrow(pets2))) {
  pets2[i, c('sex','animal')] <- sexAnimal[[i]]
}
pets2[,'ind'] <- NULL
pets2
```

## Variables in a row

```{r}
set.seed(1000)
d1 <- rnorm(1000, 5, 2)
d2 <- runif(1000, 0, 10)
d3 <- rpois(1000, 5)
d4 <- rbinom(1000, 10, 0.5)
x <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
           stat=rep(c('mean','sd'),4),
           value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x
```

```{r}
cbind(distribution=x[c(1,3,5,7),'distribution'], unstack(x, form=value~stat))
```

## Normalization and merging

```{r}
employees <- data.frame(id=1:4, Name=c('Eddie','Andrea','Steve','Theresa'),
                        job=c('engineer','accountant','statistician','technician'))
set.seed(11)
hours <- data.frame(id=sample(4, 10, replace=TRUE),
                    week=1:10, hours=sample(30:60, 10, replace=TRUE))
merge(employees, hours)
# merge(employees, hours, by.x='id', by.y='id')
```

```{r}
empHours <- merge(employees, hours, all.x=TRUE)
empHours
```

```{r}
unique(empHours[,c('id','Name','job')])
empHours[,c('id','week','hours')]
```

Exclude rows with missing data.

```{r}
empHours[!is.na(empHours[,'week']), c('id','week','hours')]
```

## Combining data

```{r}
rbind(sexAnimal[[1]], sexAnimal[[2]], sexAnimal[[3]])
```

```{r}
do.call(rbind, sexAnimal)
```

## Exercise 5

Time for exercise 5.

# Simulation

Plot mean while sampling from normal distribution

```{r}
n <- 100
imean <- numeric(n)
smean <- numeric(n)
for(i in seq(n)) {
  imean[i] <- mean(rnorm(10))
  smean[i] <- mean(imean[seq(i)])
}
plot(imean, ylab='Sample Mean')
abline(h=0, lty=3)
lines(smean)
```

```{r}
n <- 1000
sig <- numeric(n)
for(i in seq(n)) {
  grp1 <- rnorm(15, 60, 5)
  grp2 <- rnorm(15, 65, 5)
  sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)
```

Add some flexibility to our simulation by creating a function.

```{r}
tTestPower <- function(n=15, mu1=60, mu2=65, sd=5, nsim=1000) {
  sig <- numeric(nsim)
  for(i in seq(nsim)) {
    grp1 <- rnorm(n, mu1, sd)
    grp2 <- rnorm(n, mu2, sd)
    sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
  }
  mean(sig)
}
tTestPower()
tTestPower(25, nsim=10000)
```

There's already a function to compute the power of a two-sample t test.

```{r}
power.t.test(n=25, delta=5, sd=5)$power
```

# Simulation and Bootstrapping

Simulation is a wonderful tool for really understanding just how well a method performs. Sometimes the modeling and techniques are complex enough that analytical and numerical properties are difficult to ascertain. Simulation can help answer that question by doing a procedure on random data repeatedly and examining how well the technique performs.

Bootstrap sampling is a related area that takes multiple random samples from a data set and performs estimates of statistical properties. It is robust in that it makes few assumptions about the structure of the data.

## Bootstrap sample

```{r}
true_mu <- 0
x <- rnorm(100, true_mu)
R <- 999
res <- matrix(nrow=R, ncol=6)
colnames(res) <- c('mu','se','lb','ub','coverage','bias')
for(i in seq(R)) {
  r <- sample(x, replace=TRUE)
  res[i,'mu'] <- mean(r)
  res[i,'se'] <- sd(r) / sqrt(length(r))
}
res[,'lb'] <- res[,'mu'] + qnorm(0.025) * res[,'se']
res[,'ub'] <- res[,'mu'] + qnorm(0.975) * res[,'se']
res[,'coverage'] <- res[,'lb'] < true_mu & res[,'ub'] > true_mu
res[,'bias'] <- res[,'mu'] - true_mu
```

```{r}
res[1:5,]
vals <- colMeans(res)
basicCI <- 2 * mean(x) - quantile(res[,'mu'], probs=c(0.975, 0.025))
out <- sprintf("empirical SE: %.3f
MSE: %.3f
basic bootstrap confidence interval: (%.3f, %.3f)
coverage probability: %.3f
mean bias: %.3f
sample mean: %.3f
", sd(res[,'mu']), vals['se'], basicCI[1], basicCI[2], vals['coverage'], vals['bias'], mean(x))
cat(out)
```

## Control Structures

Control structures are used to change if and when lines of code in a program are run.  The control comes from conditions being true or false.

## Root finding with the Newton-Raphson algorithm

The `for` loop has already been introduced.  Two other important control structures are `while` loops and `if` statements.

```{r}
f <- function(x) x^3 + x^2 - 3*x - 3
fp <- function(x) 3*x^2 + 2*x - 3
plot(f, xlim=c(-4,4))
```

```{r}
x <- -2
i <- 1
while(abs(f(x)) > 1e-8) {
  x <- x - f(x) / fp(x)
  i <- i + 1
  if(i > 20) {
    print("does not converge")
    break
  }
}
x
f(x)
```

## Exercise 6 & 7

Final exercises, 6 & 7.

# Worth Mentioning

## tidyverse

Wickham's "Tidy Data" philosophy was concurrent with the development of a suite of R packages useful for data management.  This includes `ggplot2` and `haven` as well as many others such as `dplyr`, `stringr`, and `tidyr`.

## data.table

`data.table` is another beneficial package for data manipulation.  It provides similar functionality to the `data.frame`, but works well with large data sets. It minimizes memory operations. It is best to develop ones code using either data.frame or data.table from the beginning as they are not exactly equivalent and surprises can happen upon dropping in one for the other.

# Computing Environment[^1] {#compenv}
`r mu$session()`

[^1]: `mu` is a copy of the part of the `Hmisc` package object `markupSpecs` that is for html.  It includes a function `session` that renders the session environment (including package versions) in html.

[quarto]: https://quarto.org/

[reiderer]: https://t.co/ZBk1OrTXYp

[rflow]: https://www.fharrell.com/post/rflow

[datasets]: https://hbiostat.org/data/ "DataSets"

[tidy]: http://vita.had.co.nz/papers/tidy-data.pdf "Tidy Data"

[bbr]: http://hbiostat.org/doc/bbr.pdf "Biostatistics for Biomedical Research"