-
Notifications
You must be signed in to change notification settings - Fork 5
/
workshop.Rmd
1027 lines (756 loc) · 31 KB
/
workshop.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "R Workshop"
author: "Cole Beck, Shawn Garbett"
date: "5/09/2022"
output:
html_document: default
rmdformats::readthedown:
thumbnails: false
lightbox: true
gallery: true
highlight: tango
use_bookdown: true
toc_depth: 3
fig_caption: true
---
# Foundations
*Does everyone have a working laptop?*
This course is designed to bring one up to speed on the tools necessary to follow all the examples in Frank Harrell's ``Regression Modeling Strategies''. It assumes a rudimentary knowledge of R or at least having some experience with coding.
If there are any questions or areas you'd like covered in more detail please do not hesitate to speak up.
RStudio has multiple panes. One can customize what's open at any point. In general the four quadrants are:
* Upper Left: Open Files
* Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
* Lower Left: Console. The interactive command to the running R interpreter.
* Lower Right: Directory browser, Plots, Packages, Help.
If one accidentally closes one, it's easily opened again from the `View` menu.
Note Menu items have underlines for certain characters. These are shortcuts for `Alt-`*letter* that one can access quickly from the keyboard.
## Creating scripts
R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.
From RStudio, select `File -> New File` and there is a large number of options. The important ones for this class are:
* R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click `source` in the upper right of the file and it will execute in the console below.
* R Notebook. An older version that allows for a preview of the document. The preview is rebuilt on *every* save.
* R Markdown. Same as R Notebook, but requires a manual `Knit` to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.
A common question is "Which file type should I choose?" It never hurts to start with a basic R script, as once the code is written it can be sourced or even copied and pasted into one of the other documents. For writing code that will import data or perform data manipulation, I prefer R scripts. For teaching with interactive examples, I prefer using R notebook. I like to use R markdown when creating output that will be shared (either as HTML or PDF).
## Creating projects
A project brings together several scripts as a project. One can create a project using `File -> New Project` and select an existing directory or create a blank one as desired. Existing projects can be opened with `File -> Open Project`.
## Session & Working Directory
There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document's environment. The `Run -> Restart R and Run All Chunks` restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.
It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command `sessionInfo()` provides a nice overview of this that is good for checking package versions loaded.
```{r sessioninfo}
sessionInfo()
```
The current working directory is available via `getwd()` and settable via `setwd("directory")`. All file loads and saves are relative to the current working directory.
## Environment
R maintains "environment frames" which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.
```{r environment}
x <- 3
f <- function()
{ # One can view the parenthesis as capturing a frame
x <- 4 # Local variable inner frame
cat("This is the inner 'x'", x, "\n")
}
f()
cat("This is the outer 'x'", x, "\n")
```
## Packages
Install the packages `Hmisc, rms, knitr, rmarkdown`, and `plotly` through RStudio.
Packages contain R code and may often require the installation of other packages. When I install `Hmisc`, I see that its dependency `acepack` is automatically installed:
```
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb
* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
```
### Install and updating via CRAN
Install through CRAN repositories
```{r, eval=FALSE}
# Having code to install packages insider your Rmarkdown is bad practice. Don't do this.
# This code is marked `eval=FALSE` in the Rmarkdown chunk so it is not evaluated.
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown', 'plotly'))
# update all packages
update.packages(checkBuilt = TRUE, ask = FALSE)
```
### Install and update via Github
Github is a code sharing platform. Packages are typically in a *rawer* state, but it can be important to use sometimes to get the latest bugfix at the risk of introducing a different bug. Installing through Github repositories is done using the `remotes` package.
```{r, eval=FALSE}
install.packages('remotes')
require(remotes)
install_github('harrelfe/rms')
# [package]::[function]() is a common syntax that shows what package is associated with the given function
# for example, this works like above but explicitly tells us we're using the "remotes" package
# remotes::install_github('harrelfe/rms')
```
### Loading Packages
Note that `library` is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it's nice to recognize the distinction. You may see `require` used in its place. If the package is not installed, `library` will trigger an error while `require` will return FALSE. Once loaded the functions created in the package are available to your R session.
```{r, eval=FALSE}
library(Hmisc)
require(Hmisc)
```
## Help
From the R console one can get help about any function by appending it with a `?`.
```{r, eval=FALSE}
?lm
```
The R maintainers are quite pedantic about requiring package help files on functions up to date, complete and written in proper English. The benefit is that the user now has a reference for just about every function including from community sourced packages.
## Rmarkdown
Earlier versions of this course focused on LaTeX document generation. LaTeX is a wonderful publishing tool (for PDF), that is also as complex as it is rich. Rmarkdown has come a long way to generating very rich documents that compile to either HTML or PDF with very little knowledge of the file formats of either. Given that the focus of this course is on Regression Modeling strategies and not document generation, the current focus is on using Rmarkdown and showing some interactive plots via HTML. Full support and possibilities are shown at [https://rmarkdown.rstudio.com/](https://rmarkdown.rstudio.com/).
The cheat sheet is available at [https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/master/rmarkdown-2.0.pdf), or through Rstudio's menu bar: `Help -> Cheetsheets -> R Markdown Cheet Sheet`.
[Quarto][quarto] is a next generation version of R Markdown from RStudio. Users are encouraged to evaluate it as an alternative, though it must be installed in addition to R and RStudio.
## Best Practices
A blog post [RMarkdown Driven Development][reiderer] by Emily Riederer is an excellent introduction to best practices in producing statistical documents using Rmarkdown.
One takeaway is to avoid the general pitfall of putting large amounts of data processing *inside* ones Rmarkdown reports, then minor changes to the report can require hours to compile. Large data processing and wrangling tasks should be done in scripts *outside* the report. The report should load the results of this process.
Another great reference is Dr. Harrell's [analysis project workflow][rflow]. His content mirrors much of this workshop and will reinforce its material. It can also be used as a template for quarto as its source code is available (by clicking the "</> Code" link).
## Caching
Chunks in Rmarkdown can be cached. I.e., the results of the last computation are saved and it's not rerun until a manual override by clicking on that section is done. It is recommended that one *not* do this as it inevitably leads to confusion when code or data changes are made and the document compilation does not reflect the changes.
A better approach for long computations is to do these with external scripts and save the results. The report can load these results and present as needed. It's very similar to caching, but is not prone to later confusion as it's a bit more direct that the report is loading the processed data file. In short do large data processing tasks *outside* of reports.
## Harrell's RMS
See chapters 1 and 9 of [Biostatistics for Biomedical Research][bbr].
## Setup
Most Rmarkdown documents have a setup section that loads required libraries and sets up some options. The following snippet is the basic setup for this course. It loads the `Hmisc` package. Sets `knitr` to use markdown by default. Sets the graphics option to use `plotly`. It also pulls some HTML rendering options from the markup specs for use later and stores this in a global variable `mu`.
```{r setup,results='hide'}
require(Hmisc)
knitrSet(lang='markdown', h=4.5, fig.path='png/', fig.align='left')
# knitrSet redirects all messages to messages.txt
options(grType='plotly') # for certain graphics functions
mu <- markupSpecs$html # markupSpecs is in Hmisc
```
## Data
The `getHdata` function is used to fetch a dataset from the Vanderbilt Biostatistics [DataSets][datasets] web site. `upData` is used to
- create a new variable from an old one
- add labels to 2 variables
- add units to the new variable
- remove the old variable
- automatically move units of measurements from parenthetical expressions in labels to separate `units` attributes used by `Hmisc` and `rms` functions for table making and graphics
`contents` is used to print a data dictionary, run through an `html` method for nicer output.
```{r metadata,results='asis'}
getHdata(pbc)
pbc <- upData(pbc,
fu.yrs = fu.days / 365.25,
labels = c(fu.yrs = 'Follow-up Time',
status = 'Death or Liver Transplantation'),
units = c(fu.yrs = 'year'),
drop = 'fu.days',
moveUnits=TRUE, html=TRUE)
html(contents(pbc), maxlevels=10, levelType='table')
```
## Exercise 1
Do exercise 1. Pull the example exercise.Rmd from github as your starting template.
## Descriptive Statistics Without Stratification
```{r describe,results='asis'}
# did have results='asis' above
d <- describe(pbc)
html(d, size=80, scroll=TRUE)
p <- plot(d)
# "p" has two named elements that we can access with "$"
p$Categorical # or, p[['Categorical']]
p$Continuous
```
## Stratified Descriptive Statistics
Produce stratified quantiles, means/SD, and proportions by treatment group. Plot the results before rendering as an advanced HTML table:
- categorical variables: a single dot chart
- continuous variables: a series of extended box plots
```{r summaryM}
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders +
alk.phos + sgot + chol ~ drug, data=pbc,
overall=FALSE, test=TRUE)
```
# Plotting
```{r summaryM4plot}
plot(s, which='categorical')
plot(s, which='continuous', vars=1:4)
plot(s, which='continuous', vars=5:7)
```
```{r summaryMhtml, results='asis'}
html(s, caption='Baseline characteristics by randomized treatment',
exclude1=TRUE, npct='both', digits=3,
prmsd=TRUE, brmsd=TRUE, msdsize=mu$smaller2)
```
## Spike Histogram
Show data for continuous variable stratified by treatment.
```{r histbox,results='asis'}
p <- with(pbc, histboxp(x=sgot, group=drug, sd=TRUE))
p
```
## Data Visualization
The very low birthweight data set contains data on 671 infants born with a birth weight of under 1600 grams. We'll plot gestational age by birthweight using three graphics systems: base graphics, ggplot, and plotly.
```{r}
getHdata(vlbw)
# remove missing values
vlbw <- vlbw[complete.cases(vlbw[,c('sex','dead','gest','bwt')]),]
```
## Base Graphics Example
Empty canvas - add each element into your plot.
```{r, fig.width = 6}
# "grps" is a list with each combination of sex/dead
grps <- split(vlbw[,c('gest','bwt')], vlbw[,c('sex','dead')])
# default "plot" function; set up parameters but with no lines/points because type 'n'
plot(c(22,40), c(400,1600), type='n', xlab='Gestational Age', ylab='Birth Weight (grams)', axes=FALSE)
# axis 1 is below plot
axis(1, at=c(22,28,34,40), labels=c(22,28,34,40))
# axis 2 is left of plot
axis(2, at=seq(400,1600,by=400), labels=seq(400,1600,by=400))
## four different methods to add points to plot area
# data.frame with two columns: 1st column is associated with x variable
points(grps[['female.0']], col='black', pch=1)
# two arguments: first vector is associated with x variable
points(grps[['female.1']][,'gest'], grps[['female.1']][,'bwt'], col='gray', pch=0)
# like above, but jitter adds noise so points have less overlap
points(jitter(grps[['male.0']][,'gest'], 2), grps[['male.0']][,'bwt'], col='black', pch=2)
# LHS ~ RHS: with tilde, the left hand side is associated with y variable
points(grps[['male.1']][,'bwt'] ~ jitter(grps[['male.1']][,'gest'], 2), col='gray', pch=3)
legend(x=37, y=1000, legend=c('F:alive','F:dead','M:alive','M:dead'), col=c('black','gray','black','gray'), pch=c(1,0,2,3))
```
## ggplot2 Example
Given a data set, choose the aesthetic mapping and geometry layer.
```{r}
# `as.factor(dead)` makes categorical variable rather than continuous
p <- ggplot(data=vlbw) + aes(x=gest, y=bwt, color=sex, shape=as.factor(dead)) + geom_point()
p
```
```{r}
p <- p + geom_jitter() + scale_x_continuous() + scale_y_continuous()
p
```
## Plotly
Add interactive graphics, which is trivial when also using `ggplot`.
```{r}
require(plotly)
ggplotly(p)
```
```{r}
# plotly uses pipe operator (%>%) from magrittr package
# in this context it works much like "+" does with ggplot
plot_ly(type="box") %>%
add_boxplot(y = grps[['female.0']][,'bwt'], name='F:0') %>%
add_boxplot(y = grps[['female.1']][,'bwt'], name='F:1') %>%
add_boxplot(y = grps[['male.0']][,'bwt'], name='M:0') %>%
add_boxplot(y = grps[['male.1']][,'bwt'], name='M:1')
```
# Models
## Cardiovascular risk factor data
```{r}
getHdata(diabetes)
html(contents(diabetes), levelType='table')
```
## Formulas
The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.
* y ~ a + b, `+` indicates to include both a and b as terms
* y ~ a:b, `:` indicates the interaction of a and b
* y ~ a*b, equivalent to y ~ a+b+a:b
* y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
* y ~ a + I(b+c), `I` indicates to use `+` in the arithmetic sense
See ?formula for more examples.
## Model Fitting
The most simple model-fitting function is `lm`, which is used to fit linear models. It's primary argument is a formula. Using the diabetes data set, we can fit waist size by weight.
```{r}
m <- lm(waist ~ weight, data=diabetes)
```
This creates an `lm` object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.
* coef
* fitted
* predict
* residuals
* vcov
* summary
* anova
```{r}
names(m)
# can access named elements with "$"
m$coefficients
coef(m)
head(fitted(m))
# defaut "predict" gives same results
# sum(abs(fitted(m) - predict(m)) > 1e-8)
predict(m, data.frame(weight=c(150, 200, 250)))
head(residuals(m))
vcov(m) # variance-covariance matrix
# "summary" produces additional information about model
summary(m)
coef(summary(m))
anova(m)
```
## Visualization
Create a scatterplot of weight and waist size.
```{r}
# `qplot` can create default ggplot output
p <- qplot(weight, waist, data=diabetes)
p + geom_smooth(method="lm")
# the above uses the implied formula y~x
# p + geom_smooth(method="lm", formula = y ~ x)
```
Remove missing values and fit with LOESS curve.
```{r}
diab <- diabetes[complete.cases(diabetes[,c('waist','weight')]),]
p <- qplot(weight, waist, data=diab)
p + geom_smooth(method="loess")
```
## Data Manipulation
The Dominican Republic Hypertension dataset contains data on gender, age, and systolic and diastolic blood pressure from several villages in the DR.
```{r}
getHdata(DominicanHTN)
html(describe(DominicanHTN))
```
## Adding variables or transformations
The data.frame bracket syntax is data.frame[row_selection, column_selection].
```{r}
# create "mean arterial pressure"
DominicanHTN[,'map'] <- (DominicanHTN[,'sbp'] + DominicanHTN[,'dbp'] * 2) / 3
# use `as.numeric` to convert TRUE/FALSE into 1/0
DominicanHTN[,'male'] <- as.numeric(DominicanHTN[,'gender'] == 'Male')
DominicanHTN[1:5,]
```
In this example, I collapse a variable into smaller bins - which should generally be avoided.
```{r}
qage <- quantile(DominicanHTN[,'age'])
qage[1] <- qage[1]-1
qage
DominicanHTN[,'ageGrp'] <- cut(DominicanHTN[,'age'], breaks=qage)
# could easily specify other breakpoints
# levels(cut(DominicanHTN[,'age'], breaks = c(0, 18, 100)))
# levels(cut(DominicanHTN[,'age'], breaks = seq(10, 100, by = 10)))
```
## Filter
```{r}
nrow(DominicanHTN)
nrow(DominicanHTN[DominicanHTN[,'gender'] == "Male",])
which(DominicanHTN[,'village'] %in% c('Carmona','San Antonio'))
cojMen <- DominicanHTN[DominicanHTN[,'village'] == 'Cojobal' & DominicanHTN[,'gender'] == "Male",]
cojMen
```
## Sort
```{r}
order(cojMen[,'age'], decreasing=TRUE) # first value is element number with highest age
# use row numbers to re-order a data.frame
cojMen[order(cojMen[,'age'], decreasing=TRUE),]
```
## Aggregate and counts
```{r}
table(DominicanHTN[,'gender'])
```
```{r}
addmargins(with(DominicanHTN, table(village, gender)))
```
```{r}
with(DominicanHTN, tapply(age, village, mean))
```
```{r}
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
```
```{r}
aggregate(age ~ village + gender, DominicanHTN, median)
# easy to change statistical measure, or function
# aggregate(age ~ village + gender, DominicanHTN, max)
```
## Exercise 2 & 3
Do exercise 2 & 3.
## Rosner's lead data
The `lead` exposure dataset was collected to study the well-being of children who lived near a lead smelting plant. The following analysis considers the lead exposure levels in 1972 and 1973, as well as age and the finger-wrist tapping score `maxfwt`.
```{r contents,results='asis'}
getHdata(lead)
lead <- upData(lead,
keep = c('ld72','ld73','age','maxfwt'),
labels = c(age = 'Age'),
units = c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))
html(contents(lead), maxlevels=10, levelType='table')
```
```{r}
html(describe(lead))
```
# Introduction to RMS (Ordinary least squares regression)
The `rms` package includes the `ols` fitting function. Use `datadist` to compute summaries of the distributional characteristics of the predictors - or simply give it an entire data frame. `datadist` must be re-run if you add a new predictor or recode an old one.
Note: The `ols` function is *not* covered in the main course. Generalized least squares is. Ordinary least squares does not handle data that has correlation between residuals. It is covered here to give a simple overview of the structure of models in RMS.
```{r}
require(rms)
dd <- datadist(lead)
options(datadist='dd')
f <- ols(maxfwt ~ age + ld72 + ld73, data=lead)
f
```
`rms` provides many methods to work with the `ols` fit object.
* coef
* fitted
* predict
* resid
* anova
* summary
* Predict
* ggplot
* Function
* nomogram
## Coefficient Estimates
```{r}
coef(f)
```
## Define R function representing fitted model
The default values for function arguments are medians.
```{r}
g <- Function(f)
g
g(age=10, ld72=21, ld73=c(21, 47))
```
## Predicted values with standard errors
```{r}
predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)
```
## Residuals
```{r}
r <- resid(f)
par(mfrow=c(2,2))
plot(fitted(f), r); abline(h=0)
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
# q-q plot to check normality
qqnorm(r)
qqline(r)
```
## Plotting partial effects
`Predict` and `plotp` make one plot for each predictor. Predictors not shown in plot are set to constants (continuous:median, categorical:mode). 0.95 pointwise confidence limits for $\hat{E}(y|x)$ are shown; use `conf.int=FALSE` to suppress CLs.
```{r}
plotp(Predict(f))
```
Specify which predictors are plotted.
```{r}
plotp(Predict(f, age))
```
```{r}
plotp(Predict(f, age=3:15))
```
```{r}
plotp(Predict(f, age=seq(3,16,length=150)))
```
Obtain confidence limits for $\hat{y}$.
```{r}
plotp(Predict(f, age, conf.type='individual'))
```
Combine plots.
```{r}
p1 <- Predict(f, age, conf.int=0.99, conf.type='individual')
p2 <- Predict(f, age, conf.int=0.99, conf.type='mean')
p <- rbind(Individual=p1, Mean=p2)
ggplot(p)
```
```{r}
plotp(Predict(f, ld73, age=3))
```
```{r}
ggplot(Predict(f, ld73, age=c(3,9)))
```
3-d surface for two continuous predictors against $\hat{y}$.
```{r, fig.width=6}
bplot(Predict(f, ld72, ld73))
```
## Nomograms
```{r, fig.height=6, fig.width=6}
plot(nomogram(f))
```
## Point Estimates for Partial Effects
```{r}
summary(f)
```
Adjust `age` to 5, which has no effect as the model does not include any interactions.
```{r}
summary(f, age=5)
```
Effect of changing `ld73` from 20 to 40.
```{r}
summary(f, ld73=c(20,40))
```
If a predictor has a linear effect, a one-unit change can be used to get the confidence interval of its slope.
```{r}
summary(f, age=5:6)
```
There is also a `plot` method for `summary` results.
```{r}
plot(summary(f))
```
## Getting Predicted Values
Give `predict` a fit object and `data.frame`.
```{r}
predict(f, data.frame(age=3, ld72=21, ld73=21))
```
```{r}
predict(f, data.frame(age=c(3,10), ld72=21, ld73=c(21,47)))
```
```{r}
newdat <- expand.grid(age=c(4,8), ld72=c(21, 47), ld73=c(21, 47))
newdat
```
```{r}
predict(f, newdat)
```
Include confidence levels.
```{r}
predict(f, newdat, conf.int=0.95)
predict(f, newdat, conf.int=0.95, conf.type='individual')
```
Brute-force predicted values.
```{r}
b <- coef(f)
b[1] + b[2]*3 + b[3]*21 + b[4]*21
```
Predicted values with `Function`.
```{r}
# g <- Function(f)
g(age = c(3,8), ld72 = 21, ld73 = 21)
g(age = 3)
```
## ANOVA
Use `anova` to get all total effects and individual partial effects.
```{r}
an <- anova(f)
an
```
Include corresponding variable names.
```{r}
print(an, 'names')
print(an, 'subscripts')
print(an, 'dots')
```
Combined partial effects.
```{r}
anova(f, ld72, ld73)
```
## Exercise 4
Time for exercise 4.
# Cleaning and Importing Data
## Text files
* read.table - work with data in table format
* read.csv - CSV files
* read.delim - tab-delimited files
* scan - more general function for reading in data
## Binary representation of R objects
Compressed data and loads faster.
* RData
* `feather` package
## Database connections
Allows data queries.
* `RODBC` package
* `RSQLite` package
## Other statistical packages
* `Hmisc` package: `sas.get`, `sasxport.get`
* `haven` package: `read_sas`, `read_sav`, `read_dta`
* `foreign` package
* Stat/Transfer - not free nor open source
## Data Cleaning
An R `data.frame` consists of a collection of values. In a well-structured data set, each value is associated with a variable (column) and observation (row). Data sets often need to be manipulated to become well-structured.
Hadley Wickham's [Tidy Data][tidy] provides an excellent overview on how to clean messy data sets.
## Column header contains values
```{r}
# politics and religion
senateReligion <- data.frame(religion=c('Protestant','Catholic','Jewish','Mormon','Buddhist',"Don't Know/Refused"),
Democrats=c(20,15,8,1,1,3),
Republicans=c(38,9,0,5,0,0))
senateReligion
```
```{r}
senRel <- cbind(senateReligion[,'religion'], stack(senateReligion[,c('Democrats','Republicans')]))
names(senRel) <- c('religion','count','party')
senRel
```
## Multiple variables in a column
```{r}
pets <- data.frame(county=c('Davidson','Rutherford','Cannon'),
male.dog=c(50,150,70), female.dog=c(60,150,70),
male.cat=c(30,100,50), female.cat=c(30,70,40),
male.horse=c(6,30,20), female.horse=c(6,28,19))
pets
```
```{r}
pets2 <- cbind(pets[,'county'], stack(pets[,-1]))
names(pets2) <- c('county','count','ind')
pets2
```
Watch out for factor variables.
```{r}
tryCatch(strsplit(pets2[,'ind'], '\\.'), error=function(e){ e })
```
First `for` loop, allows iterating over a sequence.
```{r}
sexAnimal <- strsplit(as.character(pets2[,'ind']), '\\.')
pets2[,c('sex','animal')] <- NA
for(i in seq(nrow(pets2))) {
pets2[i, c('sex','animal')] <- sexAnimal[[i]]
}
pets2[,'ind'] <- NULL
pets2
```
## Variables in a row
```{r}
set.seed(1000)
d1 <- rnorm(1000, 5, 2)
d2 <- runif(1000, 0, 10)
d3 <- rpois(1000, 5)
d4 <- rbinom(1000, 10, 0.5)
x <- data.frame(distribution=c(rep('normal',2),rep('uniform',2),rep('poisson',2),rep('binomial',2)),
stat=rep(c('mean','sd'),4),
value=c(mean(d1), sd(d1), mean(d2), sd(d2), mean(d3), sd(d3), mean(d4), sd(d4)))
x
```
```{r}
cbind(distribution=x[c(1,3,5,7),'distribution'], unstack(x, form=value~stat))
```
## Normalization and merging
```{r}
employees <- data.frame(id=1:4, Name=c('Eddie','Andrea','Steve','Theresa'),
job=c('engineer','accountant','statistician','technician'))
set.seed(11)
hours <- data.frame(id=sample(4, 10, replace=TRUE),
week=1:10, hours=sample(30:60, 10, replace=TRUE))
merge(employees, hours)
# merge(employees, hours, by.x='id', by.y='id')
```
```{r}
empHours <- merge(employees, hours, all.x=TRUE)
empHours
```
```{r}
unique(empHours[,c('id','Name','job')])
empHours[,c('id','week','hours')]
```
Exclude rows with missing data.
```{r}
empHours[!is.na(empHours[,'week']), c('id','week','hours')]
```
## Combining data
```{r}
rbind(sexAnimal[[1]], sexAnimal[[2]], sexAnimal[[3]])
```
```{r}
do.call(rbind, sexAnimal)
```
## Exercise 5
Time for exercise 5.
# Simulation
Plot mean while sampling from normal distribution
```{r}
n <- 100
imean <- numeric(n)
smean <- numeric(n)
for(i in seq(n)) {
imean[i] <- mean(rnorm(10))
smean[i] <- mean(imean[seq(i)])
}
plot(imean, ylab='Sample Mean')
abline(h=0, lty=3)
lines(smean)
```
```{r}
n <- 1000
sig <- numeric(n)
for(i in seq(n)) {
grp1 <- rnorm(15, 60, 5)
grp2 <- rnorm(15, 65, 5)
sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)
```
Add some flexibility to our simulation by creating a function.
```{r}
tTestPower <- function(n=15, mu1=60, mu2=65, sd=5, nsim=1000) {
sig <- numeric(nsim)
for(i in seq(nsim)) {
grp1 <- rnorm(n, mu1, sd)
grp2 <- rnorm(n, mu2, sd)
sig[i] <- t.test(grp1, grp2, var.equal = TRUE)$p.value < 0.05
}
mean(sig)
}
tTestPower()
tTestPower(25, nsim=10000)
```
There's already a function to compute the power of a two-sample t test.
```{r}
power.t.test(n=25, delta=5, sd=5)$power
```
# Simulation and Bootstrapping
Simulation is a wonderful tool for really understanding just how well a method performs. Sometimes the modeling and techniques are complex enough that analytical and numerical properties are difficult to ascertain. Simulation can help answer that question by doing a procedure on random data repeatedly and examining how well the technique performs.
Bootstrap sampling is a related area that takes multiple random samples from a data set and performs estimates of statistical properties. It is robust in that it makes few assumptions about the structure of the data.
## Bootstrap sample
```{r}
true_mu <- 0
x <- rnorm(100, true_mu)
R <- 999
res <- matrix(nrow=R, ncol=6)
colnames(res) <- c('mu','se','lb','ub','coverage','bias')
for(i in seq(R)) {
r <- sample(x, replace=TRUE)
res[i,'mu'] <- mean(r)
res[i,'se'] <- sd(r) / sqrt(length(r))
}
res[,'lb'] <- res[,'mu'] + qnorm(0.025) * res[,'se']
res[,'ub'] <- res[,'mu'] + qnorm(0.975) * res[,'se']
res[,'coverage'] <- res[,'lb'] < true_mu & res[,'ub'] > true_mu
res[,'bias'] <- res[,'mu'] - true_mu
```
```{r}
res[1:5,]
vals <- colMeans(res)
basicCI <- 2 * mean(x) - quantile(res[,'mu'], probs=c(0.975, 0.025))
out <- sprintf("empirical SE: %.3f
MSE: %.3f
basic bootstrap confidence interval: (%.3f, %.3f)
coverage probability: %.3f
mean bias: %.3f
sample mean: %.3f
", sd(res[,'mu']), vals['se'], basicCI[1], basicCI[2], vals['coverage'], vals['bias'], mean(x))
cat(out)
```
## Control Structures
Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.
## Root finding with the Newton-Raphson algorithm
The `for` loop has already been introduced. Two other important control structures are `while` loops and `if` statements.
```{r}
f <- function(x) x^3 + x^2 - 3*x - 3
fp <- function(x) 3*x^2 + 2*x - 3
plot(f, xlim=c(-4,4))
```
```{r}
x <- -2
i <- 1
while(abs(f(x)) > 1e-8) {
x <- x - f(x) / fp(x)
i <- i + 1
if(i > 20) {
print("does not converge")
break
}
}
x
f(x)
```
## Exercise 6 & 7
Final exercises, 6 & 7.