---
title: "Write your own R functions, part 1"
output:
  html_document:
    toc: true
    toc_depth: 3
---
```{r setup, include = FALSE, cache = FALSE}
knitr::opts_chunk$set(error = TRUE, collapse = TRUE)
```
### What and why?
My goal here is to reveal the __process__ one long-time useR employs for writing functions. I also want to illustrate why the process is the way it is. Merely looking at the finished product, e.g. source code for R packages, can be extremely deceiving. Reality is generally much uglier ... but more interesting!
Why are we covering this now, smack in the middle of data aggregation? Powerful machines like `dplyr`, `plyr`, and even the built-in `apply` family of functions, are ready and waiting to apply your purpose-built functions to various bits of your data. If you can express your analytical wishes in a function, these tools will give you great power.
### Load the Gapminder data
As usual, load the Gapminder excerpt.
```{r}
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## or do this if the file isn't lying around already
## gd_url <- "http://tiny.cc/gapminder"
## gDat <- read.delim(gd_url)
```
### Max - min
Say you've got a numeric vector. Compute the difference between its max and min. `lifeExp` or `pop` or `gdpPercap` are great examples of a typical input. You can imagine wanting to get this statistic after we slice up the Gapminder data by year, country, continent, or combinations thereof.
### Get something that works
First, develop some working code for interactive use, using a representative input. I'll use Gapminder's life expectancy variable.
R functions that will be useful: `min()`, `max()`, `range()`. __Read their documentation.__
```{r}
## get to know the functions mentioned above
min(gDat$lifeExp)
max(gDat$lifeExp)
range(gDat$lifeExp)
## some natural solutions
max(gDat$lifeExp) - min(gDat$lifeExp)
with(gDat, max(lifeExp) - min(lifeExp))
range(gDat$lifeExp)[2] - range(gDat$lifeExp)[1]
with(gDat, range(lifeExp)[2] - range(lifeExp)[1])
diff(range(gDat$lifeExp))
```
Internalize this "answer" because our informal testing relies on you noticing departures from this.
#### Skateboard >> perfectly formed rear-view mirror
This image [widely attributed to the Spotify development team](http://blog.fastmonkeys.com/?utm_content=bufferc2d6e&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer) conveys an important point.
![alt text](img/spotify-howtobuildmvp.gif)
Build that skateboard before you build the car or some fancy car part. A limited-but-functioning thing is very useful. It also keeps the spirits high.
This is related to the valuable [Telescope Rule](http://c2.com/cgi/wiki?TelescopeRule):
> It is faster to make a four-inch mirror then a six-inch mirror than to make a six-inch mirror.
### Turn the working interactive code into a function
Add NO new functionality! Just write your very first R function.
```{r}
max_minus_min <- function(x) max(x) - min(x)
max_minus_min(gDat$lifeExp)
```
Check that you're getting the same answer as you did with your interactive code. Test it eyeball-o-metrically at this point.
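If you want something slightly firmer than an eyeball check, you can compare the function's result against one of the interactive solutions. This is only a minimal sketch: `x_toy` is a small made-up vector (not Gapminder data) so the chunk runs on its own.

```{r}
## same definition as above, repeated so this chunk stands alone
max_minus_min <- function(x) max(x) - min(x)

## hypothetical toy input, not Gapminder data
x_toy <- c(23.6, 40.1, 82.6, 59.5)

## compare the function against one of the interactive solutions
all.equal(max_minus_min(x_toy), diff(range(x_toy)))
```

`all.equal()` is used rather than `==` because it tolerates tiny floating point differences, which matters more once the two computations stop being literally identical.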
### Test your function
#### Test on new inputs
Pick some new artificial inputs where you know (at least approximately) what your function should return.
```{r}
max_minus_min(1:10)
max_minus_min(runif(1000))
```
I know that 10 minus 1 is 9. I know that random uniform [0, 1] variates will be between 0 and 1. Therefore max - min should be less than 1. If I take LOTS of them, max - min should be pretty close to 1.
It is intentional that I tested on integer input as well as floating point. Likewise, I like to use valid-but-random data for this sort of check.
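The reasoning above can be written down as logical comparisons, so the expectation sits next to the call. A sketch under some assumptions of mine: the seed is arbitrary and only there for reproducibility, and `0.9` is a loose, hand-picked stand-in for "pretty close to 1".

```{r}
## same definition as above, repeated so this chunk stands alone
max_minus_min <- function(x) max(x) - min(x)

set.seed(545)  ## arbitrary seed, only so the random draw is reproducible
u <- runif(1000)

## each of these should print TRUE
max_minus_min(1:10) == 9
max_minus_min(u) < 1
max_minus_min(u) > 0.9  ## loose, hand-picked threshold for "close to 1"
```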
#### Test on real data but *different* real data
Back to the real world now. Two other quantitative variables are lying around: `gdpPercap` and `pop`. Let's have a go.
```{r}
max_minus_min(gDat$gdpPercap)
max_minus_min(gDat$pop)
```
Either check these results "by hand" or apply the "does that even make sense?" test.
#### Test on weird stuff
Now we try to break our function. Don't get truly diabolical (yet). Just make the kind of mistakes you can imagine making at 2am when, 3 years from now, you rediscover this useful function you wrote. Give your function inputs it's not expecting.
```{r}
max_minus_min(gDat) ## hey sometimes things "just work" on data.frames!
max_minus_min(gDat$country) ## factors are kind of like integer vectors, no?
max_minus_min("eggplants are purple") ## i have no excuse for this one
```
How happy are you with those error messages? You must imagine that some entire __script__ has failed and that you were hoping to just source it without re-reading it. If a colleague or future you encountered these errors, would they run screaming from the room? How hard is it to pinpoint the usage problem?
#### I will scare you now
Here are some great examples STAT545 students devised during class where the function __should break but it does not.__
```{r}
max_minus_min(gDat[c('lifeExp', 'gdpPercap', 'pop')])
max_minus_min(c(TRUE, TRUE, FALSE, TRUE, TRUE))
```
In both cases, R's eagerness to make sense of our requests is unfortunately successful. In the first case, a data.frame containing just the quantitative variables is eventually coerced into a numeric vector. We can compute max minus min, even though it makes absolutely no sense at all. In the second case, a logical vector is converted to zeroes and ones, which might merit an error or at least a warning.
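To see the coercion at work on something small, here is a sketch using made-up inputs. `toy_df` is a hypothetical all-numeric data.frame, not part of Gapminder.

```{r}
## logicals are treated as 0/1 by as.numeric(), max(), and min()
as.numeric(c(TRUE, TRUE, FALSE, TRUE, TRUE))
max(c(TRUE, FALSE)) - min(c(TRUE, FALSE))

## hypothetical all-numeric data.frame: max() and min() look across every column
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
max(toy_df)
min(toy_df)
max(toy_df) - min(toy_df)
```

The last result mixes column `a` and column `b`, which is exactly the kind of silent nonsense described above.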
### Check the validity of arguments
For functions that will be used again -- which is not all of them! -- it is good to check the validity of arguments. This implements a rule from [the Unix philosophy](http://www.faqs.org/docs/artu/ch01s06.html):
> Rule of Repair: When you must fail, fail noisily and as soon as possible.
#### stopifnot
`stopifnot()` is the entry level solution. I use it here to make sure the input `x` is a numeric vector.
```{r}
mmm <- function(x) {
  stopifnot(is.numeric(x))
  max(x) - min(x)
}
mmm(gDat)
mmm(gDat$country)
mmm("eggplants are purple")
mmm(gDat[c('lifeExp', 'gdpPercap', 'pop')])
mmm(c(TRUE, TRUE, FALSE, TRUE, TRUE))
```
And we see that it catches all of the self-inflicted damage we would like to avoid.
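To look at the message `stopifnot()` produces without halting anything, you can capture the error with `tryCatch()`. A minimal sketch; the helper name `get_msg` is my own invention for illustration.

```{r}
## same definition as above, repeated so this chunk stands alone
mmm <- function(x) {
  stopifnot(is.numeric(x))
  max(x) - min(x)
}

## hypothetical helper: return the error message as a string instead of stopping
get_msg <- function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

get_msg(mmm("eggplants are purple"))
```

The message simply restates the failed condition, `is.numeric(x) is not TRUE`, which tells you what went wrong only if you already know what `x` is.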
#### if then stop
`stopifnot()` doesn't provide a very good error message. The next approach is very widely used. Put your validity check inside an `if()` statement and call `stop()` yourself, with a custom error message, in the body.
```{r}
mmm2 <- function(x) {
  if(!is.numeric(x)) {
    stop('I am so sorry, but this function only works for numeric input!')
  }
  max(x) - min(x)
}
mmm2(gDat)
```
In addition to offering an apology, note the error raised also contains helpful info on *which* function threw the error. Nice touch.
*Note: the above is true when run interactively but currently not true in the rendered document. That is a glitch in `knitr` that is getting straightened out.*
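One way to inspect both pieces of information, regardless of how the document is rendered, is to capture the condition object and query it. This is a sketch using base R's `tryCatch()`, `conditionMessage()`, and `conditionCall()`; the object name `err` is arbitrary.

```{r}
## same definition as above, repeated so this chunk stands alone
mmm2 <- function(x) {
  if(!is.numeric(x)) {
    stop('I am so sorry, but this function only works for numeric input!')
  }
  max(x) - min(x)
}

## capture the condition object instead of letting the error propagate
err <- tryCatch(mmm2("eggplants are purple"), error = function(e) e)

conditionMessage(err)  ## the custom message passed to stop()
conditionCall(err)     ## the call that raised the error
```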
### Packages for formal checks at run time
The [`assertthat` package](https://github.com/hadley/assertthat) "provides a drop in replacement for `stopifnot()`." That is quite literally true. The function `mmm3` differs from `mmm` only in the replacement of `stopifnot()` by `assert_that()`.
```{r}
## install if you do not already have!
## install.packages("assertthat")
library(assertthat)
mmm3 <- function(x) {
  assert_that(is.numeric(x))
  max(x) - min(x)
}
mmm3(gDat)
```
The [`ensurer` package](https://github.com/smbache/ensurer) is another, newer package with some similar goals, so you may want to check that out as well.
```{r echo = FALSE, eval = FALSE}
## install if you do not already have!
## devtools::install_github("smbache/ensurer")
library(ensurer)
mmm4 <- function(x) {
  ## ensure_that() checks its first argument; `.` stands for the value being checked
  ensure_that(x, is.numeric(.))
  max(x) - min(x)
}
mmm4(gDat)
```
#### Sidebar: other uses for `assertthat` or `ensurer`
Another good use of these packages is to leave checks behind in data analytical scripts. Consider our repetitive use of Gapminder. Every time we load this data, we inspect it, e.g., with `str()`. Informally, we're checking that it still has `r nrow(gDat)` rows. But we could, and probably should, formalize that with a call like `assert_that(nrow(gDat) == 1704)`. This would tell us if the data suddenly changed, alerting us to a problem with the data file or the import. This can be a useful wake-up call in scripts that you re-run a lot as you build a pipeline, where it's easy to zone out and stop paying attention.
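Here is a sketch of what such a check might look like just after an import. To keep it self-contained I use base R's `stopifnot()` and a made-up data.frame, `toy_dat`, standing in for freshly loaded data; `assert_that()` could be swapped in the same way.

```{r}
## hypothetical stand-in for a freshly imported data set
toy_dat <- data.frame(country = c("A", "B", "C"), pop = c(10, 20, 30))

## silent if the data look as expected, an error otherwise
stopifnot(nrow(toy_dat) == 3, is.numeric(toy_dat$pop))
```

Because the check is silent on success, it costs nothing to leave in a script you re-run many times.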
### Wrap-up and what's next?
Here's the function we've written so far:
```{r}
mmm3
```
What we've accomplished:
* we've written our first function
* we are checking the validity of its input, argument `x`
* we've done a good amount of informal testing
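As a reminder of why this is worth the trouble, here is a sketch of handing the function to a member of the `apply` family. It uses the base R `stopifnot()` version, `mmm`, so the chunk needs no extra packages, and `toy_gap` is a made-up miniature data set standing in for Gapminder.

```{r}
## same definition as above, repeated so this chunk stands alone
mmm <- function(x) {
  stopifnot(is.numeric(x))
  max(x) - min(x)
}

## hypothetical miniature data set, standing in for Gapminder
toy_gap <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp = c(50, 70, 65, 80)
)

## apply the function to lifeExp within each continent
with(toy_gap, tapply(lifeExp, continent, mmm))
```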
Where to next? In [part 2](block011_write-your-own-function-02.html), we generalize this function to take differences in other quantiles and learn how to set default values for arguments.
### Resources
Packages
* [`assertthat` package](https://github.com/hadley/assertthat)
* [`ensurer` package](https://github.com/smbache/ensurer)
* [`testthat` package](https://github.com/hadley/testthat)
Hadley Wickham's forthcoming book [Advanced R](http://adv-r.had.co.nz)
* Section on [defensive programming](http://adv-r.had.co.nz/Exceptions-Debugging.html#defensive-programming)
Hadley Wickham's forthcoming book [R packages](http://r-pkgs.had.co.nz)
* [Testing chapter](http://r-pkgs.had.co.nz/tests.html)