# Trade-offs {#oo-tradeoffs}
You have just learned about the three most important OOP toolkits available in R. Now that you understand their basic operation and the principles that underlie them, we can compare and contrast the systems in order to understand their relative strengths and weaknesses. This will help you understand which system is most likely to solve the particular problem you have at hand.
All else being equal, I recommend that you use S3. S3 is simple and widely used throughout base R and contributed packages. While it's far from perfect, its idiosyncrasies are well understood and there are known approaches to overcome its shortcomings. Always start by considering whether S3 can solve your problem.
If you have an existing background in programming you will probably lean towards R6, because it will feel familiar. I think you should resist this tendency for two reasons. Firstly, if you use R6 it's very easy to create a non-idiomatic API that will feel very odd to native R users, and will have surprising pain points because of the reference semantics. Secondly, if you stick to R6, you'll lose out on learning a new way of thinking about OOP that gives you a new set of tools for solving problems.
The remainder of this chapter describes when you should use S4 or R6 instead. Comparing S3 to S4 only requires a brief discussion since the essence of S3 and S4 is so similar. The comparison of S3 and R6 is much longer because the two systems are profoundly different.
## S4 vs S3
Once you've mastered S3, S4 is relatively easy to pick up: the underlying ideas are the same, S4 is just more formal, more strict, and more verbose.
The strictness and formalism of S4 make it well suited for large teams. Since more structure is provided by the system itself, there is less need for convention, and you don't need to provide as much education to new contributors. S4 tends to require more upfront design than S3, and this investment is more likely to pay off on larger projects, where greater resources are available.
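As a small sketch of that extra structure, consider a hypothetical `Employee` class (the class and its validity rule are mine, invented for illustration): the slots, their types, and a validity check are declared to the system rather than left to convention.

```{r}
library(methods)

setClass("Employee",
  slots = c(name = "character", salary = "numeric"),
  validity = function(object) {
    # The system, not a convention, guarantees this invariant
    if (length(object@salary) == 1 && object@salary < 0) {
      "salary must be non-negative"
    } else {
      TRUE
    }
  }
)

e <- new("Employee", name = "Ada", salary = 100)
e@name  # "Ada"

# new("Employee", name = "Bob", salary = -1) would signal an error
```

A new contributor can read the `setClass()` call and know exactly what a valid `Employee` looks like; an equivalent S3 class would document this only by convention.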
One large team effort where S4 is used to good effect is Bioconductor. Bioconductor is similar to CRAN: it's a way of sharing packages amongst a wide audience. Bioconductor is smaller than CRAN (~1,300 vs ~10,000 packages, July 2017), and the packages tend to be more tightly integrated because of the shared domain and a stricter review process. Bioconductor packages are not required to use S4, but most do, because the key data structures (e.g. SummarizedExperiment, IRanges, DNAStringSet) are built using S4.
```{r, include = FALSE}
library(Matrix)
ver <- packageVersion("Matrix")
gs <- getGenerics("package:Matrix")
generics <- gs@.Data[gs@package == "Matrix"]
n_generics <- length(generics)
classes <- getClasses("package:Matrix", FALSE)
n_classes <- length(classes)
methods <- lapply(gs@.Data, findMethods)
n_methods <- length(unlist(methods, recursive = FALSE))
```
S4 is also a good fit when you have a complicated system of interrelated objects, and it's possible to minimise code duplication through careful method implementation. The best example of this use case is the Matrix package by Douglas Bates and Martin Mächler. It is designed to efficiently store and compute with many different types of sparse and dense matrices. As of version `r ver`, it defines `r n_classes` classes, `r n_generics` generic functions, and `r n_methods` methods. To give you some idea of the complexity, a small subset of the class graph is shown in Figure \@ref(fig:matrix-classes).
```{r matrix-classes, echo = FALSE, out.width = NULL, fig.cap= "A small subset of the Matrix class graph showing the inheritance of sparse matrices. Each concrete class inherits from two virtual parents: one that describes how the data is stored (C = column oriented, R = row oriented, T = tagged) and one that describes any restriction on the matrix (s = symmetric, t = triangle, g = general)"}
knitr::include_graphics("diagrams/s4-matrix-dsparseMatrix.png", dpi = 300)
```
This domain is a good fit for S4 because there are often computational shortcuts for specific types of sparse matrix. S4 makes it easy to provide a general method that works for all inputs, and then provide a selection of more specialised methods where the specific data structures allow for a more efficient implementation. This requires careful planning to avoid method dispatch ambiguity, but for complicated systems the planning pays off.
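The pattern can be sketched in miniature. The classes here (`denseMat`, `diagMat`, and the `matTrace()` generic) are hypothetical stand-ins for the Matrix package's real ones: a general method handles any matrix, while a specialised method exploits the cheaper diagonal representation.

```{r}
library(methods)

setClass("Mat", representation("VIRTUAL"))
setClass("denseMat", contains = "Mat", slots = c(x = "matrix"))
setClass("diagMat", contains = "Mat", slots = c(d = "numeric"))

setGeneric("matTrace", function(m) standardGeneric("matTrace"))

# General method: works for any dense matrix, O(n^2) storage
setMethod("matTrace", "denseMat", function(m) sum(diag(m@x)))

# Specialised shortcut: a diagonal matrix stores only its diagonal
setMethod("matTrace", "diagMat", function(m) sum(m@d))

matTrace(new("denseMat", x = diag(3)))       # 3
matTrace(new("diagMat", d = c(1, 2, 3)))     # 6
```

Callers write `matTrace(m)` regardless of representation; dispatch quietly picks the most efficient implementation available.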
The biggest challenge to using S4 is the combination of increased complexity and the absence of a single place to learn it. The documentation for S4 is scattered over multiple man pages, books, and websites. S4 is a complex system and can be challenging to use effectively in practice. It deserves a book-length treatment, but that book does not (yet) exist. (The documentation for S3 is no better, but because S3 is much simpler the lack is less painful.)
## R6 vs S3
There are three primary differences between S3 and R6:

* In R6, methods belong to objects; in S3, methods belong to generic
  functions. This leads to some differences around namespacing, and to
  method chaining as an alternative to the pipe.

* R6 objects are mutable: they are modified in place, not copied on modify.
  This allows you to avoid a painful process called "threading state", and
  makes R6 suitable for modelling real-world objects, which do change over
  time.

* In R6, you can hide data and methods from the end user in private fields;
  in S3, you cannot. This leads to some important trade-offs.
### Namespacing
One non-obvious difference between S3 and R6 is the "space" in which methods are found. Generic functions are global: all packages have to share the same namespace. Encapsulated methods are local: methods are bound to a single object.
Because generic functions share a global namespace, you need to think more carefully about naming. You want to minimise the use of generics with the same name in different packages, because clashes force the user to type `::` frequently. This is particularly challenging because function names are usually written in English, and English words often have multiple meanings. Generally, you should avoid using homonyms of the original generic:
```{r, eval = FALSE}
plot(data) # plot some data
plot(bank_heist) # plot a crime
plot(land) # create a new plot of land
plot(movie) # extract plot of a movie
```
This same problem doesn't occur with R6 methods because they are scoped to the object. This code is fine, because there is no implication that the plot method of two different R6 objects means the same thing:
```{r, eval = FALSE}
data$plot()
bank_heist$plot()
land$plot()
movie$plot()
```
The reason that S3 works so well is that in data analysis you often want to do the same thing to different types of objects. For example, every model function in R understands `summary()` and `predict()`. Generic functions provide a uniform API that makes it much easier to do typical things with a new object.
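To make this concrete, here is a minimal sketch: a hypothetical `meanmod` class (a "model" that just predicts the mean) that plugs straight into generics every R user already knows. `print()` and `predict()` are real generics from base R and stats; everything else here is invented for illustration.

```{r}
new_meanmod <- function(y) {
  structure(list(mu = mean(y)), class = "meanmod")
}

# Implementing methods for familiar generics gives the class a uniform API
print.meanmod <- function(x, ...) {
  cat("Mean model: mu =", x$mu, "\n")
  invisible(x)
}
predict.meanmod <- function(object, newdata, ...) {
  rep(object$mu, length(newdata))
}

m <- new_meanmod(c(1, 2, 3))
predict(m, 1:5)  # 2 2 2 2 2
```

A user who has never seen `meanmod` before can still guess that `predict(m, newdata)` will work, because the generic's contract is shared across the whole ecosystem.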
In R6, creating a method is basically free, and many encapsulated OO languages encourage you to create lots of small methods, each doing one thing well, with an evocative name. The same advice does not apply to S3: it's still a good idea to break your code down into small, easily understood chunks, but those chunks should not usually be methods, because creating a new method is expensive; you also have to create a new generic, and it's hard to justify a new generic for a small chunk of code.
#### Method chaining {#tradeoffs-pipe}
Any R6 method that is primarily called for its side-effects (usually modifying the object) should return `invisible(self)`. This allows the user to chain together multiple method calls in a single expression, a technique known as __method chaining__.
```{r, eval = FALSE}
s$
push(10)$
push(20)$
pop()
```
Method chaining achieves similar goals to the pipe (`%>%`). The goal of both techniques is to allow you to read code from left to right, as an imperative series of actions: do this, then do that, then do something else.
Each technique has strengths and weaknesses. The primary advantage of method chaining is that you get useful autocomplete; the primary disadvantage is that only the creator of the class can add new methods (and there's no way to use multiple dispatch).
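For comparison, the same left-to-right reading with a functional API might look like the following sketch. The copy-on-modify `push()` helper is hypothetical, and I use the base pipe `|>` (available in R >= 4.1) so the example is self-contained, though the chapter's `%>%` would read the same way.

```{r}
# Hypothetical copy-on-modify stack helpers
new_stack <- function() list(items = list())
push <- function(x, y) {
  x$items[[length(x$items) + 1]] <- y
  x
}

# Pipe: left-to-right, but any function can appear in the chain
s <- new_stack() |> push(10) |> push(20)
length(s$items)  # 2
```

With the pipe, anyone can extend the chain with a new function; with method chaining, only the class author can add a new step.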
### Mutability
If you've programmed in a mainstream OO language, R6 will feel very natural. But because R6 objects can introduce side effects through mutable state, they are harder to reason about. For example, when you call `f(a, b)` in R you can usually assume that `a` and `b` will not be modified. But if `a` and `b` are R6 objects, they might be modified in place. Generally, when using R6 objects you want to minimise side effects as much as possible, and use mutable state only where it is absolutely required. The majority of functions should still be "functional", and free of side effects. This makes code easier to reason about, and easier for other R programmers to understand.
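The surprise is easy to demonstrate with a sketch (a hypothetical `Counter` class; assumes the R6 package is installed): a function that receives an R6 object can silently change its caller's copy, while ordinary values are protected by copy-on-modify.

```{r}
library(R6)

Counter <- R6Class("Counter", public = list(
  n = 0,
  bump = function() {
    self$n <- self$n + 1
    invisible(self)
  }
))

f <- function(cnt) cnt$bump()  # modifies its argument in place!

c1 <- Counter$new()
f(c1)
c1$n  # 1: the caller's object changed

# Contrast with copy-on-modify semantics of ordinary values:
g <- function(x) {
  x[1] <- 99
  x
}
v <- c(1, 2)
g(v)
v[1]  # still 1: the caller's vector is untouched
```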
It's possible to get the best of both worlds: use R6 internally, but don't expose the R6 objects (and their reference semantics) to your users.
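One way that can look is sketched below (the `Queue` class and `collect_evens()` function are hypothetical, invented for illustration; assumes the R6 package is installed): a mutable object does the bookkeeping inside a function, but only a plain value escapes.

```{r}
library(R6)

Queue <- R6Class("Queue", public = list(
  items = list(),
  add = function(x) {
    self$items[[length(self$items) + 1]] <- x
    invisible(self)
  }
))

collect_evens <- function(xs) {
  q <- Queue$new()  # mutable state lives only inside this call
  for (x in xs) {
    if (x %% 2 == 0) q$add(x)
  }
  unlist(q$items)   # return a plain value; no references escape
}

collect_evens(1:6)  # 2 4 6
```

From the caller's point of view, `collect_evens()` is an ordinary pure function; the reference semantics are an invisible implementation detail.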
#### Threading state
For example, imagine you want to create a stack of objects. There are two main methods for a stack: `push()` adds a new object to the top of the stack, and `pop()` removes the top object. The implementation of a stack in S3 is fairly simple:
```{r}
new_stack <- function(items = list()) {
structure(list(items = items), class = "stack")
}
length.stack <- function(x) length(x$items)
push <- function(x, y) {
  x$items[[length(x) + 1]] <- y
  x
}
```
That is, until we get to `pop()`. `pop()` is challenging because it has to both return a value (the object at the top of the stack) and have a side effect (removing that object from the top). How can we do this in S3, where we can't modify the input object? We need to return two things: the value, and the updated object:
```{r}
pop <- function(x) {
n <- length(x)
item <- x$items[[n]]
x$items <- x$items[-n]
list(item = item, x = x)
}
```
(Note that I've chosen not to make `push()` and `pop()` generic because there currently aren't any other data structures that use them.)
This leads to rather awkward usage:
```{r}
s <- new_stack()
s <- push(s, 10)
s <- push(s, 20)
out <- pop(s)
out$item
s <- out$x
s
```
This problem is known as __threading state__ or __accumulator programming__: no matter how deeply `pop()` is called, you have to feed the modified stack object all the way back to where the stack lives.
One way that other FP languages deal with this challenge is to provide "multiple assignment" (destructuring bind), which allows you to assign several values in a single step. The zeallot R package, by Nathan and Paul Teetor, provides multi-assign for R via `%<-%`. This makes the code more elegant:
```{r}
library(zeallot)
c(value, s) %<-% pop(s)
value
```
Compare this to an R6 implementation. The structure of the class is basically identical: only the organisation of the methods has changed. The main difference is in `$pop()`: because the object is mutable, we can modify it directly and don't need to return it.
```{r}
Stack <- R6::R6Class("Stack", list(
items = list(),
push = function(x) {
self$items[[self$length() + 1]] <- x
invisible(self)
},
pop = function() {
item <- self$items[[self$length()]]
self$items <- self$items[-self$length()]
item
},
length = function() {
length(self$items)
}
))
```
This leads to much more natural code:
```{r}
s <- Stack$new()
s$push(10)
s$push(20)
s$pop()
```
#### Changing objects
Another option would be to build the S3 object on top of an environment, which has reference semantics. In general, I don't think this is a good idea, because you end up with an object that looks like a regular R object from the outside but has reference semantics. It's better to keep the two clearly separate.
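To see why, here is a sketch of what such an environment-backed object looks like (the `estack` class and its helpers are hypothetical): it behaves like an ordinary S3 object on the surface, but "copies" of it share state.

```{r}
new_estack <- function() {
  e <- new.env()
  e$items <- list()
  structure(e, class = "estack")  # looks like an ordinary S3 object
}

push_estack <- function(x, y) {
  # Modifies the underlying environment: no copy is made
  x$items[[length(x$items) + 1]] <- y
  x
}

s <- new_estack()
s2 <- push_estack(s, 10)
length(s$items)  # 1: `s` changed, even though we never reassigned it
```

A user who expects copy-on-modify semantics from an S3 object would be badly surprised here, which is exactly the confusion worth avoiding.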
The presumption of S3 methods is that they are pure: calling the same method with the same inputs should return the same output. The presumption of R6 methods is that they are not pure: you can only expect purity if explicitly documented to be so.
This also means that R6 is a more natural interface to things in the real world that change over time. For example, the processx package models an external process: the process does change over time, so representing it with an immutable S3 object would be fundamentally misleading.
### Privacy
Another difference with R6 is that you can have private fields and methods that are not easily accessible to the user. There is no way to do the same in S3. There are advantages and disadvantages to privacy. On the plus side, private elements enforce a "separation of concerns", making it possible to clearly delineate what you consider an internal implementation detail from what the user should work with. On the downside, you need to think more carefully about what to expose: the user cannot easily reach inside your object to get what they want.
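A minimal sketch of the mechanism (a hypothetical `BankAccount` class; assumes the R6 package is installed): fields listed in `private` are reachable from methods via `private$`, but not from the outside.

```{r}
library(R6)

BankAccount <- R6Class("BankAccount",
  public = list(
    initialize = function(balance = 0) {
      private$balance <- balance
    },
    deposit = function(x) {
      private$balance <- private$balance + x
      invisible(self)
    },
    report = function() private$balance
  ),
  private = list(
    balance = NULL  # implementation detail: not reachable as a$balance
  )
)

a <- BankAccount$new(100)
a$deposit(50)
a$report()  # 150
a$balance   # NULL: the private field is hidden from the user
```

The only supported way to read the balance is `$report()`, so the author is free to change how the balance is stored without breaking users.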
Privacy is unambiguously good in most programming languages. But most R users are familiar with reaching inside an S3 object to get what they want.
R is not as strict as other programming languages. Its contracts are not enforced by a team of lawyers; they are a handshake between friends.