-
Notifications
You must be signed in to change notification settings - Fork 54
/
Copy pathstrategy-objects.qmd
145 lines (107 loc) · 6.98 KB
/
strategy-objects.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
# Extract strategies into objects {#sec-strategy-objects}
```{r}
#| include: FALSE
source("common.R")
```
## What's the problem?
Sometimes different strategies need different arguments.
In this case, instead of using an enum, you'll need to use richer objects capable of storing optional values as well as the strategy name.
This pattern is similar to combining @sec-argument-clutter and @sec-enumerate-options together.
## What are some examples?
- `grepl()` has Boolean `perl` and `fixed` arguments, but you're not really toggling two independent settings, you're picking from one of three regular expression engines (the default, the engine used by Perl, and fixed matches).
Additionally, the `ignore.case` argument only applies to two of the strategies.
In stringr, however, you use helper functions like `regex()` and `fixed()` to wrap around the pattern, and supply optional arguments that only apply to that strategy.
- `ggplot2::geom_histogram()` has three main strategies for defining the bins: you can supply the number of `bins`, the width of each bin (the `binwidth`), or the exact `breaks`.
But it's currently difficult to derive this from the function specification, and there are complex argument dependencies (e.g. you can only supply one of `boundary` and `center`, and neither applies if you use `breaks`).
- `dplyr::left_join()` uses an advanced form of this pattern where the different strategies for joining two data frames together are expressed in a mini-DSL provided by `dplyr::join_by()`.
## How do you use the pattern?
In more complicated cases, different strategies will require different arguments, so you'll need a bit more infrastructure.
The basic idea is to build on the options object described in @sec-argument-clutter, but instead of providing just one helper function, you'll provide one function per strategy.
This is the way stringr works: you can select a different matching engine by wrapping the `pattern` in one of `regex()`, `boundary()`, `coll()`, or `fixed()`.
We'll explore how stringr ended up with design and how you can implement something similar yourself by looking at the base regular expression functions.
### Selecting a pattern engine
The basic regular expression functions (`grep()`, `grepl()`, `sub()`, `gsub()`, `regexpr()`, `gregexpr()`, `regexec()`, and `gregexec()`) all `fixed` and `perl` arguments that allow to select the regular expression engine that's used:
- `perl = FALSE`, `fixed = FALSE`, the default, uses POSIX 1003.2 extended regular expressions.
- `perl = TRUE`, `fixed = FALSE` uses Perl-style regular expressions.
- `perl = FALSE`, `fixed = TRUE` uses fixed matching.
- `perl = TRUE`, `fixed = TRUE` is an error.
You could make this choice more clear by using an enumeration (@sec-enumerate-options) maybe something like `engine = c("POSIX", "perl", "fixed")`.
That might look something like this:
```{r}
#| eval: false
grepl(pattern, string, engine = "regex")
grepl(pattern, string, engine = "fixed")
grepl(pattern, string, engine = "perl")
```
But there's an additional argument that throws a spanner in the works: `ignore.case = TRUE` only works with two of the three engines: POSIX and perl.
Additionally, it's a bit unforunate that the `engine` argument, which is likely to come later in the call, affects the `pattern`, the first argument.
That means you have to read the call until you see the `engine` argument before you can understand precisely what the `pattern` means.
An alternative approach, as used by stringr, is to provide some helper functions that encode the engine as an attribute of the pattern:
```{r}
#| eval: FALSE
grepl(pattern, regex(string))
grepl(pattern, fixed(string))
grepl(pattern, perl(string))
```
And because these are separate functions, they can take different arguments:
```{r}
regex <- function(pattern, ignore.case = FALSE) {}
perl <- function(pattern, ignore.case = FALSE) {}
fixed <- function(pattern) {}
```
This gives a very flexible interface which is particularly nice in stringr because it means there's an easy way to support boundary matching, which doesn't even take a pattern:
```{r}
#| message: false
library(stringr)
str_view("This is a sentence.", boundary("word"))
str_view("This is a sentence.", boundary("sentence"))
```
### Implementation
Lets flesh this interface into an implementation.
First we flesh out the pattern engine wrappers.
These need to return an object that has the name of engine, the pattern, and any other arguments:
```{r}
regex <- function(pattern, ignore.case = FALSE) {
list(pattern = pattern, engine = "regex", ignore.case = ignore.case)
}
perl <- function(pattern, ignore.case = FALSE) {
list(pattern = pattern, engine = "perl", ignore.case = ignore.case)
}
fixed <- function(pattern) {
list(pattern = pattern, engine = "fixed")
}
```
Then you could create a new `grepl()` variant that might look something like this:
```{r}
my_grepl <- function(pattern, x, useBytes = FALSE) {
switch(pattern$engine,
regex = grepl(pattern$pattern, x, ignore.case = pattern$ignore.case, useBytes = useBytes),
perl = grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case, useBytes = useBytes),
fixed = grepl(pattern$pattern, x, fixed = TRUE, useBytes = useBytes)
)
}
```
Or if you wanted to make it more clear how the engines differ, you could pull out a helper function that pulls out the repeated code:
```{r}
my_grepl <- function(pattern, x, useBytes = FALSE) {
grepl_wrapper <- function(...) {
grepl(pattern$pattern, x, ..., useBytes = useBytes)
}
switch(pattern$engine,
regex = grepl_wrapper(ignore.case = pattern$ignore.case),
perl = grepl_wrapper(perl = TRUE, ignore.case = pattern$ignore.case),
fixed = grepl_wrapper(fixed = TRUE)
)
}
```
Here I'm just wrapping around the existing `grepl()` implementation because I don't want to go into the details of its implementation; for your own code you'd probably inline the implementation.
I particularly like the `switch` pattern here and in stringr because it keeps the function calls close together, which makes it easier to keep them in sync.
You could also implement the same strategy using `if` or S7 generic functions, depending on your needs.
This is implementation a sketch that gives you the basic ideas.
For a real implementation you'd also need to consider:
- Are `fixed()`, `perl()`, and `regex()` the right names? Would it be useful to give them a common prefix?
- It would be better for the engines to return an S7 object instead of a list, so we could provide a print method to make them display more nicely.
- `grepl()` needs some error checking to ensure that `pattern` is generated by one of the engines, and probably should have a default path to handle bare character vectors as regular expressions (the current default).
You can see these detailed worked out in the stringr package if you look at the source code, particularly that of `fixed()`, `type()`, `opts()`, then `str_detect()`.
## How do I remediate past problems?
Changing from a complex dependency of individual arguments to a stra