# Vectorization
A European friend has a great job offer from the USA but is concerned about gun violence.
The `murders` dataset in the **dslabs** package includes data on gun murders for the 50 US states and DC. Use this to prepare a report for your friend to help them decide where to live. Note that your friend likes hiking, so they might prefer the West. Your friend does not like low population density.
```{r}
library(dslabs)
```
## Arithmetics
```{r}
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
```
These heights are in inches. Convert them to meters:
```{r}
heights * 2.54 / 100
```
Difference from the average:
```{r}
avg <- mean(heights)
heights - avg
```
Exercise: compute the heights in standardized units.
```{r}
s <- sd(heights)
(heights - avg) / s
# can also use scale(heights)
```
When an operation involves two vectors of the same length, R applies it element-wise:
```{r}
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error
```
Exercise:
Add a column to the `murders` dataset with the murder rate per 100,000 people.
```{r}
library(dslabs)
murders$rate <- with(murders, total / population * 10^5)
```
## Functions that vectorize
Most arithmetic functions work element-wise on vectors:
```{r}
x <- 1:10
sqrt(x)
log(x)
2^x
```
Note that the conditional construct `if`-`else` does not vectorize: it expects a single `TRUE` or `FALSE`. A particularly useful alternative is the vectorized function `ifelse`, which evaluates the condition for each element. Here is an example:
```{r}
a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA)
```
Other logical functions, such as `any` and `all`, accept an entire logical vector but summarize it into a single `TRUE` or `FALSE`.
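As a small sketch of how `any` and `all` treat a logical vector, the chunk below redefines the vector `a` from the `ifelse` example so that it stands alone:

```{r}
a <- c(0, 1, 2, -4, 5)
# the comparison a < 0 is computed for every element,
# then any() reports whether at least one result is TRUE
any(a < 0)
# all() reports whether every element satisfies the condition
all(a >= 0)
```

Both calls return a single logical value, which makes them suitable for use inside `if`.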
## Indexing
Vectorization also works for logical comparisons:
```{r}
ind <- murders$population < 10^6
```
You can subset a vector using this logical index:
```{r}
murders$state[ind]
```
Logical operators such as `&` are vectorized too, so you can combine conditions:
```{r}
ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]
```
## split
`split` is a useful function for getting indexes grouped by the levels of a factor.
```{r}
inds <- with(murders, split(seq_along(region), region))
murders$state[inds$West]
```
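To see what `split` returns, here is a minimal sketch on a made-up factor `g` (a hypothetical toy example, not part of the `murders` data):

```{r}
# hypothetical toy factor with three levels
g <- factor(c("a", "b", "a", "c", "b"))
# one integer vector of positions per level
idx <- split(seq_along(g), g)
names(idx)
idx$a
```

The result is a named list, so each group's positions can be pulled out with `$` and used to subset any vector of the same length.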
## Functions for subsetting
The functions `which` and `match` and the operator `%in%` are useful for subsetting.
Here are some examples:
```{r}
ind <- which(murders$state == "California")
ind
murders[ind,]
```
```{r}
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
```
```{r}
c("Boston", "Dakota", "Washington") %in% murders$state
```
## sapply
You can also apply functions that don't vectorize, such as this one:
```{r}
# sum of the integers 1 through n; expects a single number n
s <- function(n){
  return(sum(1:n))
}
```
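Try it on a vector. Because `1:n` uses only the first element of `n` (R warns about this), we get a single number rather than one sum per entry: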
```{r}
ns <- c(25, 100, 1000)
s(ns)
```
We can use `sapply` to apply `s` to each element:
```{r}
sapply(ns, s)
```
`sapply` will work on any vector, including lists.
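As a brief illustration of the list case, the sketch below uses a made-up list `lst` (hypothetical, for demonstration only) whose elements have different lengths:

```{r}
# hypothetical list; each element is a numeric vector
lst <- list(first = 1:3, second = c(10, 20), third = 5)
# the function is applied to each list element in turn
sapply(lst, length)
sapply(lst, sum)
```

Because each call returns one number, `sapply` simplifies the results into a named vector.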
## Exercises
Now we are ready to help your friend. Let's give them options of places with low murder rates, mountains, and populations that are not too small.
For the following exercises do not load any packages other than **dslabs**.
(@) Show the subset of `murders` containing the states with a murder rate below 1 per 100,000. Show all variables.
```{r}
# remove any modified copy of murders from the current environment
if (exists("murders", inherits = FALSE)) rm(murders)
library(dslabs)
murders$rate <- with(murders, total/population*10^5)
murders[murders$rate < 1,]
```
(@) Show the subset of `murders` containing the states in the West of the US with a murder rate below 1 per 100,000. Don't show the `region` variable.
```{r}
# keep the matching rows and every column except region
murders[murders$rate < 1 & murders$region == "West", names(murders) != "region"]
```
(@) Show the largest state (by population) with a rate less than 1 per 100,000.
```{r}
dat <- murders[murders$rate < 1,]
dat[which.max(dat$population),]
```
(@) Show the state with a population of more than 10 million with the lowest rate.
```{r}
# strictly more than 10 million
dat <- murders[murders$population > 10^7,]
dat[which.min(dat$rate),]
```
```
(@) Compute the rate for each region of the US.
```{r}
indexes <- split(seq_len(nrow(murders)), murders$region)
sapply(indexes, function(ind) {
  sum(murders$total[ind]) / sum(murders$population[ind]) * 10^5
})
```
More practice exercises:
(@) Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the vector have? Hint: use `seq` and `length`.
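One possible approach, following the hint, is sketched below. The name `x` is just a placeholder chosen for this example:

```{r}
# from 6 up to at most 55 in steps of 4/7
x <- seq(6, 55, by = 4/7)
length(x)
# the last value stays at or below 55
max(x)
```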
(@) Make this data frame:
```{r}
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
"San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
```
The temperatures are in degrees Fahrenheit. Convert them to Celsius.
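A minimal sketch of the conversion, assuming the temperatures are recorded in degrees Fahrenheit and using the standard formula $C = (F - 32) \times 5/9$. The data frame is rebuilt here so the chunk stands alone:

```{r}
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
# vectorized arithmetic converts the whole column at once
# (assumes the original values are in Fahrenheit)
city_temps$temperature <- (city_temps$temperature - 32) * 5 / 9
city_temps
```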
(@) Compute the following sum:
$$
S_n = 1 + 1/2^2 + 1/3^2 + \dots + 1/n^2
$$
Show that as $n$ gets bigger we get closer to $\pi^2/6$.
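One way to approach this combines vectorized arithmetic with `sapply`. In the sketch below, `basel_sum` and `n_values` are hypothetical names introduced only for this example:

```{r}
# partial sum 1 + 1/2^2 + ... + 1/n^2 for a single n
basel_sum <- function(n) sum(1 / (1:n)^2)
# basel_sum expects one n at a time, so use sapply over several values
n_values <- c(10, 100, 1000, 10000)
S <- sapply(n_values, basel_sum)
# gap between the limit and each partial sum; it should shrink as n grows
gap <- pi^2 / 6 - S
gap
```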
(@) Use the `%in%` operator and the predefined object `state.abb` to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?
(@) Extend the code you used in the previous exercise to report the one entry that is **not** an actual abbreviation. Hint: use the `!` operator, which turns `FALSE` into `TRUE` and vice versa, then `which` to obtain an index.
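A possible sketch covering the two exercises on abbreviations. `state.abb` is built into R, while `abbs` and `is_abb` are names made up for this example:

```{r}
abbs <- c("MA", "ME", "MI", "MO", "MU")
# TRUE for each entry that appears in state.abb
is_abb <- abbs %in% state.abb
is_abb
# negate the logical vector, then get the position of the non-abbreviation
abbs[which(!is_abb)]
```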
(@) Show all variables for New York, California, and Texas, in that order.
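Because `match` returns positions in the order of its first argument, it is a natural fit here. The sketch below uses a stand-in data frame, `states`, built from R's built-in `state.name` and `state.region` so that it runs without any packages. The same pattern should carry over to `murders` by matching against `murders$state`:

```{r}
# stand-in data frame (an assumption for illustration, not the murders data)
states <- data.frame(state = state.name, region = state.region)
# positions of the requested states, in the requested order
ind <- match(c("New York", "California", "Texas"), states$state)
states[ind, ]
```

A logical index built with `%in%` would select the same rows, but in the order in which they appear in the data frame rather than the order requested.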