05-vectorization.qmd

# Vectorization 

A European friend has a great job offer from USA but is concerned about gun violence. 

The `murders` dataset in the **dslabs** package includes data on gun murders for the US 50 states and DC. Use this to prepare a report for your fried to help them decide where to live. Note your friend likes hiking so might prefer the west. Your friend does not like low population density.

```{r}
library(dslabs)
```

## Arithmetics

```{r}
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
```

Convert to meters:

```{r}
heights * 2.54 / 100
```

Difference from the average:

```{r}
avg <- mean(heights)
heights - avg 
```

Exercise: compute the height in standardized units

```{r}
s <- sd(heights)
(heights - avg) / s
# can also use scale(heights)
```


If it's two vectors, it does it component wise:

```{r}
heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
error <- rnorm(length(heights), 0, 0.1)
heights + error
```


Exercise:

Add a column to the murders dataset with the murder rate in per 100,000.

```{r}
library(dslabs)
murders$rate <- with(murders, total / population * 10^5)
```


## Functions that vectorize

Most arithmetic functions work on vectors

```{r}
x <- 1:10
sqrt(x)
log(x)
2^x
```

Note that the conditional function `if`-`else` does not vectorize. A particularly useful function is a  vectorized version `ifelse`. Here is an example:

```{r}
a <- c(0, 1, 2, -4, 5)
ifelse(a > 0, 1/a, NA)
```

Other conditional functions, such as `any` and `all`, do vectorize.

## Indexing


Vectorization also works for logical relationships:

```{r}
ind <- murders$population < 10^6
```


You can subset a vector using these:

```{r}
murders$state[ind]
```


You can also use vectorization to apply logical operators:

```{r}
ind <- murders$population < 10^6 & murders$region == "West"
murders$state[ind]
```

## split

Split is a useful function to get indexes using a factor.
```{r}
inds <- with(murders, split(seq_along(region), region))
murders$state[inds$West]
```

## Functions for subsetting

The functions `which`, `match` and the operator `%in%` are 
useful for sub-setting

Here are some examples:

```{r}
ind <- which(murders$state == "California")
ind
murders[ind,]
```


```{r}
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
```

```{r}
c("Boston", "Dakota", "Washington") %in% murders$state
```

## sapply

You can apply functions that don't vectorize. Like this one:

```{r}
s <- function(n){
   return(sum(1:n))
}
```

Try it on a vector:

```{r}
ns <- c(25, 100, 1000)
s(ns)
```

We can use `sapply`
```{r}
sapply(ns, s)
```

`sapply` will work on any vector, including lists.


## Exercises

Now we are ready to help your friend. Let's give them options of places with low murders rates, mountains, and not too small.

For the following exercises do no load any packages other than **dslabs**.

(@) Show the subset of `murders` showing states with less than 1 per 100,000 deaths. Show all variables.

```{r}
if (exists("murders")) rm(murders)
library(dslabs)

murders$rate <- with(murders, total/population*10^5)
murders[murders$rate < 1,]
```


(@) Show the subset of `murders` showing states with less than 1 per 100,000 deaths and in the West of the US. Don't show the `region` variable.

```{r}
murders[murders$rate < 1 & murders$region == "West",]
```


(@) Show the largest state with a rate less than 1 per 100,000.

```{r}
dat <- murders[murders$rate < 1,]
dat[which.max(dat$population),]
```

(@) Show the state with a population of more than 10 million with the lowest rate.

```{r}
dat <- murders[murders$population >= 10^7,]
dat[which.min(dat$rate),]
```

(@) Compute the rate for each region of the US.

```{r}
indexes <- split(1:nrow(murders), murders$region)
sapply(indexes, function(ind) {
  sum(murders$total[ind])/sum(murders$population[ind])*10^5
})
```


More practice exercises:

(@) Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the list have? Hint: use `seq` and `length`.

(@) Make this data frame:

```{r}
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", 
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
```

Convert the temperatures to Celsius.

(@) Compute the following sum  

$$
S_n = 1+1/2^2 + 1/3^2 + \dots 1/n^2
$$

Show that as $n$ gets bigger we get closer $\pi^2/6$.

(@) Use the `%in%` operator and the predefined object `state.abb` to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?


(@) Extend the code you used in the previous exercise to report the one entry that is **not** an actual abbreviation. Hint: use the `!` operator, which turns `FALSE` into `TRUE` and viceversa, then `which` to obtain an index.

(@) Show all variables for New York, California, and Texas, in that order.