07-multivar_eda.Rmd

# Multivariate Data Exploration {#eda2}

```{r eda2, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center')
library(tidyverse)
library(knitr)
library(scales)
library(GGally)
library(kableExtra)
library(gridExtra)
```

## Chapter 7 Objectives

This Chapter is designed around the following learning objectives. Upon
completion, you should be able to:

- Understand the use and context of bivariate data
- Recognize the motivation behind a scatterplot from different perspectives
- Define correlation, causation, and the difference
- Create and interpret a scatterplot between two variables
- Leverage `ggplot2` for multivariate exploratory data analysis through facets and colors

## Bivariate Data {#bivariate}

Whereas univariate data analyses are directed at "getting to know" the
observations made for a single variable, bivariate---and multivariate---
analyses are designed to examine the *relationship* that may exist between two
(or more) variables. Like the Chapter on [Univariate EDA]({#eda1), we will
focus first on ***data exploration***, which is a key step towards "getting to
know" your data and one that should always proceed inferential statistics, or
*making conclusions about your data*.

```{block, type="rmdnote"}
Bivariate means *two variables* where the observations are paired; each 
observation samples both variables so that they are linked.
```

## Scatterplot {#scatt}

Undoubtedly, you have seen scatterplots many times before; we will discuss them
in more detail here. The **scatterplot** allows you to assess the strength,
direction, and type of relationship between two variables. This can be
important for determining factors like:  

* Correlation  
* Linearity  
* Performance (of a measurement) in terms of precision, bias, and dynamic range

Traditionally, a scatterplot shows paired observations of two variables with
the ***dependent variable*** on the y-axis and the ***independent variable***
on the x-axis. Creating a plot in this way means that, before you begin, you
must make a judgment call about the direction of the relationship (i.e., which
variable *depends* on which). This relates to the scientific method and 
*linear regression*; we will discuss the latter in more detail later. For the
purposes of exploratory data analysis, however, it actually doesn't matter
which variable goes on which axis. That said, since we don't wish to break with
tradition, let's agree to follow the guidelines on independent (predictor) and 
dependent (outcome/response) variables. Although related, the researcher's
purpose or "mode" affects the intention behind a plot: 

 - **Scientific Method:** The experimenter manipulates the control variable (independent; on x-axis) and observes the changes in the response variable (dependent; y-axis).   
 
 - **Statistics:** The independent variable (x-axis) is thought to have some predictive ability, correlation with, or control over the dependent variable (y-axis).

 - **Exploratory Data Analysis:** We throw two variables on a plot to investigate their relationship. We make a guess about which is the independent variable (x-axis) and which is the dependent variable (y-axis), and we hope that nobody calls us out if we got it wrong...  
 
Scatterplots are created using `geom_point()` by naming `x = ` and `y = ` variables within the `aes()` function.  

### Causality

All this talk about dependent and independent variables is fundamentally rooted
in the practice of ***causal inference*** reasoning, which is the ability to 
say that "action A *caused* outcome B". Discovering---*or proving*---that one
thing caused another to happen can be incredibly powerful. Proving causality
leads to Nobel Prizes, creation of new laws and regulations, judgment of guilt
or innocence in court, changing human behavior and convincing human minds, and,
simply put, more understanding.

A full treatment of causal inference reasoning is beyond the scope of this
course, but we will, from time to time, delve into this topic.  The art of data
science can be a beautiful and compelling way to demonstrate causality, but we
need to learn to crawl before we can walk, run, or fly. For now, let's put
aside the pursuit of causation and begin with ***correlation***.

### Correlation {#corr}

The ***scatterplot*** is a great way to visualize *whether*, and, to some
extent, *how*, two variables are related to each other.  

```{block, type="rmdnote"}
Correlation: A mutual relationship or connection between two or more things;
the process of establishing a relationship or connection between two or more
measures. The variables can move up or down together or be inversely related.
```

```{r corr_plots1, warning=FALSE, echo=FALSE, include=FALSE}
# set seed to repeat values
set.seed(7)
# generate data for corr plots
corr_plots1 <- tibble::tibble(x = runif(100, 0, 100),
                              y1 = x,
                              y2 = purrr::map_dbl(x, 
                                                  ~ (.x*(rnorm(1, 
                                                               mean = 1,  
                                                               sd = 0.06)) +
                                                       rnorm(1, 
                                                             mean = 0, 
                                                             sd = 5))),
                              y3 = purrr::map_dbl(x,
                                                  ~ (.x*(rnorm(1, 
                                                               mean = 0.6,  
                                                               sd = 0.2)) +
                                                       rnorm(1, 
                                                             mean = 15, 
                                                             sd = 15))),
                              y4 = runif(100, 0, 105)) %>%
  tidyr::pivot_longer(y1:y4, names_to = "type") %>%
  dplyr::mutate(type = as.factor(type))
# label types of correlation
levels(corr_plots1$type) <- c("Perfect Correlation", 
                              "Strong Correlation", 
                              "Moderate Correlation", 
                              "No Correlation")
# create corr plot panel
p1 <- ggplot2::ggplot(data = corr_plots1,
                      mapping = aes(x = x,
                                    y = value)) +
  geom_point(aes(color = type),
             alpha = 0.5) +
  facet_wrap(facets = vars(type),
             ncol = 2) +
  theme_bw() +
  coord_fixed(ratio = 1.0) +
  labs(x = "Independent Variable (x-axis)",
       y = "Dependent Variable (y-axis)") +
  theme(legend.position = "none")
# save plot to embed
ggsave("./images/correlation_example_1.png", dpi = 150)
```

Below are four examples of bivariate data (shown in scatterplots) with differing degrees of
correlation: perfect, strong, moderate, and none. These are qualitative terms,
of course; what is "moderate" to one person may be poor and unacceptable to
another. The qualitative strength of the correlation also depends on the 
research context. 

If you want to understand how to assess the strength of correlation 
quantitatively, you can explore the Pearson Correlation Coefficient (***r***)
in the [Appendix](#pearson), which is used to quantify the degree of *linear*
correlation between two variables.

```{r corr-example-1, echo=FALSE,  out.width="700pt", fig.align="center", fig.cap="Scatterplot examples showing bivariate data with varying degrees of correlation."}
knitr::include_graphics("./images/correlation_example_1.png")
```

```{r corr-plots2, warning=FALSE, echo=FALSE, include=FALSE}
# generate data for corr strength plots
corr_plots2 <- tibble::tibble(x = runif(100, 0, 100),
                              y1 = purrr::map_dbl(x, 
                                                  ~ (.x*(rnorm(1, 
                                                               mean = .8,  
                                                               sd = 0.15)) + 
                                                       rnorm(1, 
                                                             mean = 10, 
                                                             sd = 10))),
                              y2 = purrr::map_dbl(x, 
                                                  ~ -(.x*(rnorm(1, 
                                                                mean = 0.6,  
                                                                sd = 0.06)) -
                                                        rnorm(1, 
                                                              mean = 70, 
                                                              sd = 5))),
                              y3 = purrr::map_dbl(x, 
                                                  ~ (.x*(rnorm(1, 
                                                               mean = 1,  
                                                               sd = 0.02))^2 +
                                                       rnorm(1, 
                                                             mean = 0, 
                                                             sd = 2))),
                              y4 = purrr::map_dbl(x, 
                                                  ~ (100*exp(-.x/20)*
                                                       (rnorm(1,
                                                              mean = 1,  
                                                              sd = 0.03)) +
                                                       rnorm(1, 
                                                             mean = 4, 
                                                             sd = 2)))

                              ) %>%
  tidyr::pivot_longer(y1:y4, names_to = "type") %>%
  dplyr::mutate(types = as.factor(type))
# label types of correlation
levels(corr_plots2$types) <- c("Positive Correlation", 
                               "Negative Correlation", 
                               "Linear Correlation", 
                               "Non-Linear Correlation")
# create corr plot panel
p2 <- ggplot2::ggplot(data = corr_plots2,
                      mapping = aes(x = x, 
                                    y = value)) +
  geom_point(aes(color = types,
                 alpha = 0.5)) +
  facet_wrap(facets = vars(types)) +
  theme_bw() +
  coord_fixed(ratio = 1) +
  theme(legend.position = "none") +
  labs(x = "Independent Variable",
       y = "Dependent Variable")
# save plot to embed
ggsave("./images/correlation_example_2.png", dpi = 150)
```

In addition to the strength of the correlation, the sign and form of the
correlation can vary, too:  

  - **positive correlation**: the dependent variable *trends in the same direction* as the independent variable; when `y` increases, `x` increases, too.   
  - **negative correlation**: the dependent variable *decreases* when the independent variable *increases*  
  - **linear correlation**: the relationship between the two variables can be shown with a straight line  
  - **non-linear correlation**: the relationship between the two variables is curvilinear  

```{r corr-example-2, echo=FALSE, out.width="700pt", fig.align="center", fig.cap="Scatterplot examples showing bivariate data with varying types of correlation."}
knitr::include_graphics("./images/correlation_example_2.png")
```

### Correlation $\neq$ causation

Did you know that being a smoker is correlated with having a lighter in your
pocket? Furthermore, it can be shown that keeping a lighter in your pocket is
correlated with an increased risk of developing heart disease and lung cancer.
Does this mean lighters in your pocket *cause* lung cancer?

```{block, type="rmdnote"}
Causation: the process or condition by which one event (cause) contributes to
the occurrence of another event (effect). In this process, the cause is partly
or wholly responsible for the effect. 
```

Let's take a closer look at the dangers of mistaking a *correlated*
relationship as a *causal* relationship between two variables. Shown below is a
scatterplot that builds off the `mpg` dataset we first discussed in Chapter
\@ref(dataviz). Using the `mpg` dataframe, we will plot the relationship
between the number of cylinders in an engine (`cyl`; the independent variable)
and that vehicle's fuel economy (`hwy`; the dependent variable).

```{r corr-example-3, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Scatterplot of Engine Displacement vs. Fuel Economy"}
# create scatterplot of engine displacement vs fuel economy
ggplot2::ggplot(data = (mpg),
                mapping = aes(x = cyl, 
                              y = hwy)) +
  geom_jitter(aes(x = cyl, 
                  y = hwy),
             shape = 1,
             color = "blue",
             width = 0.1,
             height = 0.05,
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE,
              size = 0.5,
              color = "blue") +
  xlab("Number of Cylinders") +
  ylab("Highway Fuel Economy (mi/gal)") + 
  theme_classic()
```

Looking at this plot, there appears a clear correlation between the number of
cylinders in a vehicle and its fuel efficiency. A linear fit through these data
gives a Pearson correlation coefficient of -0.76, which is not a perfect
relationship but a strong one, nonetheless. Does this mean that a 
*causal relationship* exists? If so, then we only need to mandate that all
future vehicles on the road be built with 4-cylinder engines, if we want more a
fuel-efficient fleet! That mandate, of course, would likely produce minimal
effect. Just because two variables are correlated doesn't mean that a change in
one will ***cause*** a change in the other.

Those who understand internal combustion know that the number of cylinders is a
design parameter related more to engine power than to engine efficiency. In
other words, the number of cylinders helps determine total displacement
volume. Indeed, the causal relationship for fuel efficiency, in terms of miles
traveled per gallon, is due more directly to the energy conversion efficiency of the engine,
vehicle drag coefficient, and vehicle mass. If you want more fuel-efficient
cars and trucks, you need more efficient engines that weigh less. In the 1990s
and early 2000s nearly all engine blocks were made from cast iron. Today,
nearly all engine blocks are made from aluminum. Can you guess why? 

## Exploring Multivariate Data

With *multivariate data*, we often consider more than just two variables;
however, visualizing more than two variables in a single plot can be
challenging. There are advanced statistical approaches to exploring such data,
including multivariate regression, principal component analysis, and machine
learning approaches, but these techniques are beyond the scope of this course.
Here, I will introduce a few graphical techniques that are useful for exploring
multivariate data.

### Facets

One easy way to evaluate two or more variables is to create multiple plots (or
facets) with the `ggplot2::facet()` function. This function creates a series of
plots, as panels, where each panel represents a different value (or level) of a
third variable of interest. For example, let's create a `ggplot` object from
the `mtcars` data set that explores the relationship between a vehicle's fuel
economy and its weight. First, let's create a simple bivariate scatterplot of
these data (`mpg` vs. `wt`) and fit a linear model through the data. (Note: we
haven't discussed modeling yet but more on that [later](#model)).

```{r facet-1, message=FALSE, warning=FALSE, fig.cap="Scatterplot of fuel economy vs. vehicle weight from the `mtcars` dataset."}
# fit a linear model; the ~ means we are modeling mpg as "y" and wt as "x"
g1_model <- lm(mpg ~ wt, data = mtcars)
# create a plot, assign it to an object named `g1`
g1 <- ggplot2::ggplot(data = mtcars, 
                      mapping = aes(x = wt, 
                                    y = mpg)) + 
  geom_point() +
  geom_smooth(model = g1_model, 
              method = "lm") + 
  ylab("Fuel Economy (mi/gal)") +
  xlab("Vehicle Weight (x1000 lb)") +
  theme_bw(base_size = 14)
g1
```

Looking back at Figure \@ref(fig:corr-example-3), we know that the number of
cylinders (`cyl`) is also associated with fuel efficiency, and many of the
vehicles from `mpg` have different `cyl` numbers. To examine these three
variables together (`mpg`, `wt`, and `cyl`), we can create a scatterplot that
is *faceted* according to the `cyl` variable. This is relatively easy to do in
`ggplot2` by adding a `facet_grid()` layer onto our ggplot object. The key
details to pass to `facet_grid()` are:  

  1. Whether we want to see the facets as **rows** or **columns**, and
  2. The variable being used to create the facets.

These two specifications can be made as a single argument to `facet_grid()` in
the form:

  - `facet_grid(rows = vars(variable))` *or*
  - `facet_grid(cols = vars(variable))`,
  where *variable* is the name of the column vector used to define the facets.
  Note: `vars()` is a `ggplot2::` function that allows you to specify a column variable by name (instead of using `data.frame$variable`) to define the facets, similar to how we define variables in the `aes()` function.

In this case, seeing the plots in columns seems fine, so we would add  
`facet_grid(cols = vars(cyl)` to the `ggplot` object `g1` as follows:

```{r facet-2, message=FALSE, warning=FALSE, fig.cap="Scatterplots of fuel economy vs. vehicle weight by number of cylinders in the engine (data from the `mtcars` dataset)."}
# facet previous plot by `cyl` columns and retain labels
g1 + facet_grid(cols = vars(cyl),
                labeller = label_both) # label each panel w/ variable name & value
```

Interestingly, but perhaps not surprising, we can see that the vehicles with
different cylinder numbers tend to have different fuel efficiency, but, even
within these facets, we still see a relationship between efficiency and vehicle
weight. Note that because the original `ggplot` object (`g1`) contained a linear model, the faceting call led to the creation of three (separate) linear models---one for each facet. 

Here are the same data in a plot that is faceted by rows instead of columns. Take note how `ggplot2::` allows you add the facet to the original plot object using a
`+`, as in: `g1 + facet_grid(...)`

```{r facet-3, message=FALSE, warning=FALSE, fig.cap="Scatterplots of fuel economy vs. vehicle weight by number of cylinders in the engine (data from the `mtcars` dataset)."}
# facet previous plot by`cyl` in rows and retain labels
g1 + facet_grid(rows = vars(cyl),
                labeller = label_both)
```

### Colors

We can also use color to indicate variation in data; this can be useful for
introducing a third variable into scatter, jitter, and time-series plots (or when plotting multiple boxplots, histograms, or cumulative distributions). When
introducing **color as a variable** into a plot, you must do so through an
*aesthetic*, such as `geom_point(aes(color = cyl))`.

Let's recreate Figure \@ref(fig:facet-1) and highlight the `cyl` variable using
different colors. The addition of color provides us with the same level of
insight as the facets above.

```{r colorplot-1, message=FALSE, warning=FALSE, fig.cap="Vehicle fuel economy vs. weight and colored by number of engine cylinders (data from mtcars)"}
# instruct R to treat the `cyl` variable as a factor with discrete levels
# this, in turn, tells ggplot2 to assign discrete colors to each level
# `cyl` as a factor with four levels
mtcars$cyl <- as.factor(mtcars$cyl)
# recreate previous plot with color option
g3 <- ggplot2::ggplot(data = mtcars,
                      mapping = aes(x = wt, 
                           y = mpg, 
                           color = cyl)) + 
  geom_point() +
  ylab("Fuel Economy (mi/gal)") +
  xlab("Vehicle Weight (x1000 lb)") +
  theme_bw(base_size = 14)
# call plot
g3
```

```{block, type="rmdwarning"}
When using color, be aware that many people are unable to distinguish red from
green or blue from yellow. Many options exist to avoid issues from color
blindness (e.g., `viridis` palette) and websites like
[color-blindness.com](https://www.color-blindness.com/coblis-color-blindness-simulator/){target="_blank"} 
allow you to upload image files so that you can see what your plot looks like to someone with color blindness.
```

Here is an updated version of Figure \@ref(fig:colorplot-1) that avoids issues
with color blindness and, better yet, differentiates the `cyl` variable with
both colors and symbols.

```{r colorplot-2, message=FALSE, warning=FALSE, fig.cap="Vehicle fuel economy vs. weight and colored by number of engine cylinders (data from mtcars)"}
# recreate previous plot with color-blind-friendly colors and shapes
ggplot2::ggplot(data = mtcars,
                mapping = aes(x = wt,
                              y = mpg,
                              color = cyl, 
                              shape = cyl)) + # distinguish by shape also!
  geom_point(size = 2.5) +
  ylab("Fuel Economy (mi/gal)") +
  xlab("Vehicle Weight (x1000 lb)") +
  scale_colour_manual(values = c("sandybrown", # color-blind-friendly colors
                                 "orangered", 
                                 "steelblue2")) +
  theme_bw(base_size = 14)
```

```{block, type="rmdtip"}
Whenever you use **color** to differentiate variables, it's a good idea to use symbols, too, if possible. 
```

## Chapter 7 Exercises

### Fuel economy data

This in-class exercise will conduct an exploratory, multivariate data 
analysis on vehicle fuel economy. We will begin by downloading a .zip file from
[fueleconomy.gov](https://www.fueleconomy.gov/feg/ws/){target="_blank"}, which
is a Federal program that tracks the fuel economy of all vehicles sold in the
United States. The .zip file contains a .csv with fuel economy information for
nearly every vehicle manufactured  between 1984 and today. We will use the
`readr` and `dplyr` packages to load and clean the data, respectively. A data
dictionary (something that defines and explains each variable in the dataset)
is also available at the website above.

### Import and wrangle data

The first code chunk will download the data directly into a temporary file
using `download.file()` from base R. We will then `unzip()` (base R) the temp
file into a .csv and use `readr` to read that .csv into a dataframe named
`raw_data`. 

```{r load-data, message=FALSE, cache=FALSE, warning=FALSE}
# create an empty temporary object to hold the zipped data
temp <- base::tempfile()
# download the file into temp object
utils::download.file(
  url = "https://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
  destfile = temp,
  mode="wb")
# unzip the folder within temp object to acess csv
temp2 <- utils::unzip(temp, 
                      "vehicles.csv", 
                      exdir = "./data/") # unzip .csv to local directory
# import csv into df object
raw_data <- readr::read_csv(temp2) #read the csv into a data frame
# delete the temp file
base::unlink(temp) 
# remove the two temp objects from local environment
base::rm(temp, temp2) 
```

**Public Service Announcement**: Notice the use of `rm()` from base R in the 
above code. This is an example of a recommended use case of this function;
do not EVER use this function to "restart" your R session. If you do, [Jenny
Bryan will find you and throw your computer out the window](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/){target="_blank"}.

Looking at the `raw_data` dataframe, we see there are 83 variables with over 
45,000 observations.  That's a LOT of vehicles! In most analyses of large
datasets, we don't need to inspect every variable. Let's create a vector of
variables (`vars_needed`) that we want to keep and pass that vector to
`dplyr::select()` to retain only the variables we want. To pass a character 
vector as an argument to `dplyr::select()`, instead of just a single column
name, we use the `all_of()` function, which is an argument modifier from the
`tidyselect` R package. You can type `?tidyselect::all_of` in the R console to
learn more. Essentially, `tidyselect::all_of()` tells `dplyr::select()` to 
expect a character vector of column names to retain in the datset.

```{r fuel-econ-cleaning-1}
# create a vector of var names to retain 
vars_needed <- c("id",
                 "make",
                 "model",
                 "year",
                 "cylinders", 
                 "displ", 
                 "drive",
                 "trany",
                 "VClass",
                 "fuelType1",
                 "comb08",
                 "highway08",
                 "city08")
# select necessary variables
df_mpg <- raw_data %>%
  dplyr::select(tidyselect::all_of(vars_needed))  
# remove full dataframe from environment
rm(raw_data) 
```

As you have seen before, some of these variables can be coded as **factors**,
which are *categorical* variables that can be classified into discrete levels.
For example, there are a finite number of vehicle transmission (`trany`) or
drivetrain (`drive`) types on the market, and, by telling R to code these data
as factors, we can analyze these variables in categorical form.

First, we will create a vector of variable names that we want to code as 
factors, `vars_factr`. Then we will apply the `as.factor()` function to those 
variables using `dplyr::mutate(dplyr::across())`. The `across()` function from
`dplyr` allows one to apply the same transformation to multiple columns in a
dataframe and is similar to how we used `tidyselect::all_of()` in the previous
example. We will also take the opportunity to rename a few of these variables,
following our naming guidelines discussed in earlier chapters, and to filter
the data to retain only vehicles made after the year 2000.

```{r fuel-econ-cleaning-2}
# identify the columns that we want to convert to a factor
vars_factr <- c("make", "drive", "trany", "VClass", "fuelType1")
# overwrite existing df
df_mpg <- df_mpg %>% 
  # convert select vars to factors 
  dplyr::mutate(dplyr::across(tidyselect::all_of(vars_factr), 
                              .fns = as.factor)) %>% 
  # overwrite column names with simpler names
  dplyr::rename(fuel_type = fuelType1, 
                cyl = cylinders, 
                tran = trany, 
                v_class = VClass) %>% # easier string to type
  # keep only data collected after 2000 for the sake of simplicity
  dplyr::filter(year >= 2000) 
# remove previous objects from global environment
rm(vars_needed, vars_factr)
```

### Check data

Begin as we always do, by simply looking at some of the data.

```{r df-mpg-desc-1}
# examine first five rows of df
head(df_mpg)
```

Next, let's take a look at some of the factor levels. There are lots of ways
to do this in R, but the `levels()` function is the most straightforward.

```{r df-mpg-desc-2}
# print a character vector of levels for the drive variable in df_mpg
levels(df_mpg$drive) 
```

```{r df-mpg-desc-3}
# print a character vector of levels for the fuel_type variable in df_mpg
levels(df_mpg$fuel_type) 
```

```{r df-mpg-desc-4}
# print a character vector of levels for the v_class variable in df_mpg
levels(df_mpg$v_class) 
```

### Check for missing data

Next, let's see whether this dataframe contains missing data (`NA`s).  

``` {r mpg-na}
# count instances of missing values within df_mpg
sum(is.na(df_mpg))

# note this can work as a pipe, too
# df_mpg %>% is.na() %>% sum()
```

With a dataframe of this size, we shouldn't be surprised that there are 
`r sum(is.na(df_mpg))` `NA` values present. The next question is: where do
these `NA` values show up? There are several ways to answer this question;
here, we will use the `stats::complete.cases()` function with a
`dplyr::filter()` search. 

#### Example 1: Missing Data
The `compelete.cases()` function returns a logical vector indicating
which rows are *complete* (i.e., no missing values). The opposite of this
logical function, `!complete.cases()`, should return ONLY those rows that do 
contain `NA`s. Let's create a subset of `df_mpg` that only contains rows with
`NA` present.

```{r complete-cases-NA}
# return a df of rows with missing values
df_mpg %>%
  filter(!complete.cases(.))
```

Here, we discover that most of the the `NA` values are in the `cyl`, `displ`, 
and `tran`, columns. Further, we see that **all** of these vehicles have a 
`fuel_type` of *Electricity*, which makes sense because electric vehicles (EVs) do not
have internal combustion. This may be a variable level that we choose to
exclude from certain analyses later... 

**You might have noticed the "dot", `.`, used at the end of the `dplyr::filter` pipe above:**  
  
`df_mpg %>% filter(!complete.cases(.))`  
  
*Why is there a `.` as an argument to a function?*  

We use `.` as a [*placeholder argument*](https://magrittr.tidyverse.org/reference/pipe.html#arguments) when more than one function is nested within a single section of pipe (`%>%`).  The `complete.cases()` function requires an argument to run, so we must tell it to operate on the same dataframe on which `filter()` operates. The `.` accomplishes this need for an argument.  In other words, these three lines of code are equivalent:  

```{r magrittr-dot, eval=FALSE}
df_mpg %>% filter(!complete.cases(.))

df_mpg %>% filter(!complete.cases(df_mpg))

filter(df_mpg, !complete.cases(df_mpg))
```

***Note:*** the `.` only works as a placeholder argument within a `%>%` call. The following code will throw an error because there is no pipe present:

```{r bad-code, eval=FALSE}
#this code will throw an error
filter(df_mpg, !complete.cases(.))  
```
#### Example 2: Missing Data

Filter the `df_mpg` data for *any variables* (`dplyr::any_vars()`) that contain
`NA`. Like `tidyselect::all_of()` that was used to clean the dataframe above,
the `any_vars()` function is a helper function designed for use within `dplyr`
and `tidyr` verbs.   
<br>*A word to the wise*: `dplyr::any_vars()` and similar variants 
have been **superseded**, which means new updates are not being made to them, and
the developers will eventually encourage users to transition to using
`dplyr::across()`. If you are interested, take a look at [this discussion thread](https://community.rstudio.com/t/using-filter-with-across-to-keep-all-rows-of-a-data-frame-that-include-a-missing-value-for-any-variable/68442)
on the topic.  You can also type `vignette("colwise")` into the console to introduce
yourself to the `across()` function.  In the code below, I provide examples of the
*superseded* code and the more current (updated) code.  

```{r df-mpg-desc-6}
# (superseded) filter the df for rows where any of the observations contain NA
# df_mpg %>% 
#  dplyr::filter_all(dplyr::any_vars(is.na(.)))

# (updated) example to filter the df for rows containing NA 
df_mpg %>%
  filter(if_any(.cols = everything(), 
                .fns = ~ is.na(.x)))

```

#### Example 3: Missing Data

Note: In [Chapter 8](#rprog4), you will learn to "map" the `sum()` and `is.na()` functions to each column of the data frame using `map_dfc` from the `purrr` package, which is designed to apply one ore more functions across columns of a data frame. This approach is the recommended way and I show it mainly as a preview of things to come in [Chapter 8](#rprog4).

``` {r df-mpg-desc-5, eval=FALSE}
# preview of code we will learn in Chapter 8
df_mpg %>% purrr::map_dfc(~sum(is.na(.)))
```

### Visualize data

#### Tidy data

Our primary variable of interest is fuel economy, so let's begin by visualizing
the location, dispersion, and shape of `city08`, `highway08`, and `comb08`, 
which represent city, highway and combined fuel economy estimates. Note that 
`df_mpg` is not in a tidy format, as fuel economy data is presented in three
different columns. We'll fix that with `tidyr::pivot_longer()`.

```{r tidy-mpg}
# before visualizing the data, convert to tidy format
tidy_mpg <- df_mpg %>%
  # choose simple names
  dplyr::rename(city = city08,
                hwy = highway08,
                comb = comb08) %>%
  # change data structure from wide to long 
  tidyr::pivot_longer(cols = c("city", "hwy", "comb"),
                      names_to = "metric",
                      values_to = "mpg") %>%
  # switch var to factor
  dplyr::mutate(metric = as.factor(metric))
```

#### Multivariate EDA options

With the data properly tidy, we can use our basic EDA plots with `stat_ecdf()`,
`geom_boxplot()`, and `geom_histogram()`, to visualize the fuel economy data.  
To show the different variables within the same plot, we use `color = ` 
or `fill = ` as an extra aesthetic. All three plots are then laid out using
`gridExtra::grid.arrange()`. 

```{r tidy-mpg-plot, fig.cap="Cumulative Distribution, Histogram, and Boxplots of 21st Centurty Vehicle Fuel Efficiency.", fig.align="center", fig.height=7}
# create cumulative distribution function plot
ecdf <- ggplot2::ggplot(data = tidy_mpg, 
                        mapping = aes(x = mpg, 
                                      color = metric)) +
  stat_ecdf() +
  theme_bw() +
  scale_x_log10() +
  xlab(NULL) +
  ylab("Quantile") 
# create boxplot
box <- ggplot2::ggplot(data = tidy_mpg, 
                       mapping = aes(x = mpg, 
                                     fill = metric, 
                                     y = metric)) +
  geom_boxplot(outlier.alpha = 0.05) +
  theme_bw() +
  scale_x_log10() +
  xlab(NULL) +
  ylab("Metric") 
# create histogram
hist <- ggplot2::ggplot(data = tidy_mpg, 
                       mapping = aes(x = mpg, 
                                     fill = metric)) +
  geom_histogram(bins = 35,
                 alpha = 0.75,
                 position = "stack") +
  theme_bw() +
  scale_x_log10() +
  xlab("Fuel Economy, mi/gal") +
  ylab("Counts")
# embed plots into one figure
gridExtra::grid.arrange(ecdf, box, hist, 
                        widths = c(0.4,1,0.4),
                        layout_matrix = rbind(c(NA, 1, NA),
                                              c(NA, 2, NA),
                                              c(NA, 3, NA)))
```

### Multivariate time series 

Next, let's look at a rough time series (by year) of all the combined fuel 
economy values, `comb08`, for all vehicle observations. The fuel economy data 
will be shown with boxplots, and we will use `group = year` as an aesthetic to 
show the overall time series. We will use a log-scale y-axis due to the large
variation expected, and outliers will be made more transparent to soften their
effect.

```{r fuel-econ-ts, warning=FALSE, fig.cap="Boxplots of fleet-wide fuel efficiency by year.", fig.align="center"}
# create time series plot 
e1 <- ggplot2::ggplot(data = df_mpg, 
                      mapping = aes(x = year, 
                                    y = comb08)) +
  geom_boxplot(aes(group = year),
               fill = "skyblue",
               outlier.alpha = 0.1) +
  scale_y_log10(limits = c(10,150)) +
  ylab("Combined Fuel Efficiency (mi/gal)") +
  xlab("") +
  theme_bw(base_size = 14)
# call plot
e1
```

#### Consider context

[This Department of Energy website](https://afdc.energy.gov/data/10562) outlines
the *Corporate Average Fuel Economy* (CAFE) standards from the Environmental
Protection Agency that require vehicles to meet set fuel economy levels, in 
terms of miles per gallon, or "mpg", across the "fleet" of available vehicles.
Let's load a .csv file named `cafe` and look at the requirements by year.

```{r CAFE-stds, message = FALSE}
# import data
cafe <- read_csv("./data/CAFE_stds.csv", col_names = c("year", "mpg_avg"), skip = 1)
# plot cafe data
cafe.plot <- ggplot2::ggplot(data = cafe, 
                    mapping = aes(group = year, 
                                  y = mpg_avg, 
                                  x = year)) + 
  geom_col(fill = "maroon") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, 
                                   hjust = 1)) +
  labs( y = "Required Average Fuel Efficiency (mpg)", 
        x = "Year", 
        title = "Federal Combined Average Fuel Economy (CAFE) Standards")
# call cafe plot
cafe.plot 
```

This is a nice 
[explanatory story](https://www.edmunds.com/fuel-economy/faq-new-corporate-average-fuel-economy-standards.html) 
of why the average fuel economy numbers don't match well to the published CAFE
standards. 

### Density plots

A ***density plot*** is akin to a *smoothed histogram*, where the y-axis depicts the frequency (or probability of occurrence) and the x-axis represents the magnitude of the observed variable.  They are called though `ggplot2::geom_density` and `ggplot2::stat_density`.  Density plots are useful when you have a lot of data (typically hundreds to thousands of observations) because they allow you to visualize the **shape** of a distribution - namely, it's central tendency(ies), dispersion, and skew.

Create a series of geom_density plots showing combined fuel economy `comb08` 
across all vehicles and years as a function of `fuel_type`. Break out the 
different `fuel_type` categories into facets (ncol = 3).

```{r eda-fuel-1}
# density plot of combined fuel economy by fuel type with facets
g1 <- ggplot2::ggplot(data = df_mpg) +
  geom_density(aes(x = comb08, # notice the different location option for x aes
                   fill = fuel_type)) + 
  facet_wrap(~ fuel_type,
             ncol = 3) +
  scale_x_log10() +
  theme_bw(base_size = 14) +
  theme(legend.position = "none")
# call plot
g1
```

Now show the same plot without facets.

```{r eda-fuel-2}
# density plot of combined fuel economy by fuel type without facets
g2 <- ggplot2::ggplot(data = df_mpg, 
                      mapping = aes(x = comb08)) + # now x aes is here 
  geom_density(aes(fill = fuel_type),
               position = "identity",
               alpha = 0.6,
               adjust = 1) +
  scale_x_log10() +
  theme_bw() 
# call plot
g2
```

I like these plots because they lead to more questions (and that's the *point* of exploratory data analysis)!  Why are some of the categories bimodal? Why do many natural gas vehicles tend to have the lowest fuel efficiency?  Are any of these vehicles tested on multiple fuel types?  What is the most fuel efficient vehicle sold today?

### More multivariate plotting

Next, let's examine the effect of vehicle class on combined fuel economy for a
single year: 2023. Note, for the plot below, since *vehicle class* (`v_class`) is a `factor` type of variable, I can structure the plot to show `v_class` from lowest median efficiency to highest median efficiency (where the median is taken from each class based on the `highway_08` variable). This is accomplished with the `fct_reorder()` function from the `forcats::` package: `forcats::fct_reorder(.f = v_class, .x = highway, .fun = median)`.

```{r v-class, fig.cap="Highway Fuel Economy for 2023 Vehicles by Type", fig.align="center"}
# filter to year 2023 and reorder factor levels 
df_mpg.2023 <- df_mpg %>%
  dplyr::filter(year == 2023) %>%
  dplyr::mutate(v_class = forcats::fct_reorder(.f = v_class, 
                                               .x = highway08, 
                                               .fun = median))
# create plot of 2023 combined fuel economy
g1 <- ggplot2::ggplot(data = df_mpg.2023) + 
  geom_boxplot(aes(x = highway08, 
                   color = v_class, 
                   y = v_class)) + 
  scale_x_log10() +
  theme_bw() +
  ylab("") +
  xlab("Highway Fuel Economy") +
  theme(legend.position = "none")
# call plot
g1
```

### Practice questions

Finally, let's ask some simple questions and use some basic data wrangling to 
get the answers.

**Question 1**

Among 4-cylinder vehicles with Front-Wheel Drive, what make/model has the best highway fuel economy in 2018?

```{r Q1}
df_mpg %>%
  dplyr::filter(cyl == 4, 
                drive == "Front-Wheel Drive", 
                year == 2018) %>%
  dplyr::slice_max(order_by = highway08, 
                   n = 1) %>%
  dplyr::select(make, model, drive, year, highway08)
```
  
**Question 2**

Among 8-cylinder vehicles with Rear-Wheel Drive, what make/model has the 
worst city fuel economy in 2019?

```{r Q2}
df_mpg %>%
  dplyr::filter(cyl == 8,
                drive == "Rear-Wheel Drive",
                year == 2019) %>%
  dplyr::slice_min(order_by = city08) %>%
  dplyr::select(make, model, drive, year, city08)
```

```{r Q2a, echo=FALSE, out.width="450pt", fig.align="center", fig.cap="The Bentley Mulsanne: $330k lets you park it on the sidewalk!"}
knitr::include_graphics("./images/Bentley_Mulsanne_2019.jpg")
```

**Question 3**

Among 2021 vehicles from the `tidy_mpg` data frame, rank the highest-performing vehicle from each manufacturer in terms of fuel efficiency (`mpg`).  Pass the search result into a table using the `kable()` function.

```{r Q3}
tidy_mpg %>%
  filter(year == 2021) %>%
  group_by(make) %>%
  slice_max(mpg, n =1,  with_ties = FALSE) %>%
  select(make, model, mpg) %>%
  arrange(-mpg) %>%
  kable(format = "html")

```

## Chapter 7 Homework

You will continue to explore data from the National Science Foundation 2017
[National Survey of Graduates](https://www.nsf.gov/statistics/srvygrads/){target="_blank"}, which were introduced in an earlier chapter.

From Canvas, download the R Markdown template and the reduced dataset (.csv)
containing salary, age, and gender information. Your knitted submission is due
at the start of the class for the exam review.