-
Notifications
You must be signed in to change notification settings - Fork 0
/
07-multivar_eda.Rmd
executable file
·915 lines (773 loc) · 40.6 KB
/
07-multivar_eda.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
# Multivariate Data Exploration {#eda2}
```{r eda2, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center')
library(tidyverse)
library(knitr)
library(scales)
library(GGally)
library(kableExtra)
library(gridExtra)
```
## Chapter 7 Objectives
This Chapter is designed around the following learning objectives. Upon
completion, you should be able to:
- Understand the use and context of bivariate data
- Recognize the motivation behind a scatterplot from different perspectives
- Define correlation, causation, and the difference
- Create and interpret a scatterplot between two variables
- Leverage `ggplot2` for multivariate exploratory data analysis through facets and colors
## Bivariate Data {#bivariate}
Whereas univariate data analyses are directed at "getting to know" the
observations made for a single variable, bivariate---and multivariate---
analyses are designed to examine the *relationship* that may exist between two
(or more) variables. Like the Chapter on [Univariate EDA]({#eda1), we will
focus first on ***data exploration***, which is a key step towards "getting to
know" your data and one that should always proceed inferential statistics, or
*making conclusions about your data*.
```{block, type="rmdnote"}
Bivariate means *two variables* where the observations are paired; each
observation samples both variables so that they are linked.
```
## Scatterplot {#scatt}
Undoubtedly, you have seen scatterplots many times before; we will discuss them
in more detail here. The **scatterplot** allows you to assess the strength,
direction, and type of relationship between two variables. This can be
important for determining factors like:
* Correlation
* Linearity
* Performance (of a measurement) in terms of precision, bias, and dynamic range
Traditionally, a scatterplot shows paired observations of two variables with
the ***dependent variable*** on the y-axis and the ***independent variable***
on the x-axis. Creating a plot in this way means that, before you begin, you
must make a judgment call about the direction of the relationship (i.e., which
variable *depends* on which). This relates to the scientific method and
*linear regression*; we will discuss the latter in more detail later. For the
purposes of exploratory data analysis, however, it actually doesn't matter
which variable goes on which axis. That said, since we don't wish to break with
tradition, let's agree to follow the guidelines on independent (predictor) and
dependent (outcome/response) variables. Although related, the researcher's
purpose or "mode" affects the intention behind a plot:
- **Scientific Method:** The experimenter manipulates the control variable (independent; on x-axis) and observes the changes in the response variable (dependent; y-axis).
- **Statistics:** The independent variable (x-axis) is thought to have some predictive ability, correlation with, or control over the dependent variable (y-axis).
- **Exploratory Data Analysis:** We throw two variables on a plot to investigate their relationship. We make a guess about which is the independent variable (x-axis) and which is the dependent variable (y-axis), and we hope that nobody calls us out if we got it wrong...
Scatterplots are created using `geom_point()` by naming `x = ` and `y = ` variables within the `aes()` function.
### Causality
All this talk about dependent and independent variables is fundamentally rooted
in the practice of ***causal inference*** reasoning, which is the ability to
say that "action A *caused* outcome B". Discovering---*or proving*---that one
thing caused another to happen can be incredibly powerful. Proving causality
leads to Nobel Prizes, creation of new laws and regulations, judgment of guilt
or innocence in court, changing human behavior and convincing human minds, and,
simply put, more understanding.
A full treatment of causal inference reasoning is beyond the scope of this
course, but we will, from time to time, delve into this topic. The art of data
science can be a beautiful and compelling way to demonstrate causality, but we
need to learn to crawl before we can walk, run, or fly. For now, let's put
aside the pursuit of causation and begin with ***correlation***.
### Correlation {#corr}
The ***scatterplot*** is a great way to visualize *whether*, and, to some
extent, *how*, two variables are related to each other.
```{block, type="rmdnote"}
Correlation: A mutual relationship or connection between two or more things;
the process of establishing a relationship or connection between two or more
measures. The variables can move up or down together or be inversely related.
```
```{r corr_plots1, warning=FALSE, echo=FALSE, include=FALSE}
# set seed to repeat values
set.seed(7)
# generate data for corr plots
corr_plots1 <- tibble::tibble(x = runif(100, 0, 100),
y1 = x,
y2 = purrr::map_dbl(x,
~ (.x*(rnorm(1,
mean = 1,
sd = 0.06)) +
rnorm(1,
mean = 0,
sd = 5))),
y3 = purrr::map_dbl(x,
~ (.x*(rnorm(1,
mean = 0.6,
sd = 0.2)) +
rnorm(1,
mean = 15,
sd = 15))),
y4 = runif(100, 0, 105)) %>%
tidyr::pivot_longer(y1:y4, names_to = "type") %>%
dplyr::mutate(type = as.factor(type))
# label types of correlation
levels(corr_plots1$type) <- c("Perfect Correlation",
"Strong Correlation",
"Moderate Correlation",
"No Correlation")
# create corr plot panel
p1 <- ggplot2::ggplot(data = corr_plots1,
mapping = aes(x = x,
y = value)) +
geom_point(aes(color = type),
alpha = 0.5) +
facet_wrap(facets = vars(type),
ncol = 2) +
theme_bw() +
coord_fixed(ratio = 1.0) +
labs(x = "Independent Variable (x-axis)",
y = "Dependent Variable (y-axis)") +
theme(legend.position = "none")
# save plot to embed
ggsave("./images/correlation_example_1.png", dpi = 150)
```
Below are four examples of bivariate data (shown in scatterplots) with differing degrees of
correlation: perfect, strong, moderate, and none. These are qualitative terms,
of course; what is "moderate" to one person may be poor and unacceptable to
another. The qualitative strength of the correlation also depends on the
research context.
If you want to understand how to assess the strength of correlation
quantitatively, you can explore the Pearson Correlation Coefficient (***r***)
in the [Appendix](#pearson), which is used to quantify the degree of *linear*
correlation between two variables.
```{r corr-example-1, echo=FALSE, out.width="700pt", fig.align="center", fig.cap="Scatterplot examples showing bivariate data with varying degrees of correlation."}
knitr::include_graphics("./images/correlation_example_1.png")
```
```{r corr-plots2, warning=FALSE, echo=FALSE, include=FALSE}
# generate data for corr strength plots
corr_plots2 <- tibble::tibble(x = runif(100, 0, 100),
y1 = purrr::map_dbl(x,
~ (.x*(rnorm(1,
mean = .8,
sd = 0.15)) +
rnorm(1,
mean = 10,
sd = 10))),
y2 = purrr::map_dbl(x,
~ -(.x*(rnorm(1,
mean = 0.6,
sd = 0.06)) -
rnorm(1,
mean = 70,
sd = 5))),
y3 = purrr::map_dbl(x,
~ (.x*(rnorm(1,
mean = 1,
sd = 0.02))^2 +
rnorm(1,
mean = 0,
sd = 2))),
y4 = purrr::map_dbl(x,
~ (100*exp(-.x/20)*
(rnorm(1,
mean = 1,
sd = 0.03)) +
rnorm(1,
mean = 4,
sd = 2)))
) %>%
tidyr::pivot_longer(y1:y4, names_to = "type") %>%
dplyr::mutate(types = as.factor(type))
# label types of correlation
levels(corr_plots2$types) <- c("Positive Correlation",
"Negative Correlation",
"Linear Correlation",
"Non-Linear Correlation")
# create corr plot panel
p2 <- ggplot2::ggplot(data = corr_plots2,
mapping = aes(x = x,
y = value)) +
geom_point(aes(color = types,
alpha = 0.5)) +
facet_wrap(facets = vars(types)) +
theme_bw() +
coord_fixed(ratio = 1) +
theme(legend.position = "none") +
labs(x = "Independent Variable",
y = "Dependent Variable")
# save plot to embed
ggsave("./images/correlation_example_2.png", dpi = 150)
```
In addition to the strength of the correlation, the sign and form of the
correlation can vary, too:
- **positive correlation**: the dependent variable *trends in the same direction* as the independent variable; when `y` increases, `x` increases, too.
- **negative correlation**: the dependent variable *decreases* when the independent variable *increases*
- **linear correlation**: the relationship between the two variables can be shown with a straight line
- **non-linear correlation**: the relationship between the two variables is curvilinear
```{r corr-example-2, echo=FALSE, out.width="700pt", fig.align="center", fig.cap="Scatterplot examples showing bivariate data with varying types of correlation."}
knitr::include_graphics("./images/correlation_example_2.png")
```
### Correlation $\neq$ causation
Did you know that being a smoker is correlated with having a lighter in your
pocket? Furthermore, it can be shown that keeping a lighter in your pocket is
correlated with an increased risk of developing heart disease and lung cancer.
Does this mean lighters in your pocket *cause* lung cancer?
```{block, type="rmdnote"}
Causation: the process or condition by which one event (cause) contributes to
the occurrence of another event (effect). In this process, the cause is partly
or wholly responsible for the effect.
```
Let's take a closer look at the dangers of mistaking a *correlated*
relationship as a *causal* relationship between two variables. Shown below is a
scatterplot that builds off the `mpg` dataset we first discussed in Chapter
\@ref(dataviz). Using the `mpg` dataframe, we will plot the relationship
between the number of cylinders in an engine (`cyl`; the independent variable)
and that vehicle's fuel economy (`hwy`; the dependent variable).
```{r corr-example-3, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Scatterplot of Engine Displacement vs. Fuel Economy"}
# create scatterplot of engine displacement vs fuel economy
ggplot2::ggplot(data = (mpg),
mapping = aes(x = cyl,
y = hwy)) +
geom_jitter(aes(x = cyl,
y = hwy),
shape = 1,
color = "blue",
width = 0.1,
height = 0.05,
size = 2) +
geom_smooth(method = "lm",
se = FALSE,
size = 0.5,
color = "blue") +
xlab("Number of Cylinders") +
ylab("Highway Fuel Economy (mi/gal)") +
theme_classic()
```
Looking at this plot, there appears a clear correlation between the number of
cylinders in a vehicle and its fuel efficiency. A linear fit through these data
gives a Pearson correlation coefficient of -0.76, which is not a perfect
relationship but a strong one, nonetheless. Does this mean that a
*causal relationship* exists? If so, then we only need to mandate that all
future vehicles on the road be built with 4-cylinder engines, if we want more a
fuel-efficient fleet! That mandate, of course, would likely produce minimal
effect. Just because two variables are correlated doesn't mean that a change in
one will ***cause*** a change in the other.
Those who understand internal combustion know that the number of cylinders is a
design parameter related more to engine power than to engine efficiency. In
other words, the number of cylinders helps determine total displacement
volume. Indeed, the causal relationship for fuel efficiency, in terms of miles
traveled per gallon, is due more directly to the energy conversion efficiency of the engine,
vehicle drag coefficient, and vehicle mass. If you want more fuel-efficient
cars and trucks, you need more efficient engines that weigh less. In the 1990s
and early 2000s nearly all engine blocks were made from cast iron. Today,
nearly all engine blocks are made from aluminum. Can you guess why?
## Exploring Multivariate Data
With *multivariate data*, we often consider more than just two variables;
however, visualizing more than two variables in a single plot can be
challenging. There are advanced statistical approaches to exploring such data,
including multivariate regression, principal component analysis, and machine
learning approaches, but these techniques are beyond the scope of this course.
Here, I will introduce a few graphical techniques that are useful for exploring
multivariate data.
### Facets
One easy way to evaluate two or more variables is to create multiple plots (or
facets) with the `ggplot2::facet()` function. This function creates a series of
plots, as panels, where each panel represents a different value (or level) of a
third variable of interest. For example, let's create a `ggplot` object from
the `mtcars` data set that explores the relationship between a vehicle's fuel
economy and its weight. First, let's create a simple bivariate scatterplot of
these data (`mpg` vs. `wt`) and fit a linear model through the data. (Note: we
haven't discussed modeling yet but more on that [later](#model)).
```{r facet-1, message=FALSE, warning=FALSE, fig.cap="Scatterplot of fuel economy vs. vehicle weight from the `mtcars` dataset."}
# fit a linear model; the ~ means we are modeling mpg as "y" and wt as "x"
g1_model <- lm(mpg ~ wt, data = mtcars)
# create a plot, assign it to an object named `g1`
g1 <- ggplot2::ggplot(data = mtcars,
mapping = aes(x = wt,
y = mpg)) +
geom_point() +
geom_smooth(model = g1_model,
method = "lm") +
ylab("Fuel Economy (mi/gal)") +
xlab("Vehicle Weight (x1000 lb)") +
theme_bw(base_size = 14)
g1
```
Looking back at Figure \@ref(fig:corr-example-3), we know that the number of
cylinders (`cyl`) is also associated with fuel efficiency, and many of the
vehicles from `mpg` have different `cyl` numbers. To examine these three
variables together (`mpg`, `wt`, and `cyl`), we can create a scatterplot that
is *faceted* according to the `cyl` variable. This is relatively easy to do in
`ggplot2` by adding a `facet_grid()` layer onto our ggplot object. The key
details to pass to `facet_grid()` are:
1. Whether we want to see the facets as **rows** or **columns**, and
2. The variable being used to create the facets.
These two specifications can be made as a single argument to `facet_grid()` in
the form:
- `facet_grid(rows = vars(variable))` *or*
- `facet_grid(cols = vars(variable))`,
where *variable* is the name of the column vector used to define the facets.
Note: `vars()` is a `ggplot2::` function that allows you to specify a column variable by name (instead of using `data.frame$variable`) to define the facets, similar to how we define variables in the `aes()` function.
In this case, seeing the plots in columns seems fine, so we would add
`facet_grid(cols = vars(cyl)` to the `ggplot` object `g1` as follows:
```{r facet-2, message=FALSE, warning=FALSE, fig.cap="Scatterplots of fuel economy vs. vehicle weight by number of cylinders in the engine (data from the `mtcars` dataset)."}
# facet previous plot by `cyl` columns and retain labels
g1 + facet_grid(cols = vars(cyl),
labeller = label_both) # label each panel w/ variable name & value
```
Interestingly, but perhaps not surprising, we can see that the vehicles with
different cylinder numbers tend to have different fuel efficiency, but, even
within these facets, we still see a relationship between efficiency and vehicle
weight. Note that because the original `ggplot` object (`g1`) contained a linear model, the faceting call led to the creation of three (separate) linear models---one for each facet.
Here are the same data in a plot that is faceted by rows instead of columns. Take note how `ggplot2::` allows you add the facet to the original plot object using a
`+`, as in: `g1 + facet_grid(...)`
```{r facet-3, message=FALSE, warning=FALSE, fig.cap="Scatterplots of fuel economy vs. vehicle weight by number of cylinders in the engine (data from the `mtcars` dataset)."}
# facet previous plot by`cyl` in rows and retain labels
g1 + facet_grid(rows = vars(cyl),
labeller = label_both)
```
### Colors
We can also use color to indicate variation in data; this can be useful for
introducing a third variable into scatter, jitter, and time-series plots (or when plotting multiple boxplots, histograms, or cumulative distributions). When
introducing **color as a variable** into a plot, you must do so through an
*aesthetic*, such as `geom_point(aes(color = cyl))`.
Let's recreate Figure \@ref(fig:facet-1) and highlight the `cyl` variable using
different colors. The addition of color provides us with the same level of
insight as the facets above.
```{r colorplot-1, message=FALSE, warning=FALSE, fig.cap="Vehicle fuel economy vs. weight and colored by number of engine cylinders (data from mtcars)"}
# instruct R to treat the `cyl` variable as a factor with discrete levels
# this, in turn, tells ggplot2 to assign discrete colors to each level
# `cyl` as a factor with four levels
mtcars$cyl <- as.factor(mtcars$cyl)
# recreate previous plot with color option
g3 <- ggplot2::ggplot(data = mtcars,
mapping = aes(x = wt,
y = mpg,
color = cyl)) +
geom_point() +
ylab("Fuel Economy (mi/gal)") +
xlab("Vehicle Weight (x1000 lb)") +
theme_bw(base_size = 14)
# call plot
g3
```
```{block, type="rmdwarning"}
When using color, be aware that many people are unable to distinguish red from
green or blue from yellow. Many options exist to avoid issues from color
blindness (e.g., `viridis` palette) and websites like
[color-blindness.com](https://www.color-blindness.com/coblis-color-blindness-simulator/){target="_blank"}
allow you to upload image files so that you can see what your plot looks like to someone with color blindness.
```
Here is an updated version of Figure \@ref(fig:colorplot-1) that avoids issues
with color blindness and, better yet, differentiates the `cyl` variable with
both colors and symbols.
```{r colorplot-2, message=FALSE, warning=FALSE, fig.cap="Vehicle fuel economy vs. weight and colored by number of engine cylinders (data from mtcars)"}
# recreate previous plot with color-blind-friendly colors and shapes
ggplot2::ggplot(data = mtcars,
mapping = aes(x = wt,
y = mpg,
color = cyl,
shape = cyl)) + # distinguish by shape also!
geom_point(size = 2.5) +
ylab("Fuel Economy (mi/gal)") +
xlab("Vehicle Weight (x1000 lb)") +
scale_colour_manual(values = c("sandybrown", # color-blind-friendly colors
"orangered",
"steelblue2")) +
theme_bw(base_size = 14)
```
```{block, type="rmdtip"}
Whenever you use **color** to differentiate variables, it's a good idea to use symbols, too, if possible.
```
## Chapter 7 Exercises
### Fuel economy data
This in-class exercise will conduct an exploratory, multivariate data
analysis on vehicle fuel economy. We will begin by downloading a .zip file from
[fueleconomy.gov](https://www.fueleconomy.gov/feg/ws/){target="_blank"}, which
is a Federal program that tracks the fuel economy of all vehicles sold in the
United States. The .zip file contains a .csv with fuel economy information for
nearly every vehicle manufactured between 1984 and today. We will use the
`readr` and `dplyr` packages to load and clean the data, respectively. A data
dictionary (something that defines and explains each variable in the dataset)
is also available at the website above.
### Import and wrangle data
The first code chunk will download the data directly into a temporary file
using `download.file()` from base R. We will then `unzip()` (base R) the temp
file into a .csv and use `readr` to read that .csv into a dataframe named
`raw_data`.
```{r load-data, message=FALSE, cache=FALSE, warning=FALSE}
# create an empty temporary object to hold the zipped data
temp <- base::tempfile()
# download the file into temp object
utils::download.file(
url = "https://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
destfile = temp,
mode="wb")
# unzip the folder within temp object to acess csv
temp2 <- utils::unzip(temp,
"vehicles.csv",
exdir = "./data/") # unzip .csv to local directory
# import csv into df object
raw_data <- readr::read_csv(temp2) #read the csv into a data frame
# delete the temp file
base::unlink(temp)
# remove the two temp objects from local environment
base::rm(temp, temp2)
```
**Public Service Announcement**: Notice the use of `rm()` from base R in the
above code. This is an example of a recommended use case of this function;
do not EVER use this function to "restart" your R session. If you do, [Jenny
Bryan will find you and throw your computer out the window](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/){target="_blank"}.
Looking at the `raw_data` dataframe, we see there are 83 variables with over
45,000 observations. That's a LOT of vehicles! In most analyses of large
datasets, we don't need to inspect every variable. Let's create a vector of
variables (`vars_needed`) that we want to keep and pass that vector to
`dplyr::select()` to retain only the variables we want. To pass a character
vector as an argument to `dplyr::select()`, instead of just a single column
name, we use the `all_of()` function, which is an argument modifier from the
`tidyselect` R package. You can type `?tidyselect::all_of` in the R console to
learn more. Essentially, `tidyselect::all_of()` tells `dplyr::select()` to
expect a character vector of column names to retain in the datset.
```{r fuel-econ-cleaning-1}
# create a vector of var names to retain
vars_needed <- c("id",
"make",
"model",
"year",
"cylinders",
"displ",
"drive",
"trany",
"VClass",
"fuelType1",
"comb08",
"highway08",
"city08")
# select necessary variables
df_mpg <- raw_data %>%
dplyr::select(tidyselect::all_of(vars_needed))
# remove full dataframe from environment
rm(raw_data)
```
As you have seen before, some of these variables can be coded as **factors**,
which are *categorical* variables that can be classified into discrete levels.
For example, there are a finite number of vehicle transmission (`trany`) or
drivetrain (`drive`) types on the market, and, by telling R to code these data
as factors, we can analyze these variables in categorical form.
First, we will create a vector of variable names that we want to code as
factors, `vars_factr`. Then we will apply the `as.factor()` function to those
variables using `dplyr::mutate(dplyr::across())`. The `across()` function from
`dplyr` allows one to apply the same transformation to multiple columns in a
dataframe and is similar to how we used `tidyselect::all_of()` in the previous
example. We will also take the opportunity to rename a few of these variables,
following our naming guidelines discussed in earlier chapters, and to filter
the data to retain only vehicles made after the year 2000.
```{r fuel-econ-cleaning-2}
# identify the columns that we want to convert to a factor
vars_factr <- c("make", "drive", "trany", "VClass", "fuelType1")
# overwrite existing df
df_mpg <- df_mpg %>%
# convert select vars to factors
dplyr::mutate(dplyr::across(tidyselect::all_of(vars_factr),
.fns = as.factor)) %>%
# overwrite column names with simpler names
dplyr::rename(fuel_type = fuelType1,
cyl = cylinders,
tran = trany,
v_class = VClass) %>% # easier string to type
# keep only data collected after 2000 for the sake of simplicity
dplyr::filter(year >= 2000)
# remove previous objects from global environment
rm(vars_needed, vars_factr)
```
### Check data
Begin as we always do, by simply looking at some of the data.
```{r df-mpg-desc-1}
# examine first five rows of df
head(df_mpg)
```
Next, let's take a look at some of the factor levels. There are lots of ways
to do this in R, but the `levels()` function is the most straightforward.
```{r df-mpg-desc-2}
# print a character vector of levels for the drive variable in df_mpg
levels(df_mpg$drive)
```
```{r df-mpg-desc-3}
# print a character vector of levels for the fuel_type variable in df_mpg
levels(df_mpg$fuel_type)
```
```{r df-mpg-desc-4}
# print a character vector of levels for the v_class variable in df_mpg
levels(df_mpg$v_class)
```
### Check for missing data
Next, let's see whether this dataframe contains missing data (`NA`s).
``` {r mpg-na}
# count instances of missing values within df_mpg
sum(is.na(df_mpg))
# note this can work as a pipe, too
# df_mpg %>% is.na() %>% sum()
```
With a dataframe of this size, we shouldn't be surprised that there are
`r sum(is.na(df_mpg))` `NA` values present. The next question is: where do
these `NA` values show up? There are several ways to answer this question;
here, we will use the `stats::complete.cases()` function with a
`dplyr::filter()` search.
#### Example 1: Missing Data
The `compelete.cases()` function returns a logical vector indicating
which rows are *complete* (i.e., no missing values). The opposite of this
logical function, `!complete.cases()`, should return ONLY those rows that do
contain `NA`s. Let's create a subset of `df_mpg` that only contains rows with
`NA` present.
```{r complete-cases-NA}
# return a df of rows with missing values
df_mpg %>%
filter(!complete.cases(.))
```
Here, we discover that most of the the `NA` values are in the `cyl`, `displ`,
and `tran`, columns. Further, we see that **all** of these vehicles have a
`fuel_type` of *Electricity*, which makes sense because electric vehicles (EVs) do not
have internal combustion. This may be a variable level that we choose to
exclude from certain analyses later...
**You might have noticed the "dot", `.`, used at the end of the `dplyr::filter` pipe above:**
`df_mpg %>% filter(!complete.cases(.))`
*Why is there a `.` as an argument to a function?*
We use `.` as a [*placeholder argument*](https://magrittr.tidyverse.org/reference/pipe.html#arguments) when more than one function is nested within a single section of pipe (`%>%`). The `complete.cases()` function requires an argument to run, so we must tell it to operate on the same dataframe on which `filter()` operates. The `.` accomplishes this need for an argument. In other words, these three lines of code are equivalent:
```{r magrittr-dot, eval=FALSE}
df_mpg %>% filter(!complete.cases(.))
df_mpg %>% filter(!complete.cases(df_mpg))
filter(df_mpg, !complete.cases(df_mpg))
```
***Note:*** the `.` only works as a placeholder argument within a `%>%` call. The following code will throw an error because there is no pipe present:
```{r bad-code, eval=FALSE}
#this code will throw an error
filter(df_mpg, !complete.cases(.))
```
#### Example 2: Missing Data
Filter the `df_mpg` data for *any variables* (`dplyr::any_vars()`) that contain
`NA`. Like `tidyselect::all_of()` that was used to clean the dataframe above,
the `any_vars()` function is a helper function designed for use within `dplyr`
and `tidyr` verbs.
<br>*A word to the wise*: `dplyr::any_vars()` and similar variants
have been **superseded**, which means new updates are not being made to them, and
the developers will eventually encourage users to transition to using
`dplyr::across()`. If you are interested, take a look at [this discussion thread](https://community.rstudio.com/t/using-filter-with-across-to-keep-all-rows-of-a-data-frame-that-include-a-missing-value-for-any-variable/68442)
on the topic. You can also type `vignette("colwise")` into the console to introduce
yourself to the `across()` function. In the code below, I provide examples of the
*superseded* code and the more current (updated) code.
```{r df-mpg-desc-6}
# (superseded) filter the df for rows where any of the observations contain NA
# df_mpg %>%
# dplyr::filter_all(dplyr::any_vars(is.na(.)))
# (updated) example to filter the df for rows containing NA
df_mpg %>%
filter(if_any(.cols = everything(),
.fns = ~ is.na(.x)))
```
#### Example 3: Missing Data
Note: In [Chapter 8](#rprog4), you will learn to "map" the `sum()` and `is.na()` functions to each column of the data frame using `map_dfc` from the `purrr` package, which is designed to apply one ore more functions across columns of a data frame. This approach is the recommended way and I show it mainly as a preview of things to come in [Chapter 8](#rprog4).
``` {r df-mpg-desc-5, eval=FALSE}
# preview of code we will learn in Chapter 8
df_mpg %>% purrr::map_dfc(~sum(is.na(.)))
```
### Visualize data
#### Tidy data
Our primary variable of interest is fuel economy, so let's begin by visualizing
the location, dispersion, and shape of `city08`, `highway08`, and `comb08`,
which represent city, highway and combined fuel economy estimates. Note that
`df_mpg` is not in a tidy format, as fuel economy data is presented in three
different columns. We'll fix that with `tidyr::pivot_longer()`.
```{r tidy-mpg}
# before visualizing the data, convert to tidy format
tidy_mpg <- df_mpg %>%
# choose simple names
dplyr::rename(city = city08,
hwy = highway08,
comb = comb08) %>%
# change data structure from wide to long
tidyr::pivot_longer(cols = c("city", "hwy", "comb"),
names_to = "metric",
values_to = "mpg") %>%
# switch var to factor
dplyr::mutate(metric = as.factor(metric))
```
#### Multivariate EDA options
With the data properly tidy, we can use our basic EDA plots with `stat_ecdf()`,
`geom_boxplot()`, and `geom_histogram()`, to visualize the fuel economy data.
To show the different variables within the same plot, we use `color = `
or `fill = ` as an extra aesthetic. All three plots are then laid out using
`gridExtra::grid.arrange()`.
```{r tidy-mpg-plot, fig.cap="Cumulative Distribution, Histogram, and Boxplots of 21st Centurty Vehicle Fuel Efficiency.", fig.align="center", fig.height=7}
# create cumulative distribution function plot
ecdf <- ggplot2::ggplot(data = tidy_mpg,
mapping = aes(x = mpg,
color = metric)) +
stat_ecdf() +
theme_bw() +
scale_x_log10() +
xlab(NULL) +
ylab("Quantile")
# create boxplot
box <- ggplot2::ggplot(data = tidy_mpg,
mapping = aes(x = mpg,
fill = metric,
y = metric)) +
geom_boxplot(outlier.alpha = 0.05) +
theme_bw() +
scale_x_log10() +
xlab(NULL) +
ylab("Metric")
# create histogram
hist <- ggplot2::ggplot(data = tidy_mpg,
mapping = aes(x = mpg,
fill = metric)) +
geom_histogram(bins = 35,
alpha = 0.75,
position = "stack") +
theme_bw() +
scale_x_log10() +
xlab("Fuel Economy, mi/gal") +
ylab("Counts")
# embed plots into one figure
gridExtra::grid.arrange(ecdf, box, hist,
widths = c(0.4,1,0.4),
layout_matrix = rbind(c(NA, 1, NA),
c(NA, 2, NA),
c(NA, 3, NA)))
```
### Multivariate time series
Next, let's look at a rough time series (by year) of all the combined fuel
economy values, `comb08`, for all vehicle observations. The fuel economy data
will be shown with boxplots, and we will use `group = year` as an aesthetic to
show the overall time series. We will use a log-scale y-axis due to the large
variation expected, and outliers will be made more transparent to soften their
effect.
```{r fuel-econ-ts, warning=FALSE, fig.cap="Boxplots of fleet-wide fuel efficiency by year.", fig.align="center"}
# create time series plot
e1 <- ggplot2::ggplot(data = df_mpg,
mapping = aes(x = year,
y = comb08)) +
geom_boxplot(aes(group = year),
fill = "skyblue",
outlier.alpha = 0.1) +
scale_y_log10(limits = c(10,150)) +
ylab("Combined Fuel Efficiency (mi/gal)") +
xlab("") +
theme_bw(base_size = 14)
# call plot
e1
```
#### Consider context
[This Department of Energy website](https://afdc.energy.gov/data/10562) outlines
the *Corporate Average Fuel Economy* (CAFE) standards from the Environmental
Protection Agency that require vehicles to meet set fuel economy levels, in
terms of miles per gallon, or "mpg", across the "fleet" of available vehicles.
Let's load a .csv file named `cafe` and look at the requirements by year.
```{r CAFE-stds, message = FALSE}
# import data
cafe <- read_csv("./data/CAFE_stds.csv", col_names = c("year", "mpg_avg"), skip = 1)
# plot cafe data
cafe.plot <- ggplot2::ggplot(data = cafe,
mapping = aes(group = year,
y = mpg_avg,
x = year)) +
geom_col(fill = "maroon") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1)) +
labs( y = "Required Average Fuel Efficiency (mpg)",
x = "Year",
title = "Federal Combined Average Fuel Economy (CAFE) Standards")
# call cafe plot
cafe.plot
```
This is a nice
[explanatory story](https://www.edmunds.com/fuel-economy/faq-new-corporate-average-fuel-economy-standards.html)
of why the average fuel economy numbers don't match well to the published CAFE
standards.
### Density plots
A ***density plot*** is akin to a *smoothed histogram*, where the y-axis depicts the frequency (or probability of occurrence) and the x-axis represents the magnitude of the observed variable. They are called though `ggplot2::geom_density` and `ggplot2::stat_density`. Density plots are useful when you have a lot of data (typically hundreds to thousands of observations) because they allow you to visualize the **shape** of a distribution - namely, it's central tendency(ies), dispersion, and skew.
Create a series of geom_density plots showing combined fuel economy `comb08`
across all vehicles and years as a function of `fuel_type`. Break out the
different `fuel_type` categories into facets (ncol = 3).
```{r eda-fuel-1}
# density plot of combined fuel economy by fuel type with facets
g1 <- ggplot2::ggplot(data = df_mpg) +
geom_density(aes(x = comb08, # notice the different location option for x aes
fill = fuel_type)) +
facet_wrap(~ fuel_type,
ncol = 3) +
scale_x_log10() +
theme_bw(base_size = 14) +
theme(legend.position = "none")
# call plot
g1
```
Now show the same plot without facets.
```{r eda-fuel-2}
# density plot of combined fuel economy by fuel type without facets
g2 <- ggplot2::ggplot(data = df_mpg,
mapping = aes(x = comb08)) + # now x aes is here
geom_density(aes(fill = fuel_type),
position = "identity",
alpha = 0.6,
adjust = 1) +
scale_x_log10() +
theme_bw()
# call plot
g2
```
I like these plots because they lead to more questions (and that's the *point* of exploratory data analysis)! Why are some of the categories bimodal? Why do many natural gas vehicles tend to have the lowest fuel efficiency? Are any of these vehicles tested on multiple fuel types? What is the most fuel efficient vehicle sold today?
### More multivariate plotting
Next, let's examine the effect of vehicle class on combined fuel economy for a
single year: 2023. Note, for the plot below, since *vehicle class* (`v_class`) is a `factor` type of variable, I can structure the plot to show `v_class` from lowest median efficiency to highest median efficiency (where the median is taken from each class based on the `highway_08` variable). This is accomplished with the `fct_reorder()` function from the `forcats::` package: `forcats::fct_reorder(.f = v_class, .x = highway, .fun = median)`.
```{r v-class, fig.cap="Highway Fuel Economy for 2023 Vehicles by Type", fig.align="center"}
# filter to year 2023 and reorder factor levels
df_mpg.2023 <- df_mpg %>%
dplyr::filter(year == 2023) %>%
dplyr::mutate(v_class = forcats::fct_reorder(.f = v_class,
.x = highway08,
.fun = median))
# create plot of 2023 combined fuel economy
g1 <- ggplot2::ggplot(data = df_mpg.2023) +
geom_boxplot(aes(x = highway08,
color = v_class,
y = v_class)) +
scale_x_log10() +
theme_bw() +
ylab("") +
xlab("Highway Fuel Economy") +
theme(legend.position = "none")
# call plot
g1
```
### Practice questions
Finally, let's ask some simple questions and use some basic data wrangling to
get the answers.
**Question 1**
Among 4-cylinder vehicles with Front-Wheel Drive, what make/model has the best highway fuel economy in 2018?
```{r Q1}
df_mpg %>%
dplyr::filter(cyl == 4,
drive == "Front-Wheel Drive",
year == 2018) %>%
dplyr::slice_max(order_by = highway08,
n = 1) %>%
dplyr::select(make, model, drive, year, highway08)
```
**Question 2**
Among 8-cylinder vehicles with Rear-Wheel Drive, what make/model has the
worst city fuel economy in 2019?
```{r Q2}
df_mpg %>%
dplyr::filter(cyl == 8,
drive == "Rear-Wheel Drive",
year == 2019) %>%
dplyr::slice_min(order_by = city08) %>%
dplyr::select(make, model, drive, year, city08)
```
```{r Q2a, echo=FALSE, out.width="450pt", fig.align="center", fig.cap="The Bentley Mulsanne: $330k lets you park it on the sidewalk!"}
knitr::include_graphics("./images/Bentley_Mulsanne_2019.jpg")
```
**Question 3**
Among 2021 vehicles from the `tidy_mpg` data frame, rank the highest-performing vehicle from each manufacturer in terms of fuel efficiency (`mpg`). Pass the search result into a table using the `kable()` function.
```{r Q3}
tidy_mpg %>%
filter(year == 2021) %>%
group_by(make) %>%
slice_max(mpg, n =1, with_ties = FALSE) %>%
select(make, model, mpg) %>%
arrange(-mpg) %>%
kable(format = "html")
```
## Chapter 7 Homework
You will continue to explore data from the National Science Foundation 2017
[National Survey of Graduates](https://www.nsf.gov/statistics/srvygrads/){target="_blank"}, which were introduced in an earlier chapter.
From Canvas, download the R Markdown template and the reduced dataset (.csv)
containing salary, age, and gender information. Your knitted submission is due
at the start of the class for the exam review.