This repository has been archived by the owner on Dec 28, 2023. It is now read-only.
forked from moderndive/ModernDive_book
# (PART) Data Science via the tidyverse {-}
# Data Visualization via ggplot2 {#viz}
```{r setup-viz, include=FALSE, purl=FALSE}
chap <- 3
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC
# solutions, which are written like markdown text. In theory, it shouldn't be
# necessary for knitr versions <=1.11.6, but I've found I still need to for
# everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL"
# (to show all solutions), or subsets of c('3-2', '3-3',
# '3-4','3-5','3-6','3-7'), including the null vector c('') to show no
# solutions.
# solutions_shown <- c('3-1', '3-2', '3-3','3-4','3-5','3-6', '3-7', '3-8', '3-9',
# '3-10', '3-11', '3-12', '3-13', '3-14')
solutions_shown <- c('')
show_solutions <- function(section){
  # Return a single TRUE/FALSE even when solutions_shown has several entries
  return("ALL" %in% solutions_shown || section %in% solutions_shown)
}
```
We begin the development of your data science toolbox with data visualization. By visualizing our data, we will be able to gain valuable insights that we couldn't initially see from just looking at the raw data in spreadsheet form. We will use the `ggplot2` package as it provides an easy way to customize your plots and is rooted in the data visualization theory known as _The Grammar of Graphics_ [@wilkinson2005].
At the most basic level, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way for us to get a sense for how quantitative variables compare in terms of their center (where the values tend to be located) and their spread (how they vary around the center). The most important thing to know about graphics is that they should be created so that your audience can readily understand the findings and insights you want to get across. This does, however, require a balancing act. On the one hand, you want to highlight as many meaningful relationships and interesting findings as possible, but on the other you don't want to include so many as to overwhelm your audience.
As we will see, plots/graphics also help us to identify patterns and outliers in our data. We will see that a common extension of these ideas is to compare the *distribution* of one quantitative variable (i.e., what the spread of a variable looks like or how the variable is *distributed* in terms of its values) as we go across the levels of a different categorical variable.
### Needed packages {-}
Let's load all the packages needed for this chapter (this assumes you've already installed them). Read Section \@ref(packages) for information on how to install and load R packages.
```{r message=FALSE}
library(nycflights13)
library(ggplot2)
library(dplyr)
library(knitr)
```
```{r message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(gapminder)
library(knitr)
```
---
<!--Subsection on Grammar of Graphics -->
## The Grammar of Graphics {#grammarofgraphics}
We begin with a discussion of a theoretical framework for data visualization known as "The Grammar of Graphics," which serves as the basis for the `ggplot2` package. Much like how we construct sentences in any language by using a linguistic grammar (nouns, verbs, subjects, objects, etc.), the theoretical framework given by Leland Wilkinson [@wilkinson2005] allows us to specify the components of a statistical graphic.
### Components of the Grammar
In short, the grammar tells us that:
> **A statistical graphic is a mapping of `data` variables to `aes`thetic attributes of `geom`etric objects.**
Specifically, we can break a graphic into the following three essential components:
1. `data`: the data-set composed of the variables that we map.
1. `geom`: the geometric object in question. This refers to the type of object we can observe in our plot. For example, points, lines, bars, etc.
1. `aes`: aesthetic attributes of the geometric object that we can perceive on a graphic. For example, x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data-set.
Let's break down the grammar with an example.
### Gapminder {#gapminder}
```{r, echo=FALSE}
gapminder_2007 <- gapminder %>%
filter(year == 2007) %>%
select(-year) %>%
rename(
Country = country,
Continent = continent,
`Life Expectancy` = lifeExp,
`Population` = pop,
`GDP per Capita` = gdpPercap
)
```
In February 2006, a statistician named Hans Rosling gave a TED talk titled ["The best stats you've ever seen"](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) where he presented global economic, health, and development data from the website [gapminder.org](http://www.gapminder.org/tools/#_locale_id=en;&chart-type=bubbles). For example, from the `r nrow(gapminder_2007)` countries included from 2007, consider only the first 6 countries when listed alphabetically:
```{r, echo=FALSE}
gapminder_2007 %>%
head() %>%
kable(
digits=2,
caption = "Gapminder 2007 Data",
booktabs = TRUE
)
```
Each row in this table corresponds to a country in 2007. For each row, we have 5 columns:
1. **Country**: Name of country.
1. **Continent**: Which of the five continents the country is part of. (Note that `Americas` groups North and South America and that Antarctica is excluded here.)
1. **Life Expectancy**: Life expectancy in years.
1. **Population**: Number of people living in the country.
1. **GDP per Capita**: Gross domestic product per person (in inflation-adjusted US dollars).
Now consider Figure \@ref(fig:gapminder), which plots this data for all `r nrow(gapminder_2007)` countries in the data frame. Note that R will deal with large numbers using scientific notation. So in the legend for "Population", 1.25e+09 = $1.25 \times 10^{9}$ = 1,250,000,000 = 1.25 billion.
```{r gapminder, echo=FALSE, fig.cap="Life Expectancy over GDP per Capita in 2007"}
ggplot(data=gapminder_2007, aes(x=`GDP per Capita`, y=`Life Expectancy`, size=Population, col=Continent)) +
geom_point()
```
Let's view this plot through the grammar of graphics:
1. The `data` variable **GDP per Capita** gets mapped to the `x`-position `aes`thetic of the points.
1. The `data` variable **Life Expectancy** gets mapped to the `y`-position `aes`thetic of the points.
1. The `data` variable **Population** gets mapped to the `size` `aes`thetic of the points.
1. The `data` variable **Continent** gets mapped to the `color` `aes`thetic of the points.
Recall that `data` here means that all of these variables live in the same `data` frame, and that each "data variable" corresponds to one column of that data frame.
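The code behind the figure is hidden in this chapter, so here is a minimal sketch of how the four mappings listed above could be written. It assumes the `gapminder` package is installed and uses its original column names (`gdpPercap`, `lifeExp`, `pop`, `continent`) rather than the relabelled ones shown in the table. The object name `gapminder_2007_raw` is ours, chosen only for illustration.

```{r}
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the 2007 rows, giving one row per country
gapminder_2007_raw <- gapminder %>%
  filter(year == 2007)

# Each argument inside aes() maps one data variable to one aesthetic attribute;
# geom_point() says the geometric objects are points
ggplot(data = gapminder_2007_raw,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
  geom_point()
```

Reading the `aes()` call from left to right reproduces the four numbered mappings one by one.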
While in this example we are considering one type of `geom`etric object (of type `point`), graphics are not limited to just points. Some plots involve lines while others involve bars. Let's summarize the three essential components of the grammar in a table:
```{r, echo=FALSE}
map <- tibble(
`data variable` = c("GDP per Capita", "Life Expectancy", "Population", "Continent"),
aes = c("x", "y", "size", "color"),
geom = c("point", "point", "point", "point")
)
map %>%
kable(
caption = "Summary of Grammar of Graphics for this plot",
booktabs = TRUE
)
```
### Other components of the Grammar
There are other components of the Grammar of Graphics we can control. As you start to delve deeper into the Grammar of Graphics, you'll start to encounter these topics more and more often. In this book, we'll only work with the two other components below (the remaining components are left to a more advanced text such as [R for Data Science](http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings) [@rds2016]):
- `facet`ing breaks up a plot into small multiples corresponding to the levels of another variable (Section \@ref(facets))
- `position` adjustments for barplots (Section \@ref(geombar))
<!--
- `scales` that both
+ convert *data units* to *physical units* the computer can display. For example, apply a log-transformation on one of the axes to focus on multiplicative rather than additive changes.
+ draw a legend and/or axes, which provide an inverse mapping to make it possible to read the original data values from the graph.
- `coord`inate system for x/y values: typically `cartesian`, but can also be `map` or `polar`.
- `stat`istical transformations: this includes smoothing, binning values into a histogram, or no transformation at all (known as the `"identity"` transformation).
-->
In general, the Grammar of Graphics allows for a high degree of customization and also a consistent framework for easy updating/modification of plots.
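As a brief, hedged preview of the first of these two components, the sketch below splits a Gapminder scatterplot into one panel per continent using `facet_wrap()`. It assumes the `gapminder` package is installed; the object names `gapminder_2007_raw` and `facet_plot` are ours and are used only for illustration.

```{r}
library(gapminder)
library(dplyr)
library(ggplot2)

# 2007 data only, with the package's original column names
gapminder_2007_raw <- gapminder %>%
  filter(year == 2007)

# Same data and x/y mapping in every panel; facet_wrap() draws one
# small plot for each level of the continent variable
facet_plot <- ggplot(data = gapminder_2007_raw, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~ continent)
facet_plot
```

Note that the faceting is added as just another layer with `+`; the `data`, `aes`, and `geom` components are unchanged.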
### The ggplot2 package
In this book, we will be using the `ggplot2` package for data visualization, which is an implementation of the Grammar of Graphics for R [@R-ggplot2]. You may have noticed that a lot of the previous text in this chapter is written in computer font. This is because the various components of the Grammar of Graphics are specified in the `ggplot` function, which expects at a bare minimum as arguments:
* The data frame where the variables exist: the `data` argument
* The mapping of the variables to aesthetic attributes: the `mapping` argument, which specifies the `aes`thetic attributes involved
After we've specified these components, we then add *layers* to the plot. The most essential layer to add to a plot is the specification of which type of `geom`etric object we want the plot to involve; e.g. points, lines, bars. Other layers we can add include the specification of the plot title, axes labels, and visual themes for the plot.
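To make the idea of layers concrete, here is a rough sketch of the general shape such a call takes. The choice of the `weather` data frame from `nycflights13`, the variables `temp` and `dewp`, the label text, and the object name `layered_plot` are all our own assumptions for illustration, not examples used elsewhere in this chapter.

```{r}
library(nycflights13)
library(ggplot2)

layered_plot <- ggplot(data = weather, mapping = aes(x = temp, y = dewp)) + # data and aesthetic mapping
  geom_point() +                                   # layer: the geometric object
  labs(title = "Dew point versus temperature",     # layer: title and axis labels
       x = "Temperature (degrees F)",
       y = "Dew point (degrees F)") +
  theme_minimal()                                  # layer: a visual theme
layered_plot
```

Only the first two lines are essential; the `labs()` and `theme_minimal()` layers change how the plot looks, not what is plotted.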
Let's now put the theory of the Grammar of Graphics into practice.
<!--
The plot given above is not a histogram, but the output does show us a bit of what is going on with `ggplot(data = weather, mapping = aes(x = temp))`. It is producing a backdrop onto which we will "paint" elements. We next proceed by adding a layer---hence, the use of the `+` symbol---to the plot to produce a histogram. (Note also here that we don't have to specify the `data = ` and `mapping = ` text in our function calls. This is covered in more detail in Chapter 5 of the "Getting Used to R, RStudio, and R Markdown" book [@usedtor2016]).
-->
<!--
```{block viz_review, type='review'}
**_Review questions_**
```
**`paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
- Have a variety of bad plots with data for the readers and have readers create better plots with `ggplot2`
- Have sample datasets to work with from problem statements
+ Identify the appropriate plot to address the questions of interest
- Why is it important for barplots to start at zero?
-->
---
<!--Subsection on 5NG -->
## Five Named Graphs - The 5NG {#FiveNG}
For our purposes, we will be limiting consideration to five different types of graphs. We term these five named graphs the **5NG**:
1. scatterplots
1. linegraphs
1. boxplots
1. histograms
1. barplots
We will discuss some variations of these plots, but with this basic repertoire in your toolbox you can visualize a wide array of different data variable types. Note that certain plots are only appropriate for categorical/logical variables and others only for quantitative variables. You'll want to quiz yourself often as we go along on which plot makes sense given a particular problem or data-set.
---
<!--Subsection on scatter plots-->
## 5NG#1: Scatterplots {#scatterplots}
The simplest of the 5NG are *scatterplots* (also called bivariate plots); they allow you to investigate the relationship between two continuous variables. While you may already be familiar with this type of plot, let's view it through the lens of the Grammar of Graphics. Specifically, we will graphically investigate the relationship between the following two continuous variables in the `flights` data frame:
1. `dep_delay`: departure delay on the horizontal "x" axis and
1. `arr_delay`: arrival delay on the vertical "y" axis
for Alaska Airlines flights leaving NYC in 2013. This requires paring down the `flights` data frame to a smaller data frame `all_alaska_flights` consisting of only Alaska Airlines (carrier code "AS") flights. Don't worry for now about what this code is doing; we'll see this in Chapter \@ref(wrangling). Just run it all and understand that we are taking all flights and only considering those corresponding to Alaska Airlines.
```{r}
all_alaska_flights <- flights %>%
filter(carrier == "AS")
```
This code snippet makes use of functions in the `dplyr` package for data wrangling to achieve our goal: it takes the `flights` data frame and `filter`s it to only return the rows which meet the condition `carrier == "AS"`. Recall from Section \@ref(code) that testing for equality is specified with `==` and not `=`. You will see many more examples of `==` and `filter()` in Chapter \@ref(wrangling).
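To see what the condition inside `filter()` is doing, it can help to try `==` on a small vector first. The vector `carriers` below is a made-up example for illustration only and is not part of the `flights` data.

```{r}
# A small hypothetical vector of carrier codes
carriers <- c("AS", "UA", "AS", "DL")

# `==` compares every element with "AS" and returns one TRUE/FALSE per element
carriers == "AS"

# filter() keeps the rows where the condition is TRUE; here we just count them
sum(carriers == "AS")
```

In the same way, `carrier == "AS"` produces one `TRUE` or `FALSE` for each row of `flights`, and `filter()` returns only the rows marked `TRUE`.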
```{block lc-all_alaska_flights, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look at both the `flights` and `all_alaska_flights` data frames by running `View(flights)` and `View(all_alaska_flights)` in the console. In what respect do these data frames differ?
```{asis lc-all_alaska_flights-solutions, include=show_solutions('3-1')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc), ")")`**: `flights` contains all flight data, while `all_alaska_flights` contains only data from the Alaska Airlines carrier "AS". We can see that `flights` has `r nrow(flights)` rows while `all_alaska_flights` has only `r nrow(all_alaska_flights)`.
```
```{block, type='learncheck', purl=FALSE}
```
### Scatterplots via geom_point {#geompoint}
We proceed to create the scatterplot using the `ggplot()` function:
```{r noalpha, fig.cap="Arrival Delays vs Departure Delays for Alaska Airlines flights from NYC in 2013"}
ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point()
```
In Figure \@ref(fig:noalpha) we see that a positive relationship exists between `dep_delay` and `arr_delay`: as departure delays increase, arrival delays tend to also increase. We also note that the majority of points fall near the point (0, 0). There is a large mass of points clustered there. Let's break this down, keeping in mind our discussion in Section \@ref(grammarofgraphics):
* Within the `ggplot()` function call, we specify two of the components of the grammar:
1. The `data` frame to be `all_alaska_flights` by setting `data = all_alaska_flights`
1. The `aes`thetic mapping by setting `aes(x = dep_delay, y = arr_delay)`. Specifically
* the variable `dep_delay` maps to the `x` position aesthetic
* the variable `arr_delay` maps to the `y` position aesthetic
* We add a layer to the `ggplot()` function call using the `+` sign. The layer in question specifies the third component of the grammar: the `geom`etric object. In this case the geometric objects are `point`s, set by specifying `geom_point()`.
Some notes on layers:
* Note that the `+` sign comes at the end of lines, and not at the beginning. You'll get an error in R if you put it at the beginning.
* When adding layers to a plot, you are encouraged to hit *Return* on your keyboard after entering the `+` so that the code for each layer is on a new line. As we add more and more layers to plots, you'll see this will greatly improve the legibility of your code.
* To stress the importance of adding layers, in particular the layer specifying the `geom`etric object, consider Figure \@ref(fig:nolayers) where no layers are added. A not very useful plot!
```{r nolayers, fig.cap="Plot with No Layers"}
ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))
```
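Because layers are added with `+`, a plot can also be built up in steps. As a small sketch (the object names `base_plot` and `delay_plot` are ours, used only for illustration), the empty plot can be saved and the `geom`etric layer added afterwards:

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

all_alaska_flights <- flights %>%
  filter(carrier == "AS")

# Data and aesthetic mapping only: axes are drawn, but no observations
base_plot <- ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))

# Add the geom layer; note that the + ends a line rather than starting the next one
delay_plot <- base_plot +
  geom_point()
delay_plot
```

Printing `base_plot` gives the plot with no layers, while printing `delay_plot` gives the full scatterplot.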
```{block lc-scatterplots, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some practical reasons why `dep_delay` and `arr_delay` have a positive relationship?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What variables (not necessarily in the `flights` data frame) would you expect to have a negative correlation (i.e. a negative relationship) with `dep_delay`? Why? Remember that we are focusing on continuous variables here.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some other features of the plot that stand out to you?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Create a new scatterplot using different variables in the `all_alaska_flights` data frame by modifying the example above.
```{asis lc-scatterplots-solutions, include=show_solutions('3-2')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 4), ")")`**: What are some practical reasons why `dep_delay` and `arr_delay` have a positive relationship? *The later a plane departs, typically the later it will arrive.*
**`r paste0("(LC", chap, ".", (lc - 3), ")")`**: An example in the `weather` dataset is `visibility`, which measures visibility in miles. As visibility increases, we would expect departure delays to decrease.
**`r paste0("(LC", chap, ".", (lc - 2), ")")`**: What does (0, 0) correspond to from the point of view of a passenger on an Alaskan flight? Why do you believe there is a cluster of points near (0, 0)? *The point (0,0) means no delay in departure and arrival. From the passenger's point of view, this means the flight was on time. It seems most flights are at least close to being on time.*
**`r paste0("(LC", chap, ".", (lc), ")")`**: Create a similar plot, but one showing the relationship between departure time and departure delay. What hypotheses do you have about the patterns you see? *We now put `dep_time` as the `x`-aesthetic and `dep_delay` as the `y`-aesthetic.*
```
```{r, include=show_solutions('3-2'), echo=show_solutions('3-2')}
ggplot(data = all_alaska_flights, mapping = aes(x = dep_time, y = dep_delay)) +
geom_point()
```
```{block, type='learncheck', purl=FALSE}
```
### Over-plotting {#overplotting}
The large mass of points near (0, 0) in Figure \@ref(fig:noalpha) can cause some confusion. This is the result of a phenomenon called *overplotting*. As one may guess, this corresponds to values being plotted on top of each other _over_ and _over_ again. It is often difficult to know just how many values are plotted in this way when looking at a basic scatterplot as we have here. There are two ways to address this issue:
1. By adjusting the transparency of the points via the `alpha` argument
1. By jittering the points via `geom_jitter()`
The first way of relieving overplotting is by changing the `alpha` argument in `geom_point()` which controls the transparency of the points. By default, this value is set to `1`. We can change this to any value between `0` and `1` where `0` sets the points to be 100% transparent and `1` sets the points to be 100% opaque. Note how the following function call is identical to the one in Section \@ref(scatterplots), but with `alpha = 0.2` added to the `geom_point()`.
```{r alpha, fig.cap="Delay scatterplot with alpha=0.2"}
ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2)
```
The key feature to note in Figure \@ref(fig:alpha) is that the transparency of the points is cumulative: areas with a high-degree of overplotting are darker, whereas areas with a lower degree are less dark.
Note that there is no `aes()` surrounding `alpha = 0.2` here. Since we are NOT mapping a variable to an aesthetic but instead are just changing a setting, we don't need to create a mapping with `aes()`. In fact, if you change the second line above to `geom_point(aes(alpha = 0.2))`, `ggplot2` will treat `0.2` as if it were a variable in the data: it adds a legend for `alpha` and lets the scale, rather than your value, determine how transparent the points are.
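The difference between *setting* an aesthetic and *mapping* a variable to it can be sketched side by side. The color `"steelblue"`, the use of the `air_time` variable, and the object names `p_set` and `p_map` are our own choices for illustration.

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

all_alaska_flights <- flights %>%
  filter(carrier == "AS")

# Setting: fixed values given outside aes() apply to every point
p_set <- ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2, color = "steelblue")

# Mapping: a variable given inside aes() varies from point to point and gets a legend
p_map <- ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(aes(color = air_time))

p_set
p_map
```

In the first plot every point looks the same; in the second, the color of each point depends on that flight's value of `air_time`.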
The second way of relieving overplotting is to *jitter* the points a bit. In other words, we are going to add just a bit of random noise to the points to better see them and remove some of the overplotting. You can think of "jittering" as shaking the points around a bit on the plot. Instead of using `geom_point`, we use `geom_jitter` to perform this shaking. To specify how much jitter to add, we adjust the `width` and `height` arguments. These correspond to how hard you'd like to shake the plot, in the units of the horizontal and vertical variables respectively (in this case, minutes).
```{r jitter, fig.cap="Jittered delay scatterplot"}
ggplot(all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30)
```
Note how this function call is identical to the one in Subsection \@ref(geompoint), but with `geom_point()` replaced with `geom_jitter()`. The plot in Figure \@ref(fig:jitter) helps us a little bit in getting a sense for the overplotting, but with a relatively large data-set like this one (`r nrow(all_alaska_flights)` flights), it can be argued that changing the transparency of the points by setting `alpha` proved more effective.
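The two remedies can also be combined, and since jittering is random, the plot changes slightly every time the code is run. The sketch below is one possible way to get a repeatable, semi-transparent jittered plot. It assumes a version of `ggplot2` (3.0.0 or later) in which `position_jitter()` accepts a `seed` argument; the seed value `2013` and the object name `jitter_alpha_plot` are arbitrary choices of ours.

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

all_alaska_flights <- flights %>%
  filter(carrier == "AS")

# geom_point() with a jitter position behaves like geom_jitter(),
# but lets us fix the random seed and set alpha at the same time
jitter_alpha_plot <- ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(
    position = position_jitter(width = 30, height = 30, seed = 2013),
    alpha = 0.2
  )
jitter_alpha_plot
```

Whether the combination is an improvement over `alpha` alone is a judgment call that depends on the data-set.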
You may have noticed that in the code to create Figure \@ref(fig:jitter) we have also dropped the `data = ` and the `mapping = ` code before `aes`. Since `ggplot` expects its first argument `data` to be a data frame and its second argument to correspond to `mapping = `, you can omit both and you'll get the same plot. As you get more and more practice, you'll likely find yourself not including the argument names like this. It's good practice to always include them though, especially as you are just beginning to practice with R code.
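As a quick check of this equivalence, the sketch below builds the same scatterplot twice, once with the argument names and once without. The object names `p_named` and `p_positional` are ours.

```{r}
library(nycflights13)
library(dplyr)
library(ggplot2)

all_alaska_flights <- flights %>%
  filter(carrier == "AS")

# Argument names spelled out
p_named <- ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point()

# Argument names omitted: arguments are matched by position instead
p_positional <- ggplot(all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point()

p_named
p_positional
```

Both objects print the same figure; only the amount of typing differs.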
```{block lc-overplotting, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why is setting the `alpha` argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** After viewing Figure \@ref(fig:alpha) above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in Figure \@ref(fig:noalpha)?
```{asis lc-overplotting-solutions, include=show_solutions('3-3')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 1), ")")`**: Why is setting the `alpha` argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot? *It thins out the points so we address overplotting. But more importantly it hints at the (statistical) **density** and **distribution** of the points: where are the points concentrated, where do they occur. We will see more about densities and distributions in Chapter 6 when we switch gears to statistical topics.*
**`r paste0("(LC", chap, ".", (lc), ")")`**: After viewing Figure \@ref(fig:alpha) above, give a range of arrival delays and departure delays that occur most frequently. How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in Figure \@ref(fig:noalpha)? *The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.*
```
```{block, type='learncheck', purl=FALSE}
```
<!--
Maybe include a shading of the points by another variable example here for multivariate thinking?
-->
### Summary
Scatterplots display the relationship between two continuous variables. They are among the most commonly used plots because they can provide an immediate way to see the trend in one variable versus another. However, if you try to create a scatterplot where either one of the two variables is not quantitative, you will get strange results. Be careful!
With medium to large data-sets, you may need to play with either `geom_jitter()` or the `alpha` argument in order to get a good feel for relationships in your data. This tweaking is often a fun part of data visualization since you'll have the chance to see different relationships come about as you make subtle changes to your plots.
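To see what can go wrong when one of the two variables is not quantitative, consider the rough sketch below. It uses the `weather` data frame from `nycflights13` and puts the categorical variable `origin` on the x-axis; this example and the object name `strip_plot` are our own and are shown only as a caution.

```{r}
library(nycflights13)
library(ggplot2)

# origin is categorical (three airport codes), so the points
# pile up in three vertical strips rather than forming a cloud
strip_plot <- ggplot(data = weather, mapping = aes(x = origin, y = temp)) +
  geom_point()
strip_plot
```

The code runs, but the heavy overplotting within each strip makes it hard to compare the three airports.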
---
<!--Subsection on line graphs-->
## 5NG#2: Linegraphs {#linegraphs}
The next of the 5NG is the *linegraph*. Linegraphs are most frequently used when the x-axis represents time and the y-axis represents some other numerical variable; such plots are known as *time series*. Time has a natural ordering: each day follows the previous day. Linegraphs should be avoided when there is not a clear sequential ordering to the explanatory variable, i.e. the x-variable or the *predictor* variable.
Our focus now turns to the `temp` variable in the `weather` data-set. By

* looking over the `weather` data-set by typing `View(weather)` in the console and
* running `?weather` to bring up the help file,

we can see that the `temp` variable corresponds to hourly temperature (in Fahrenheit) recordings at weather stations near airports in New York City. Instead of considering all hours in 2013 for all three airports in NYC, let's focus on the hourly temperature at Newark airport (`origin` code "EWR") for the first 15 days in January 2013. The `weather` data frame in the `nycflights13` package contains this data, but we first need to filter it to only include those rows that correspond to Newark in the first 15 days of January.
```{r}
early_january_weather <- weather %>%
filter(origin == "EWR" & month == 1 & day <= 15)
```
This is similar to the previous use of the `filter` command in Section \@ref(scatterplots); however, we now use the `&` operator. The above selects only those rows in `weather` where the originating airport is `"EWR"` **and** we are in the first month **and** the day is from 1 to 15 inclusive.
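As a side note, `filter()` in `dplyr` also accepts several conditions separated by commas and combines them with "and". The sketch below compares the two spellings; the object name `early_january_weather_commas` is ours, used only for this comparison.

```{r}
library(nycflights13)
library(dplyr)

# Conditions joined explicitly with &
early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1 & day <= 15)

# The same conditions separated by commas
early_january_weather_commas <- weather %>%
  filter(origin == "EWR", month == 1, day <= 15)

# Check that the two data frames agree
all.equal(early_january_weather, early_january_weather_commas)
```

Which form to use is a matter of taste; we stick with `&` here because it makes the "and" explicit.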
```{block lc-early_january_weather, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Take a look at both the `weather` and `early_january_weather` data frames by running `View(weather)` and `View(early_january_weather)` in the console. In what respect do these data frames differ?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** The weather data is recorded hourly. Why does the `time_hour` variable correctly identify the hour of the measurement whereas the `hour` variable does not?
```{asis lc-early_january_weather-solutions, include=show_solutions('3-4')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 1), ")")`**: Take a look at both the `weather` and `early_january_weather` data frames by running `View(weather)` and `View(early_january_weather)` in the console. In what respect do these data frames differ? *The rows of `early_january_weather` are a subset of `weather`.*
**`r paste0("(LC", chap, ".", (lc), ")")`**: The `weather` data is recorded hourly. Why does the `time_hour` variable correctly identify the hour of the measurement whereas the `hour` variable does not? *Because to uniquely identify an hour, we need the `year`/`month`/`day`/`hour` sequence, whereas there are only 24 possible `hour` values. Note that in the case of `weather`, there is a timezone bug: the `time_hour` variable is off by 5 hours from the `year`/`month`/`day`/`hour` sequence, since the Eastern Time Zone is 5 hours off UTC.*
```
```{block, type='learncheck', purl=FALSE}
```
### Linegraphs via geom_line {#geomline}
We plot a linegraph of hourly temperature using `geom_line()`:
```{r hourlytemp, fig.cap="Hourly Temperature in Newark for January 1-15, 2013"}
ggplot(data = early_january_weather, aes(x = time_hour, y = temp)) +
geom_line()
```
Much as with the `ggplot()` call in Section \@ref(geompoint), we describe the components of the Grammar of Graphics:
* Within the `ggplot()` function call, we specify two of the components of the grammar:
1. The `data` frame to be `early_january_weather` by setting `data = early_january_weather`
1. The `aes`thetic mapping by setting `aes(x = time_hour, y = temp)`. Specifically
* `time_hour` (i.e. the time variable) maps to the `x` position
* `temp` maps to the `y` position
* We add a layer to the `ggplot()` function call using the `+` sign
* The layer in question specifies the third component of the grammar: the `geom`etric object in question. In this case the geometric object is a `line`, set by specifying `geom_line()`.
```{block lc-linegraph, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why are linegraphs frequently used when time is the explanatory variable?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Plot a time series of a variable other than `temp` for Newark Airport in the first 15 days of January 2013.
```{asis lc-linegraph-solutions, include=show_solutions('3-5')}
**Learning Check Solutions**
**LC `r paste0("(LC", chap, ".", (lc - 2), ")")`**: Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis? *Because lines suggest connectedness and ordering.*
**LC `r paste0("(LC", chap, ".", (lc - 1), ")")`**: Why are linegraphs frequently used when time is the explanatory variable? *Because time is sequential: subsequent observations are closely related to each other.*
**LC `r paste0("(LC", chap, ".", (lc), ")")`**: Plot a time series of a variable other than `temp` for Newark Airport in the first 15 days of January 2013. *Humidity is a good one to look at, since it is very closely related to the cycles of a day.*
```
```{r, include=show_solutions('4-4'), echo=show_solutions('3-6')}
ggplot(data = early_january_weather, aes(x = time_hour, y = humid)) +
geom_line()
```
```{block, type='learncheck', purl=FALSE}
```
### Summary
Linegraphs, just like scatterplots, display the relationship between two continuous variables. However, the variable on the x-axis (i.e. the explanatory variable) should have a natural ordering, like some notion of time. We can mislead our audience if that isn't the case.
---
<!--Subsection on histograms -->
## 5NG#3: Histograms {#histograms}
Let's consider the `temp` variable in the `weather` data frame once again, but now, unlike with the linegraphs in Section \@ref(linegraphs), let's say we don't care about the relationship of temperature to time, but rather about the *(statistical) distribution* of temperatures. We could just produce points where each of the different values appears on something similar to a number line:
```{r echo=FALSE, fig.height=0.8, fig.cap="Plot of Hourly Temperature Recordings from NYC in 2013"}
ggplot(data = weather, mapping = aes(x = temp, y = factor("A"))) +
geom_point() +
theme(axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank())
hist_title <- "Histogram of Hourly Temperature Recordings from NYC in 2013"
```
This gives us a general idea of how the values of `temp` differ. We see that temperatures vary from around `r round(min(weather$temp, na.rm = TRUE), 0)` up to `r round(max(weather$temp, na.rm = TRUE), 0)` degrees Fahrenheit. The area between 40 and 60 degrees appears to have more points plotted than outside that range.
### Histograms via geom_histogram {#geomhistogram}
What is commonly produced instead of the above plot is a plot known as a *histogram*. The histogram shows how many elements of a single numerical variable fall in specified *bins*. In this case, these bins might correspond to 0-10°F, 10-20°F, etc. We produce a histogram of the hourly temperatures at all three NYC airports in 2013:
```{r, warning=TRUE, fig.cap=hist_title}
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram()
```
Note here:
* There is only one variable being mapped in `aes()`: the single continuous variable `temp`. You don't need to compute the y-aesthetic: it gets computed automatically.
* We set the `geom`etric object to be `geom_histogram()`
* We got a warning message of `1 rows containing non-finite values` being removed. This is due to one of the values of temperature being missing. R is alerting us that this happened.
* A second message urges us to choose the number of bins ourselves (via `bins` or `binwidth`) rather than rely on the default.
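The automatic computation of the vertical axis noted above can be mimicked by hand. The sketch below is only an illustration: the `toy_temps` values are invented, and the bin edges are our own choice, so they will not necessarily match the edges that `geom_histogram()` would pick.

```r
# Hypothetical temperatures in degrees Fahrenheit, invented for illustration
toy_temps <- c(12, 18, 25, 31, 33, 38, 47)

# Assign each value to a bin of width 10: [10,20), [20,30), [30,40), [40,50)
toy_bins <- cut(toy_temps, breaks = seq(10, 50, by = 10), right = FALSE)

# Count how many values land in each bin; these counts are the bar heights
bin_counts <- table(toy_bins)
bin_counts
```

The four counts (2, 1, 3 and 1) are what a histogram of these seven values with the same bins would draw as bar heights.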
### Adjusting the bins {#adjustbins}
We can adjust characteristics of the bins in one of *two* ways:
1. By adjusting the number of bins via the `bins` argument
1. By adjusting the width of the bins via the `binwidth` argument
First, we have the power to specify how many bins we would like to put the data into as an argument in the `geom_histogram()` function. By default, this is chosen to be 30 somewhat arbitrarily; we have received a warning above our plot that this was done.
```{r fig.cap=paste(hist_title, "- 60 Bins")}
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 60, color = "white")
```
Note the addition of the `color` argument. If you'd like to be able to more easily differentiate each of the bins, you can specify the color of the outline as done above.
Second, instead of specifying the number of bins, we can also specify the width of the bins by using the `binwidth` argument in the `geom_histogram` function.
```{r fig.cap=paste(hist_title, "- Binwidth = 10"), fig.height=5}
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 10, color = "white")
```
```{block lc-histogram, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Would you classify the distribution of temperatures as symmetric or skewed?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What would you guess is the "center" value in this distribution? Why did you make that choice?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Is this data spread out greatly from the center or is it close? Why?
```{asis lc-histogram-solutions, include=show_solutions('3-7')}
**Learning Check Solutions**
**LC `r paste0("(LC", chap, ".", (lc - 3), ")")`: What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?** The distribution doesn't change much. But by refining the bin width, we see that the temperature data has a high degree of precision. What do I mean by precision? Looking at the `temp` variable by `View(weather)`, we see that each temperature is recorded to 2 decimal places.
**LC `r paste0("(LC", chap, ".", (lc - 2), ")")`: Would you classify the distribution of temperatures as symmetric or skewed?** It is rather symmetric, i.e. there are no __long tails__ on only one side of the distribution.
**LC `r paste0("(LC", chap, ".", (lc - 1), ")")`: What would you guess is the "center" value in this distribution? Why did you make that choice?** The center is around `r mean(weather$temp, na.rm=TRUE)`°F. By running the `summary()` command, we see that the mean and median are very similar. In fact, when the distribution is symmetric the mean equals the median.
**`r paste0("(LC", chap, ".", (lc), ")")`: Is this data spread out greatly from the center or is it close? Why?**
This can only be answered relatively speaking! Let's pick things to be relative to Seattle, WA temperatures:
![](images/temp.png)
While it appears that Seattle weather has a similar center of 55°F, its
temperatures are almost entirely between 35°F and 75°F for a range of
about 40°F. Seattle temperatures are much less spread out than New York's, i.e.
much more consistent over the year. New York, on the other hand, has much colder
days in the winter and much hotter days in the summer. Expressed differently,
the middle 50% of values, as delineated by the interquartile range, spans 30°F:
```
```{r, echo=show_solutions('4-5'), include=show_solutions('4-5'), message=FALSE, warning=FALSE}
IQR(weather$temp, na.rm=TRUE)
```
```{r, echo=show_solutions('4-5'), include=show_solutions('4-5'), message=FALSE, warning=FALSE}
summary(weather$temp)
```
```{block, type='learncheck', purl=FALSE}
```
### Summary
Histograms, unlike scatterplots and linegraphs, present information on only a single continuous variable. In particular they are visualizations of the (statistical) distribution of values.
---
<!--Section on Facets-->
## Facets {#facets}
Before continuing the 5NG, we briefly introduce a new concept called *faceting*. Faceting is used when we'd like to create small multiples of the same plot over a different categorical variable. By default, all of the small multiples will have the same vertical axis.
For example, suppose we were interested in looking at how the temperature histograms we saw in Section \@ref(histograms) varied by month. This is what is meant by "the distribution of a variable over another variable": `temp` is one variable and `month` is the other variable. In order to look at histograms of `temp` for each month, we add a layer `facet_wrap(~ month)`. You can also specify how many rows you'd like the small multiple plots to be in using `nrow` or how many columns using `ncol` inside of `facet_wrap`.
```{r facethistogram, fig.cap="Faceted histogram"}
ggplot(data = weather, aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~ month, nrow = 4)
```
Note the use of the `~` before `month` in `facet_wrap`. The tilde (`~`) is required and you'll receive the error `Error in as.quoted(facets) : object 'month' not found` if you don't include it before `month` here.
As we might expect, the temperature tends to increase as summer approaches and then decrease as winter approaches.
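As a rough sketch of the `ncol` option mentioned above, and of how the shared vertical axis could be relaxed, consider the following. The `toy_weather` data frame is a hypothetical stand-in for `weather`, and `scales = "free_y"` is an argument of `facet_wrap()` that we have not otherwise used in this chapter.

```r
library(ggplot2)

# Hypothetical stand-in for `weather`: a few temperatures for three months
toy_weather <- data.frame(
  month = rep(c(1, 4, 7), each = 5),
  temp  = c(25, 30, 33, 36, 41,
            48, 52, 55, 59, 63,
            72, 76, 80, 83, 88)
)

facet_plot <- ggplot(data = toy_weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  # ncol fixes the number of columns; scales = "free_y" lets each panel
  # use its own vertical axis instead of the shared default
  facet_wrap(~ month, ncol = 3, scales = "free_y")

facet_plot
```

Keeping the default shared axis usually makes panels easier to compare, so freeing the axis is a choice to make deliberately rather than a general recommendation.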
```{block lc-facet, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Does the `temp` variable in the `weather` dataset have a lot of variability? Why do you say that?
```{asis lc-facet-solutions, include=show_solutions('3-8')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 3), ")")`: What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?**
* Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.
* The two variables whose relationship we are seeing are `temp` and `month`.
**LC `r paste0("(LC", chap, ".", (lc - 2), ")")`: What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?**
* While month is technically a number between 1 and 12, we're viewing it as a categorical variable here. Specifically, it is an **ordinal categorical** variable since there is an ordering to the categories.
* 25, 50, 75, 100 are temperatures
**`r paste0("(LC", chap, ".", (lc - 1), ")")`: For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the variability of the variables and other important characteristics?** Having histograms split by day would not be great:
* We'd have 365 facets to look at. Way too many.
* We don't really care about day-to-day fluctuation in weather so much, but maybe more week-to-week variation. We'd like to focus on seasonal trends.
**`r paste0("(LC", chap, ".", (lc), ")")`: Does the `temp` variable in the `weather` dataset have a lot of variability? Why do you say that?**
Again, like in LC `r paste0("(LC", chap, ".", (lc-4), ")")`, this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain!
```
```{block, type='learncheck', purl=FALSE}
```
---
<!--Subsection on boxplots -->
## 5NG#4: Boxplots {#boxplots}
While using faceted histograms can provide a way to compare distributions of a continuous variable split by groups of a categorical variable as in Section \@ref(facets), an alternative plot called a *boxplot* (also called a *side-by-side boxplot*) achieves the same task and is frequently preferred. The *boxplot* uses the information provided in the *five-number summary* referred to in Appendix \@ref(appendixA). It gives a way to compare this summary information across the different levels of a categorical variable.
### Boxplots via geom_boxplot {#geomboxplot}
Let's create a boxplot to compare the monthly temperatures as we did above with the faceted histograms.
```{r badbox, fig.cap="Invalid boxplot specification", fig.height=3.5}
ggplot(data = weather, aes(x = month, y = temp)) +
geom_boxplot()
```
```
Warning messages:
1: Continuous x aesthetic -- did you forget aes(group=...)?
2: Removed 1 rows containing non-finite values (stat_boxplot).
```
Note the set of warnings that is given here. The second warning corresponds to missing values in the data frame and it is turned off on subsequent plots. Let's focus on the first warning.
Observe that this plot does not look like what we were expecting. We were expecting to see the distribution of temperatures for each month (so 12 different boxplots). The first warning is letting us know that we are plotting a continuous, and not a categorical, variable on the x-axis. This gives us the overall boxplot without any other groupings. We can get around this by introducing a new function for our `x` variable:
```{r monthtempbox, fig.cap="Month by temp boxplot", fig.height=3.7}
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
```
We have introduced a new function called `factor()` here. One of the things this function does is to convert a numerical variable like `month` (1, 2, ..., 12) into a categorical variable. The "box" part of this plot represents the 25^th^ percentile, the median (50^th^ percentile), and the 75^th^ percentile. The dots correspond to *outliers*. (The specific formulation for these outliers is discussed in Appendix \@ref(appendixA).) The lines show how the data outside the center 50%, as defined by the first and third quartiles, varies. Longer lines correspond to more variability and shorter lines correspond to less variability.
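To connect these pieces to actual numbers, here is a small sketch in base R. The `toy_temps` values are invented, and the fences assume the common convention of 1.5 times the interquartile range beyond each end of the box, which is also the default used by `geom_boxplot()`.

```r
# Hypothetical temperatures for one month, invented for illustration
toy_temps <- c(30, 32, 35, 36, 38, 40, 41, 43, 45, 80)

# The three horizontal lines of the box: 25th, 50th and 75th percentiles
box_values <- quantile(toy_temps, probs = c(0.25, 0.5, 0.75))

# Interquartile range: the length of the box
box_length <- IQR(toy_temps)

# Conventional fences: 1.5 box lengths beyond each end of the box
lower_fence <- box_values[["25%"]] - 1.5 * box_length
upper_fence <- box_values[["75%"]] + 1.5 * box_length

# Values outside the fences would be drawn as individual dots
toy_outliers <- toy_temps[toy_temps < lower_fence | toy_temps > upper_fence]
toy_outliers
```

Here the box runs from 35.25 to 42.5 with the median at 39, and the single value of 80 lies beyond the upper fence, so it would appear as a dot.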
```{block lc-boxplot, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which months have the highest variability in temperature? What reasons do you think this is?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** We looked at the distribution of a continuous variable over a categorical variable here with this boxplot. Why can't we look at the distribution of one continuous variable over the distribution of another continuous variable? Say, temperature across pressure, for example?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
```{asis lc-boxplot-solutions, include=show_solutions('3-9')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 3), ")")`: What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.**
It appears to be an outlier. Let's revisit the use of the `filter` command to home in on it. We want all data points where the `month` is 5 and `temp < 25`.
```
```{r, echo=show_solutions('3-9'), eval=FALSE}
weather %>%
filter(month==5 & temp < 25)
```
```{r, include=show_solutions('3-9'), echo=FALSE}
weather %>%
filter(month==5 & temp < 25) %>%
kable()
```
```{asis, include=show_solutions('3-9')}
There appears to be only one hour and only at JFK that recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake!
Why wasn't the weather at least similar at EWR (Newark) and LGA (La Guardia)?
**`r paste0("(LC", chap, ".", (lc - 2), ")")`: Which months have the highest variability in temperature? What reasons do you think this is?**
We are now interested in the **spread** of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):
* The distance from the 1st to the 3rd quartiles i.e. the length of the boxes
* You can also think of this as the spread of the **middle 50%** of the data
Just from eyeballing it, it seems
* November has the biggest IQR, i.e. the tallest box, so has the most variation in temperature
* August has the smallest IQR, i.e. the shortest box, so is the most consistent temperature-wise
Here's how we compute the exact IQR values for each month (we'll see this in more depth in Chapter 5 of the text):
1. `group` the observations by `month` then
1. for each `group`, i.e. `month`, `summarize` it by applying the summary statistic function `IQR()`, while making sure to skip over missing data via `na.rm=TRUE` then
1. `arrange` the table in `desc`ending order of `IQR`
```
```{r, echo=show_solutions('3-9'), eval=FALSE}
weather %>%
group_by(month) %>%
summarize(IQR = IQR(temp, na.rm=TRUE)) %>%
arrange(desc(IQR))
```
```{r, echo=FALSE, include=show_solutions('3-9')}
weather %>%
group_by(month) %>%
summarize(IQR = IQR(temp, na.rm=TRUE)) %>%
arrange(desc(IQR)) %>%
kable()
```
```{asis, include=show_solutions('3-9')}
**`r paste0("(LC", chap, ".", (lc - 1), ")")`: We looked at the distribution of a continuous variable over a categorical variable here with this boxplot. Why can't we look at the distribution of one continuous variable over the distribution of another continuous variable? Say, temperature across pressure, for example?**
Because we need a way to group many continuous observations together, say by grouping by month. For pressure, we have nearly unique values, i.e. no groups, so we can't make boxplots.
**`r paste0("(LC", chap, ".", (lc), ")")`: Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?**
In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, they are explicitly labelled separately.
```
```{block, type='learncheck', purl=FALSE}
```
### Summary
Boxplots provide a way to compare and contrast the distribution of one quantitative variable across multiple levels of one categorical variable. One can see where the median falls across the different groups by looking at the center line in the box. To see how spread out the variable is across the different groups, look at both the height of the box and also how far the lines stretch vertically from the box. (If the lines stretch far from the box but the box is short, the variability of the values closer to the center is much smaller than the variability of the outer ends of the variable.) Outliers are even more easily identified when looking at a boxplot than when looking at a histogram.
---
<!--Subsection on barplots -->
## 5NG#5: Barplots {#geombar}
Both histograms and boxplots represent ways to visualize the variability of continuous variables. Another common task is to present the distribution of a categorical variable. This is a simpler task, focused on how many elements from the data fall into different categories of the categorical variable. Often the best way to visualize these different counts (also known as *frequencies*) is via a barplot, also known as a barchart.
One complication, however, is how your counts are represented in your data. For example, run the following code in your Console. This code manually creates two data frames representing counts of fruit.
```{r}
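# One row per piece of fruit (counts not yet tabulated)
fruits <- tibble(
  fruit = c("apple", "apple", "apple", "orange", "orange")
)
# One row per kind of fruit, with its count stored in `number`
fruits_counted <- tibble(
  fruit = c("apple", "orange"),
  number = c(3, 2)
)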
```
We see both the `fruits` and `fruits_counted` data frames represent the same collection of fruit: three apples and two oranges. However, whereas `fruits` just lists the fruit:
```{r, echo=FALSE}
kable(
fruits,
digits=2,
caption = "Fruits",
booktabs = TRUE
)
```
`fruits_counted` has a variable `number`, where the counts are pre-tabulated.
```{r, echo=FALSE}
kable(
fruits_counted,
digits=2,
caption = "Fruits (Pre-Counted)",
booktabs = TRUE
)
```
Compare the barcharts in Figures \@ref(fig:geombar) and \@ref(fig:geomcol), which are identical, but are based on two different data frames:
```{r geombar, fig.cap="Barplot when counts are not pre-tabulated", fig.height=2.5}
ggplot(data = fruits, mapping = aes(x = fruit)) +
geom_bar()
```
```{r, geomcol, fig.cap="Barplot when counts are pre-tabulated", fig.height=2.5}
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
geom_col()
```
Observe that:
* The code that generates Figure \@ref(fig:geombar) based on `fruits` does not have an explicit `y` `aes`thetic and uses `geom_bar()`
* The code that generates Figure \@ref(fig:geomcol) based on `fruits_counted` has an explicit `y` `aes`thetic (to the variable `number`) and uses `geom_col()`
This one aspect of creating barplots using `ggplot2` causes the most initial confusion: when the categorical variable you want to plot is not pre-tabulated in your data frame you need to use `geom_bar`, but if the categorical variable is pre-tabulated and stored in a variable, you need to use `geom_col` and explicitly map this variable to the `y` aesthetic.
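One way to see how the two representations relate is to tabulate the un-counted data ourselves. The sketch below is only illustrative: `fruits_tabulated` is a name we introduce here, and the `group_by()`, `summarize()`, and `n()` functions from `dplyr` are covered properly in Chapter \@ref(wrangling).

```r
library(dplyr)

# One row per piece of fruit, as in the `fruits` data frame
fruits <- tibble(
  fruit = c("apple", "apple", "apple", "orange", "orange")
)

# Tabulate by hand: one row per kind of fruit, with its count in `number`
fruits_tabulated <- fruits %>%
  group_by(fruit) %>%
  summarize(number = n())

fruits_tabulated
```

The result has the same shape as `fruits_counted`, which suggests why `geom_bar()` on the raw rows and `geom_col()` on the tabulated rows draw the same bars: `geom_bar()` does this counting for us.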
### Barplots via geom_bar/geom_col
Consider the distribution of airlines that flew out of New York City in 2013. Here we explore the number of flights from each airline/`carrier`. This can be plotted by invoking the `geom_bar` function in `ggplot2`:
```{r flightsbar, fig.cap="Number of flights departing NYC in 2013 by airline using geom_bar", fig.height=2.5}
ggplot(data = flights, mapping = aes(x = carrier)) +
geom_bar()
```
To see which airlines these `carrier` codes correspond to, we can look at the `airlines` data frame in the `nycflights13` package. Note the use of the `kable` function here from the `knitr` package, which produces a nicely-formatted table of the values in the `airlines` data frame.
```{r}
kable(airlines)
```
Going back to our barplot, we see that United Air Lines, JetBlue Airways, and ExpressJet Airlines had the most flights depart New York City in 2013. To get the actual number of flights by each airline we can use the `group_by()`, `summarize()`, and `n()` functions in the `dplyr` package on the `carrier` variable in `flights`, which we will introduce formally in Chapter \@ref(wrangling).
```{r message=FALSE}
flights_table <- flights %>%
group_by(carrier) %>%
summarize(number = n())
kable(flights_table)
```
In this table, the counts of the carriers are pre-tabulated. To create a barchart using the data frame `flights_table`, we use `geom_col` and map the `y` aesthetic to the variable `number`. Compare this barplot using `geom_col` in Figure \@ref(fig:flightscol) with the earlier barplot using `geom_bar` in Figure \@ref(fig:flightsbar). They are identical.
```{r flightscol, fig.cap="Number of flights departing NYC in 2013 by airline using geom_col", fig.height=2.5}
ggplot(data = flights_table, mapping = aes(x = carrier, y = number)) +
geom_col()
```
<!--
**Technical note**: Refer to the use of `::` in both lines of code above. This is another way of ensuring the correct function is called. A `count` exists in a couple different packages and sometimes you'll receive strange errors when a different instance of a function is used. This is a great way of telling R that "I want this one!". You specify the name of the package directly before the `::` and then the name of the function immediately after `::`.
-->
```{block lc-barplot, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why are histograms inappropriate for visualizing categorical variables?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the difference between histograms and barplots?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How many Envoy Air flights departed NYC in 2013?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?
```{asis lc-barplot-solutions, include=show_solutions('3-10')}
**Learning Check Solutions**
* **`r paste0("(LC", chap, ".", (lc - 3), ")")`: Why are histograms inappropriate for visualizing categorical variables?** Histograms are for continuous variables i.e. the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable.
* **`r paste0("(LC", chap, ".", (lc - 2), ")")`: What is the difference between histograms and barplots?** See above.
* **`r paste0("(LC", chap, ".", (lc - 1), ")")`: How many Envoy Air flights departed NYC in 2013?** Envoy Air is carrier code `MQ` and thus 26397 flights departed NYC in 2013.
* **`r paste0("(LC", chap, ".", (lc), ")")`: What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?** What a pain! We'll see in Chapter 5 on Data Wrangling that applying `arrange(desc(number))` will sort this table in descending order of `number`!
```
```{block, type='learncheck', purl=FALSE}
```
### Must avoid pie charts!
Unfortunately, one of the most common plots seen today for categorical data is the pie chart. While they may seem harmless enough, they actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book "Creating More Effective Graphs" [@robbins2013], we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another.
Let's examine our previous barplot example on the number of flights departing NYC by airline. This time we will use a pie chart. As you review this chart, try to identify
- how much larger the portion of the pie is for ExpressJet Airlines (`EV`) compared to US Airways (`US`),
- what the third largest carrier is in terms of departing flights, and
- how many carriers have fewer flights than United Airlines (`UA`)?
```{r carrierpie, echo=FALSE, fig.cap="The dreaded pie chart", fig.height=5}
ggplot(flights, aes(x = factor(1), fill = carrier)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
guides(fill = guide_legend(keywidth = 0.8, keyheight = 0.8))
```
While it is quite easy to look back at the barplot to get the answer to these questions, it's quite difficult to get the answers correct when looking at the pie graph. Barplots can always present the information in a way that is easier for the eye to determine relative position. There may be one exception from Nathan Yau at [FlowingData.com][fd] but we will leave this for the reader to decide:
[fd]: https://flowingdata.com/2008/09/19/pie-i-have-eaten-and-pie-i-have-not-eaten/ "Pie I Have Eaten and Pie I Have Not Eaten"
```{r echo=FALSE, fig.align='center', fig.cap="The only good pie chart", out.height=if(knitr:::is_latex_output()) '2.5in', purl=FALSE}
knitr::include_graphics("images/Pie-I-have-Eaten.jpg")
```
```{block lc-pie-charts, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why should pie charts be avoided and replaced by barplots?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is your opinion as to why pie charts continue to be used?
```{asis lc-pie-charts-solutions, include=show_solutions('3-11')}
**Learning Check Solutions**
* **`r paste0("(LC", chap, ".", (lc - 1), ")")`: Why should pie charts be avoided and replaced by barplots?** In my **opinion**, comparisons using horizontal lines are easier than comparing angles and areas of circles.
* **`r paste0("(LC", chap, ".", (lc), ")")`: What is your opinion as to why pie charts continue to be used?** Legacy?
```
```{block, type='learncheck', purl=FALSE}
```
### Using barplots to compare two categorical variables
Barplots are the go-to way to visualize the frequency of different categories of a categorical variable. They make it easy to order the counts and to compare the frequencies of one group to another. Another use of barplots (unfortunately, sometimes inappropriately and confusingly) is to compare two categorical variables together. Let's examine the distribution of outgoing flights from NYC by `carrier` and `airport`.
We begin by getting the names of the airports in NYC that were included in the `flights` dataset. Here, we preview the `inner_join()` function from Chapter \@ref(wrangling). This function will join the data frame `flights` with the data frame `airports` by matching rows that have the same airport code. However, in `flights` the airport code is included in the `origin` variable whereas in `airports` the airport code is included in the `faa` variable. We will revisit such examples in Section \@ref(joins) on joining datasets.
```{r message=FALSE}
flights_namedports <- flights %>%
inner_join(airports, by = c("origin" = "faa"))
```
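To see what this kind of join does on a scale small enough to inspect by eye, here is a minimal sketch. The `toy_flights` and `toy_airports` data frames are hypothetical and their values are invented; only the `by = c("origin" = "faa")` pattern is taken from the code above.

```r
library(dplyr)

# Hypothetical miniature versions of `flights` and `airports`
toy_flights <- tibble(
  carrier = c("UA", "B6", "UA"),
  origin  = c("EWR", "JFK", "LGA")
)
toy_airports <- tibble(
  faa  = c("EWR", "JFK", "LGA"),
  name = c("Newark Liberty Intl", "John F Kennedy Intl", "La Guardia")
)

# Match `origin` in the first data frame with `faa` in the second
toy_joined <- toy_flights %>%
  inner_join(toy_airports, by = c("origin" = "faa"))

toy_joined
```

Each flight keeps its own columns and gains the matching `name`; the key column appears once, under the name it had in the first data frame (`origin`).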
After running `View(flights_namedports)`, we see that `name` now corresponds to the name of the airport as referenced by the `origin` variable. We will now plot `carrier` as the horizontal variable. When we specify `geom_bar`, it will specify `count` as being the vertical variable. A new addition here is `fill = name`. Look over what was produced from the plot to get an idea of what this argument gives.
```{r, fig.cap="Stacked barplot comparing the number of flights by carrier and airport", fig.height=3.5}
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
geom_bar()
```
This plot is what is known as a *stacked barplot*. While simple to make, it often leads to many problems. For example in this plot, it is difficult to compare the heights of the different colors (corresponding to the number of flights from each airport) between the bars (corresponding to the different carriers).
Note that `fill` is an `aes`thetic just like `x` is an `aes`thetic, and thus must be included within the parentheses of the `aes()` mapping. The following code, where the `fill` `aes`thetic is specified outside of `aes()`, will yield an error. This is a fairly common error that new `ggplot` users make:
```
ggplot(data = flights_namedports, mapping = aes(x = carrier), fill = name) +
geom_bar()
```
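To see the distinction more concretely, here is a minimal sketch on a hypothetical `toy_flights` data frame (the values are invented). Mapping a variable to `fill` belongs inside `aes()`, whereas setting every bar to one fixed color is done outside `aes()`, in the `geom_bar()` call, using a quoted color name.

```r
library(ggplot2)

# Hypothetical miniature stand-in for `flights_namedports`
toy_flights <- data.frame(
  carrier = c("UA", "UA", "UA", "B6", "B6"),
  name    = c("Newark", "Newark", "La Guardia", "JFK", "JFK")
)

# Mapping: the fill color varies with the `name` variable, so it goes in aes()
mapped_fill <- ggplot(data = toy_flights,
                      mapping = aes(x = carrier, fill = name)) +
  geom_bar()

# Setting: every bar gets the same fixed color, so it goes outside aes()
constant_fill <- ggplot(data = toy_flights, mapping = aes(x = carrier)) +
  geom_bar(fill = "steelblue")

mapped_fill
constant_fill
```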
```{block lc-barplot-two-var, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What kinds of questions are not easily answered by looking at the above figure?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?
```{asis lc-barplot-two-var-solutions, include=show_solutions('3-12')}
**Learning Check Solutions**
* **`r paste0("(LC", chap, ".", (lc - 1), ")")` What kinds of questions are not easily answered by looking at the above figure?** Because the red, green, and blue bars don't all start at 0 (only red does), it makes comparing counts hard.
* **`r paste0("(LC", chap, ".", (lc), ")")` What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?** The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn't prefer airports, each color would be roughly one third of each bar.
```
```{block, type='learncheck', purl=FALSE}
```
Another variation on the stacked barplot is the *side-by-side barplot*.
```{r, fig.cap="Side-by-side barplot comparing the number of flights by carrier and airport", fig.height=5}
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
geom_bar(position = "dodge")
```
```{block lc-barplot-stacked, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Why might the side-by-side barplot be preferable to a stacked barplot in this case?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are the disadvantages of using a side-by-side barplot, in general?
```{asis lc-barplot-stacked-solutions, include=show_solutions('3-13')}
**Learning Check Solutions**