forked from clauswilke/dataviz
aesthetic_mapping.Rmd
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(forcats)
library(egg)
library(lubridate)
```
# (PART\*) Part I: From data to visualization {-}
# Visualizing data: mapping data onto aesthetics
Whenever we visualize data, we take data values and convert them in a systematic and logical way into the visual elements that make up the final graphic. Even though there are myriad different data visualizations, and at first glance a scatter plot, a pie chart, and a heatmap don't seem to have much in common, all these visualizations can be described with a common language that captures how data values are turned into blobs of ink on paper or colored pixels on screen. The key insight is the following: All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as *aesthetics.*
## Aesthetics and types of data
Aesthetics describe every aspect of a given graphical element. A few examples are provided in Figure \@ref(fig:common-aesthetics). A critical component of every graphical element is of course its *position,* which describes where the element is located. In standard 2d graphics, we describe positions by an *x* and *y* value, but other coordinate systems and one- or three-dimensional visualizations are possible. Next, all graphical elements have a *shape*, a *size*, and a *color.* Even if we are preparing a black-and-white drawing, graphical elements need to have a color to be visible, for example black if the background is white or white if the background is black. Finally, to the extent we are using lines to visualize data, these lines may have different widths or dash--dot patterns. Beyond the examples shown in Figure \@ref(fig:common-aesthetics), there are many other aesthetics we may encounter in a data visualization. For example, if we want to display text, we may have to specify font family, font face, and font size, and if graphical objects overlap, we may have to specify whether they are partially transparent.
(ref:common-aesthetics) Commonly used aesthetics in data visualization: position, shape, size, color, line width, line type. Some of these aesthetics can represent both continuous and discrete data (position, size, line width, color) while others can only represent discrete data (shape, line type).
```{r common-aesthetics, fig.width = 7.5, fig.asp = 0.45, fig.cap = '(ref:common-aesthetics)'}
aes_pos <- ggdraw() +
geom_segment(data = data.frame(x = c(0, 0.5),
xend = c(1, 0.5),
y = c(0.5, 0),
yend = c(0.5, 1)),
aes(x = x, y = y, xend = xend, yend = yend),
arrow = arrow(length = grid::unit(12, "pt")), size = .75) +
draw_text("y", .5, 1, size = 14, vjust = 1, hjust = 2.5) +
draw_text("x", 1, .5, size = 14, vjust = 2, hjust = 1) +
coord_cartesian(xlim = c(-.2, 1.2), ylim = c(-.2, 1.2))
aes_color <- ggdraw() +
geom_tile(data = data.frame(x = 0.15 + .2333*(0:3)),
aes(x, y = .5, fill = factor(x)), width = .2, height = .6) +
scale_fill_OkabeIto(guide = "none")
aes_shape <- ggdraw() +
geom_point(data = data.frame(x = (.5 + 0:3)/4),
aes(x, y = .5, shape = factor(x)), size = 8, fill = "grey80") +
scale_shape_manual(values = 21:24)
aes_size <- ggdraw() +
geom_point(data = data.frame(x = (.5 + 0:3)/4),
aes(x, y = .5, size = factor(x)), shape = 21, fill = "grey80") +
scale_size_manual(values = c(2, 5, 8, 11))
aes_lwd <- ggdraw() +
geom_segment(data = data.frame(x = rep(0.05, 4),
xend = rep(0.95, 4),
y = (1.5 + 0:3)/6,
yend = (1.5 + 0:3)/6,
size = 4:1),
aes(x = x, y = y, xend = xend, yend = yend, size = size)) +
scale_size_identity()
aes_ltp <- ggdraw() +
geom_segment(data = data.frame(x = rep(0.05, 4),
xend = rep(0.95, 4),
y = (1.5 + 0:3)/6,
yend = (1.5 + 0:3)/6,
linetype = 4:1),
aes(x = x, y = y, xend = xend, yend = yend, linetype = linetype), size = 1) +
scale_linetype_identity()
plot_grid(aes_pos, aes_shape, aes_size,
aes_color, aes_lwd, aes_ltp,
ncol = 3,
labels = c("position", "shape", "size", "color", "line width", "line type"),
label_x = 0.05, label_y = 0.95, hjust = 0, vjust = 1,
label_fontface = "plain")
```
All aesthetics fall into one of two groups: those that can represent continuous data and those that cannot. Continuous data values are values for which arbitrarily fine intermediates exist. For example, time duration is a continuous value. Between any two durations, say 50 seconds and 51 seconds, there are arbitrarily many intermediates, such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and so on. By contrast, number of persons in a room is a discrete value. A room can hold 5 persons or 6, but not 5.5. For the examples in Figure \@ref(fig:common-aesthetics), position, size, color, and line width can represent continuous data, but shape and line type can only represent discrete data.
Next we'll consider the types of data we may want to represent in our visualization. You may think of data as numbers, but numerical values are only two out of several types of data we may encounter. In addition to continuous and discrete numerical values, data can come in the form of discrete categories, in the form of dates or times, and as text (Table \@ref(tab:basic-data-types)). When data is numerical we also call it *quantitative* and when it is categorical we call it *qualitative*. Variables holding qualitative data are *factors*, and the different categories are called *levels*. The levels of a factor are most commonly without order (as in the example of "dog", "cat", "fish" in Table \@ref(tab:basic-data-types)), but factors can also be ordered, when there is an intrinsic order among the levels of the factor (as in the example of "good", "fair", "poor" in Table \@ref(tab:basic-data-types)).
Table: (\#tab:basic-data-types) Types of variables encountered in typical data visualization scenarios.

---------------------------------------------------------------------------------------------------------------------
Type of variable         Examples              Appropriate scale       Description
------------------------ --------------------- ----------------------- ----------------------------------------------
quantitative/numerical   1.3, 5.7, 83,         continuous              Arbitrary numerical values. These can be
continuous               1.5x10^-2^                                    integers, rational numbers, or real numbers.

quantitative/numerical   1, 2, 3, 4            discrete                Numbers in discrete units. These are most
discrete                                                               commonly but not necessarily integers.
                                                                       For example, the numbers 0.5, 1.0, 1.5 could
                                                                       also be treated as discrete if intermediate
                                                                       values cannot exist in the given dataset.

qualitative/categorical  dog, cat, fish        discrete                Categories without order. These are discrete
unordered                                                              and unique categories that have no inherent
                                                                       order. These variables are
                                                                       also called *factors*.

qualitative/categorical  good, fair, poor      discrete                Categories with order. These are discrete
ordered                                                                and unique categories with an order. For
                                                                       example, "fair" always lies between "good"
                                                                       and "poor". These variables are
                                                                       also called *ordered factors*.

date or time             Jan. 5 2018, 8:03am   continuous or discrete  Specific days and/or times. Also
                                                                       generic dates, such as July 4 or Dec. 25
                                                                       (without year).

text                     The quick brown fox   none, or discrete       Free-form text. Can be treated
                         jumps over the lazy                           as categorical if needed.
                         dog.
---------------------------------------------------------------------------------------------------------------------
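In R, the language used to draw the figures in this chapter, these variable types correspond roughly to built-in classes. The following is a minimal sketch in base R; the variable names (`duration`, `persons`, `pet`, `rating`, `when`) are made up for illustration and do not come from any dataset used in this chapter.

```r
# Illustrative values only; all variable names here are hypothetical.
duration <- c(1.3, 5.7, 83, 1.5e-2)           # quantitative, continuous
persons  <- c(1L, 2L, 3L, 4L)                 # quantitative, discrete
pet      <- factor(c("dog", "cat", "fish"))   # qualitative, unordered
rating   <- factor(c("good", "fair", "poor"),
                   levels = c("poor", "fair", "good"),
                   ordered = TRUE)            # qualitative, ordered
when     <- as.Date("2018-01-05")             # date

is.ordered(pet)         # FALSE
is.ordered(rating)      # TRUE
rating[2] > rating[3]   # TRUE: "fair" ranks above "poor"
levels(pet)             # sorted alphabetically by default: "cat" "dog" "fish"
```

Note that an unordered factor still stores its levels in some sequence (alphabetical by default); that sequence carries no meaning, whereas for an ordered factor comparisons such as `>` are defined.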
To examine a concrete example of these various types of data, take a look at Table \@ref(tab:data-example). It shows the first few rows of a dataset providing the daily temperature normals (average daily temperatures over a 30-year window) for four U.S. locations. This table contains five variables: month, day, location, station ID, and temperature (in degrees Fahrenheit). Month is an ordered factor, day is a discrete numerical value, location is an unordered factor, station ID is similarly an unordered factor, and temperature is a continuous numerical value.
Table: (\#tab:data-example) First 12 rows of a dataset listing daily temperature normals for four weather stations. Data source: NOAA.

Month   Day   Location     Station ID   Temperature
------- ----- ------------ ------------ -------------
Jan     1     Chicago      USW00014819  25.6
Jan     1     San Diego    USW00093107  55.2
Jan     1     Houston      USW00012918  53.9
Jan     1     Death Valley USC00042319  51.0
Jan     2     Chicago      USW00014819  25.5
Jan     2     San Diego    USW00093107  55.3
Jan     2     Houston      USW00012918  53.8
Jan     2     Death Valley USC00042319  51.2
Jan     3     Chicago      USW00014819  25.3
Jan     3     San Diego    USW00093107  55.3
Jan     3     Death Valley USC00042319  51.3
Jan     3     Houston      USW00012918  53.8
## Scales map data values onto aesthetics
To map data values onto aesthetics, we need to specify which data values correspond to which specific aesthetics values. For example, if our graphic has an *x* axis, then we need to specify which data values fall onto particular positions along this axis. Similarly, we may need to specify which data values are represented by particular shapes or colors. This mapping between data values and aesthetics values is created via *scales*. A scale defines a unique mapping between data and aesthetics (Figure \@ref(fig:basic-scales-example)). Importantly, a scale must be one-to-one, such that for each specific data value there is exactly one aesthetics value and vice versa. If a scale isn't one-to-one, then the data visualization becomes ambiguous.
(ref:basic-scales-example) Scales link data values to aesthetics. Here, the numbers 1 through 4 have been mapped onto a position scale, a shape scale, and a color scale. For each scale, each number corresponds to a unique position, shape, or color and vice versa.
```{r basic-scales-example, fig.width = 5.5, fig.asp = 0.3, fig.cap = '(ref:basic-scales-example)'}
df <- data.frame(x = c(1:4))
scale_num <- ggplot(df, aes(x)) +
geom_point(size = 3, color = "#0072B2", y = 1) +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1, labels = "position ") +
scale_x_continuous(limits = c(.7, 4.4), breaks = 1:5, labels = c("1", "2", "3", "4", "5"), name = NULL, position = "top") +
theme_minimal_grid() +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text = element_text(size = 14),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
plot.margin = margin(3.5, 20, 3.5, 3.5))
scale_color <- ggplot(df, aes(x, color = factor(x), fill = factor(x))) +
geom_point(size = 5, shape = 22, y = 1) +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1, labels = "color ") +
scale_x_continuous(limits = c(.7, 4.4), breaks = NULL) +
scale_color_manual(values = darken(c("#0082A6", "#4EBBB9", "#9CDFC2", "#D8F0CD"), .1), guide = "none") +
scale_fill_manual(values = c("#0082A6", "#4EBBB9", "#9CDFC2", "#D8F0CD"), guide = "none") +
theme_minimal_grid() +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 14),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
plot.margin = margin(3.5, 20, 3.5, 3.5))
scale_shape <- ggplot(df, aes(x, shape = factor(x))) +
geom_point(size = 4, color = "grey30", y = 1, fill = "grey80") +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1, labels = "shape ") +
scale_x_continuous(limits = c(.7, 4.4), breaks = NULL) +
scale_shape_manual(values = 21:24, guide = "none") +
theme_minimal_grid() +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 14),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
plot.margin = margin(3.5, 20, 3.5, 3.5))
# workaround so ggarrange doesn't produce empty plot
cur_dev <- grDevices::dev.cur()
grDevices::pdf(NULL)
scales_grob <- ggarrange(scale_num, scale_shape, scale_color, ncol = 1)
x <- grDevices::dev.off() # assign output to x to catch it
x <- grDevices::dev.set(cur_dev)
ggdraw(scales_grob)
#plot_grid(scale_num, scale_color, scale_shape, ncol = 1, align = 'v', rel_heights = c(1, 1, .7))
```
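The one-to-one requirement can be made concrete with a small sketch that treats a discrete scale as a lookup table from data values to aesthetic values. This is only an illustration in base R, not how any plotting library implements scales; the helper `is_one_to_one()` is a hypothetical name introduced here, and the shape codes 21 to 24 are the ones passed to `scale_shape_manual()` in the figure code.

```r
# A discrete scale sketched as a named vector: names are data values,
# elements are aesthetic values (here, point-shape codes).
data_values <- 1:4
shape_scale <- setNames(21:24, data_values)

# Hypothetical helper: TRUE if every data value has exactly one aesthetic
# value and no aesthetic value is shared by two data values.
is_one_to_one <- function(scale) {
  anyDuplicated(names(scale)) == 0 && anyDuplicated(unname(scale)) == 0
}

shape_scale[as.character(c(2, 4))]   # looks up shapes 22 and 24
is_one_to_one(shape_scale)           # TRUE

# An ambiguous mapping: data values 1 and 2 would be drawn identically.
ambiguous <- c("1" = 21, "2" = 21, "3" = 23, "4" = 24)
is_one_to_one(ambiguous)             # FALSE
```

With the ambiguous mapping, a reader who sees shape 21 in the plot cannot tell whether it stands for 1 or 2, which is exactly the ambiguity the one-to-one rule prevents.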
Let's put things into practice. We can take the dataset shown in Table \@ref(tab:data-example), map temperature onto the *y* axis, day of the year onto the *x* axis, location onto color, and visualize these aesthetics with solid lines. The result is a standard line plot showing the temperature normals at the four locations as they change during the year (Figure
\@ref(fig:temp-normals-vs-time)).
(ref:temp-normals-vs-time) Daily temperature normals for four selected locations in the U.S. Temperature is mapped to the *y* axis, day of the year to the *x* axis, and location to line color. Data source: NOAA.
```{r temp-normals-vs-time, fig.cap = '(ref:temp-normals-vs-time)'}
temps_long <- filter(ncdc_normals,
station_id %in% c(
"USW00014819", # Chicago, IL 60638
#"USC00516128", # Honolulu, HI 96813
#"USW00027502", # Barrow, AK 99723, coldest point in the US
"USC00042319", # Death Valley, CA 92328 hottest point in the US
"USW00093107", # San Diego, CA 92145
#"USC00427606" # Salt Lake City, UT 84103
"USW00012918" # Houston, TX 77061
)) %>%
mutate(location = fct_recode(factor(station_id),
"Chicago" = "USW00014819",
#"Honolulu, HI" = "USC00516128",
#"Barrow, AK" = "USW00027502",
"Death Valley" = "USC00042319",
"San Diego" = "USW00093107",
#"Salt Lake City, UT" = "USC00427606",
"Houston" = "USW00012918")) %>%
mutate(location = factor(location, levels = c("Death Valley", "Houston", "San Diego", "Chicago")))
ggplot(temps_long, aes(x = date, y = temperature, color = location)) +
geom_line(size = 1) +
scale_x_date(name = "month", limits = c(ymd("0000-01-01"), ymd("0001-01-04")),
breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
ymd("0000-10-01"), ymd("0001-01-01")),
labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(1/366, 0)) +
scale_y_continuous(limits = c(15, 110),
breaks = seq(20, 100, by = 20),
name = "temperature (°F)") +
scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
theme_minimal_grid() +
theme(legend.title.align = 0.5)
```
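Stripped of styling, the mapping used for this figure fits in a single `aes()` call. The sketch below assumes only that ggplot2 is installed; instead of the full dataset it uses the twelve rows shown in Table \@ref(tab:data-example), typed in by hand, and the data frame name `temps` is made up for this example.

```r
library(ggplot2)

# The twelve rows from the data-example table, entered by hand.
temps <- data.frame(
  day = rep(1:3, each = 4),
  location = rep(c("Chicago", "San Diego", "Houston", "Death Valley"),
                 times = 3),
  temperature = c(25.6, 55.2, 53.9, 51.0,
                  25.5, 55.3, 53.8, 51.2,
                  25.3, 55.3, 53.8, 51.3)
)

# Map day to x, temperature to y, and location to color,
# then draw each location as a line.
p <- ggplot(temps, aes(x = day, y = temperature, color = location)) +
  geom_line()
p
```

Everything else in the figure code (axis breaks, color palette, theme) adjusts how the scales look, not which variable is mapped to which aesthetic.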
Figure \@ref(fig:temp-normals-vs-time) is a fairly standard visualization for a temperature curve and likely the visualization most data scientists would intuitively choose first. However, it is up to us which variables we map onto which scales. For example, instead of mapping temperature onto the *y* axis and location onto color, we can do the opposite. Because now the key variable of interest (temperature) is shown as color, we need to show sufficiently large colored areas for the color to convey useful information. Therefore, for this visualization I have chosen squares instead of lines, one for each month and location, and I have colored them by the average temperature normal for each month (Figure \@ref(fig:four-locations-temps-by-month)).
(ref:four-locations-temps-by-month) Monthly normal mean temperatures for four locations in the U.S. Data source: NOAA.
```{r four-locations-temps-by-month, fig.width = 8.5, fig.asp = .3, fig.cap = '(ref:four-locations-temps-by-month)'}
month_names <- c("01" = "Jan", "02" = "Feb", "03" = "Mar", "04" = "Apr", "05" = "May", "06" = "Jun",
"07" = "Jul", "08" = "Aug", "09" = "Sep", "10" = "Oct", "11" = "Nov", "12" = "Dec")
mean_temps <- temps_long %>%
group_by(location, month) %>%
summarize(mean = mean(temperature)) %>%
ungroup() %>%
mutate(month = month_names[month]) %>%
mutate(month = factor(month, levels = unname(month_names)))
p <- ggplot(mean_temps, aes(x = month, y = location, fill = mean)) +
geom_tile(width = .95, height = 0.95) +
scale_fill_viridis_c(option = "B", begin = 0.15, end = 0.98,
name = "temperature (°F)") +
scale_y_discrete(name = NULL) +
coord_fixed(expand = FALSE) +
theme_half_open() +
theme(axis.line = element_blank(),
axis.ticks = element_blank(),
#axis.text.y = element_text(size = 14),
legend.title = element_text(size = 12))
# fix legend (make it centered)
ggdraw(align_legend(p))
```
I would like to emphasize that Figure \@ref(fig:four-locations-temps-by-month) uses two position scales (month along the *x* axis and location along the *y* axis) but neither is a continuous scale. Month is an ordered factor with 12 levels and location is an unordered factor with four levels. Therefore, the two position scales are both discrete. For discrete position scales, we generally place the different levels of the factor at an equal spacing along the axis. If the factor is ordered (as is the case here for month), then the levels need to be placed in the appropriate order. If the factor is unordered (as is the case here for location), then the order is arbitrary, and we can choose any order we want. I have ordered the locations from overall coldest (Chicago) to overall hottest (Death Valley) to generate a pleasant staggering of colors. However, I could have chosen any other order and the figure would have been equally valid.
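In R, the order of the levels of a factor is what determines placement along a discrete axis: as far as I am aware, ggplot2 places the levels at consecutive, equally spaced positions in level order. The base-R sketch below shows how the integer codes of a factor change when the levels are set explicitly; the level order is the one used in the figure code, and the variable names are made up for illustration.

```r
location <- c("Chicago", "San Diego", "Houston", "Death Valley")

# Default: levels are sorted alphabetically.
default_order <- factor(location)

# Explicit order, matching the levels set in the figure code.
chosen_order <- factor(
  location,
  levels = c("Death Valley", "Houston", "San Diego", "Chicago")
)

# The integer codes act as equally spaced positions 1, 2, 3, 4.
as.integer(default_order)   # 1 4 3 2
as.integer(chosen_order)    # 4 3 2 1
```

Because location is an unordered factor, either set of codes is a valid discrete position scale; the explicit levels only change where each location lands.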
Both Figures \@ref(fig:temp-normals-vs-time) and \@ref(fig:four-locations-temps-by-month) used three scales in total: two position scales and one color scale. This is a typical number of scales for a basic visualization, but we can use more than three scales at once. Figure \@ref(fig:mtcars-five-scale) uses five scales (two position scales, one color scale, one size scale, and one shape scale), and each scale represents a different variable from the dataset.
(ref:mtcars-five-scale) Fuel efficiency versus displacement, for 32 cars (1973--74 models). This figure uses five separate scales to represent data: (i) the *x* axis (displacement); (ii) the *y* axis (fuel efficiency); (iii) the color of the data points (power); (iv) the size of the data points (weight); and (v) the shape of the data points (number of cylinders). Four of the five variables displayed (displacement, fuel efficiency, power, and weight) are numerical continuous. The remaining one (number of cylinders) can be considered to be either numerical discrete or qualitative ordered. Data source: *Motor Trend*, 1974.
```{r mtcars-five-scale, fig.width = 6.7, fig.asp = .8, fig.cap = '(ref:mtcars-five-scale)'}
p_mtcars <- ggplot(mtcars, aes(disp, mpg, fill = hp, shape = factor(cyl), size = wt)) +
geom_point(color = "white") +
scale_shape_manual(values = c(23, 24, 21), name = "cylinders") +
scale_fill_continuous_carto(palette = "Emrld", name = "power (hp)", breaks = c(100, 200, 300)) +
xlab("displacement (cu. in.)") +
ylab("fuel efficiency (mpg)") +
guides(shape = guide_legend(override.aes = list(size = 4, fill = "#329D84")),
size = guide_legend(override.aes = list(shape = 21, fill = "#329D84"),
title = "weight (1000 lbs)")) +
theme_half_open() + background_grid() +
theme(#legend.title = element_text(size = 12),
legend.box.background = element_rect(fill = "white", color = "white"),
legend.position = "top",
legend.direction = "vertical",
legend.justification = "center",
legend.box.margin = margin(7, 7, 7, 7))
legend <- get_legend(align_legend(p_mtcars))
ggdraw() +
draw_plot(p_mtcars + theme(legend.position = "none")) +
draw_grob(legend, x = .36, y = .7, width = .7, height = .3)
```
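The five mappings can also be written in a much shorter form. The following sketch assumes only ggplot2 and the built-in `mtcars` dataset; it is a simplification of the figure code, not a reproduction of it: it maps power to `color` rather than `fill`, and it relies on default palettes, shapes, and legends.

```r
library(ggplot2)

# Five aesthetics, each carrying a different variable from mtcars.
p <- ggplot(mtcars, aes(x = disp,             # displacement -> x position
                        y = mpg,              # fuel efficiency -> y position
                        color = hp,           # power -> color
                        size = wt,            # weight -> point size
                        shape = factor(cyl))) +  # cylinders -> shape
  geom_point()
p
```

Converting `cyl` with `factor()` matters here: shape can only represent discrete data, so the number of cylinders has to be treated as a categorical variable rather than as a number.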