```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(forcats)
library(ggridges) # for geom_density_line
```
# Avoid line drawings {#avoid-line-drawings}
Whenever possible, visualize your data with solid, colored shapes rather than with lines that outline those shapes. Solid shapes are more easily perceived, are less likely to create visual artifacts or optical illusions, and convey amounts more immediately than outlines do. In my experience, visualizations using solid shapes are both clearer and more pleasant to look at than equivalent versions that use line drawings. Thus, I avoid line drawings as much as possible. However, I want to emphasize that this recommendation does not supersede the principle of proportional ink (Chapter \@ref(proportional-ink)).
Line drawings have a long history in the field of data visualization because throughout most of the 20th century, scientific visualizations were drawn by hand and had to be reproducible in black-and-white. This precluded the use of areas filled with solid colors, including solid gray-scale fills. Instead, filled areas were sometimes simulated by applying hatch, cross-hatch, or stipple patterns. Early plotting software imitated the hand-drawn simulations and similarly made extensive use of line drawings, dashed or dotted line patterns, and hatching. While modern visualization tools and modern reproduction and publishing platforms have none of the earlier limitations, many plotting applications still default to outlines and empty shapes rather than filled areas. To raise your awareness of this issue, here I'll show you several examples of the same figures drawn with both lines and filled shapes.
The most common and at the same time most inappropriate use of line drawings is seen in histograms and bar plots. The problem with bars drawn as outlines is that it is not immediately apparent which side of any given line is inside a bar and which side is outside. As a consequence, in particular when there are gaps between bars, we end up with a confusing visual pattern that detracts from the main message of the figure (Figure \@ref(fig:titanic-ages-lines)). Filling the bars with a light color, or with gray if color reproduction is not possible, avoids this problem (Figure \@ref(fig:titanic-ages-filled)).
(ref:titanic-ages-lines) Histogram of the ages of Titanic passengers, drawn with empty bars. The empty bars create a confusing visual pattern. In the center of the histogram, it is difficult to tell which parts are inside of bars and which parts are outside.
```{r titanic-ages-lines, fig.cap='(ref:titanic-ages-lines)'}
titanic <- titanic_all
age_hist_3 <- data.frame(
  age = (1:25)*3 - 1.5,
  count = hist(titanic$age, breaks = (0:25)*3 + .01, plot = FALSE)$counts
)
h3_bad <- ggplot(age_hist_3, aes(x = age, y = count)) +
  geom_col(width = 2., fill = "transparent", color = "black") +
scale_y_continuous(limits = c(0, 86), expand = c(0, 0), breaks = 25*(0:5)) +
scale_x_continuous(limits = c(0, 75), expand = c(0, 0)) +
theme_half_open() +
background_grid(major = "y", minor = "none")
stamp_bad(h3_bad)
```
(ref:titanic-ages-filled) The same histogram as in Figure \@ref(fig:titanic-ages-lines), now drawn with filled bars. The shape of the age distribution is much more easily discernible in this variation of the figure.
```{r titanic-ages-filled, fig.cap='(ref:titanic-ages-filled)'}
h3_good <- ggplot(age_hist_3, aes(x = age, y = count)) + geom_col(width = 2.7, fill = "#56B4E9") +
scale_y_continuous(limits = c(0, 86), expand = c(0, 0), breaks = 25*(0:5)) +
scale_x_continuous(limits = c(0, 75), expand = c(0, 0)) +
theme_half_open() +
background_grid(major = "y", minor = "none")
stamp_phantom(h3_good)
```
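The contrast between outlined and filled bars can be reproduced with a few lines of ggplot2. The following is a minimal sketch, not the code behind the figures above: it assumes only that ggplot2 is installed, and the data frame `ages` is a hypothetical, simulated stand-in for the Titanic passenger ages.

```r
library(ggplot2)

# Hypothetical stand-in data (simulated); not the Titanic data set used above
set.seed(1)
ages <- data.frame(age = rgamma(500, shape = 4, scale = 8))

# Outline-only bars: transparent fill, black border
p_outline <- ggplot(ages, aes(x = age)) +
  geom_histogram(binwidth = 3, fill = "transparent", color = "black")

# Filled bars: solid light-blue fill, thin white border to separate the bars
p_filled <- ggplot(ages, aes(x = age)) +
  geom_histogram(binwidth = 3, fill = "#56B4E9", color = "white")
```

Printing `p_outline` and `p_filled` side by side should show the same effect as the two figures above: only the `fill` and `color` arguments differ, yet the filled version makes the shape of the distribution much easier to see.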
Next, let's take a look at an old-school density plot. I'm showing density estimates for the sepal-length distributions of three species of iris, drawn entirely in black-and-white as a line drawing (Figure \@ref(fig:iris-densities-lines)). The distributions are shown just by their outlines, and because the figure is in black-and-white, we're using different line styles to distinguish them. This figure has two main problems. First, the dashed line styles do not provide a clear separation between the area under the curve and the area above it. While our visual system is quite good at connecting the individual line elements into a continuous line, the dashed lines nevertheless look porous and do not serve as a strong boundary of the enclosed area. Second, because the lines intersect and the areas they enclose are not shaded, we can perceive six distinct shapes and we need to expend some additional mental energy to merge the correct shapes together. This effect would have been even stronger had I used solid rather than dashed lines for all three distributions.
(ref:iris-densities-lines) Density estimates of the sepal lengths of three different iris species. The broken line styles used for versicolor and virginica detract from the perception that the areas under the curves are distinct from the areas above them.
```{r iris-densities-lines, fig.cap='(ref:iris-densities-lines)'}
# compute densities for sepal lengths
iris_dens <- group_by(iris, Species) %>%
do(ggplot2:::compute_density(.$Sepal.Length, NULL)) %>%
rename(Sepal.Length = x)
# get the maximum values
iris_max <- filter(iris_dens, density == max(density)) %>%
ungroup() %>%
mutate(hjust = c(0, 0.4, 0),
vjust = c(1, 0, 1),
nudge_x = c(0.11, 0, 0.2),
nudge_y = c(-0.02, 0.02, -0.02)
)
iris_lines <- ggplot(iris_dens, aes(x = Sepal.Length, y = density, linetype = Species)) +
geom_density_line(stat = "identity", alpha = 0., size = 0.75) +
geom_text(data = iris_max, aes(label = Species, hjust = hjust, vjust = vjust,
x = Sepal.Length + nudge_x,
y = density + nudge_y),
inherit.aes = FALSE, size = 12/.pt) +
scale_linetype_manual(values = c(1, 3, 4),
breaks = c("virginica", "versicolor", "setosa"),
guide = "none") +
scale_x_continuous(expand = c(0, 0), name = "sepal length") +
scale_y_continuous(limits = c(0, 1.5), expand = c(0, 0)) +
theme_half_open()
stamp_bad(iris_lines)
```
We can attempt to address the problem of porous boundaries by using colored lines rather than dashed lines (Figure \@ref(fig:iris-densities-colored-lines)). However, the density areas in the resulting plot still have little visual presence. Overall, I find the version with filled areas (Figure \@ref(fig:iris-densities-filled)) the clearest and most intuitive. It is important, however, to make the filled areas partially transparent, so that the complete distribution for each species is visible.
(ref:iris-densities-colored-lines) Density estimates of the sepal lengths of three different iris species. By using solid, colored lines we have solved the problem of Figure \@ref(fig:iris-densities-lines) that the areas below and above the lines seem to be connected. However, we still don't have a strong sense of the size of the area under each curve.
```{r iris-densities-colored-lines, fig.cap='(ref:iris-densities-colored-lines)'}
iris_colored_lines <- ggplot(iris_dens, aes(x = Sepal.Length, y = density, color = Species)) +
geom_density_line(stat = "identity", alpha = 0., size = 1) +
geom_text(data = iris_max, aes(label = Species, hjust = hjust, vjust = vjust,
x = Sepal.Length + nudge_x,
y = density + nudge_y),
inherit.aes = FALSE, size = 12/.pt) +
scale_color_manual(values = c("#56B4E9", "#E69F00", "#009E73"),
breaks = c("virginica", "versicolor", "setosa"),
guide = "none") +
scale_x_continuous(expand = c(0, 0), name = "sepal length") +
scale_y_continuous(limits = c(0, 1.5), expand = c(0, 0)) +
theme_half_open()
stamp_ugly(iris_colored_lines)
```
(ref:iris-densities-filled) Density estimates of the sepal lengths of three different iris species, shown as partially transparent shaded areas.
```{r iris-densities-filled, fig.cap='(ref:iris-densities-filled)'}
iris_filled <- ggplot(iris_dens, aes(x = Sepal.Length, y = density, fill = Species, color = Species)) +
geom_density_line(stat = "identity") +
geom_text(data = iris_max, aes(label = Species, hjust = hjust, vjust = vjust,
x = Sepal.Length + nudge_x,
y = density + nudge_y),
inherit.aes = FALSE, size = 12/.pt) +
scale_color_manual(values = darken(c("#56B4E9", "#E69F00", "#009E73"), 0.3),
breaks = c("virginica", "versicolor", "setosa"),
guide = "none") +
scale_fill_manual(values = c("#56B4E980", "#E69F0080", "#009E7380"),
breaks = c("virginica", "versicolor", "setosa"),
guide = "none") +
scale_x_continuous(expand = c(0, 0), name = "sepal length") +
scale_y_continuous(limits = c(0, 1.5), expand = c(0, 0)) +
theme_half_open()
stamp_phantom(iris_filled)
```
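Partially transparent fills can be specified in two equivalent ways: through an `alpha` value, or through an eight-digit hex color whose last two digits encode the opacity. The sketch below illustrates both. It uses the standard `geom_density()` from ggplot2 rather than the `geom_density_line()` used for the figures above, so it is a simplified variant and not the figure code itself.

```r
library(ggplot2)

# The suffix "80" in "#56B4E980" is the alpha channel: hex 80 = 128 of 255,
# i.e., roughly 50% opacity
alpha_channel <- col2rgb("#56B4E980", alpha = TRUE)["alpha", 1]

# Overlapping densities with half-transparent fills, so that each
# distribution remains visible where the curves overlap
p_dens <- ggplot(iris, aes(x = Sepal.Length, fill = Species, color = Species)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(
    values = c(setosa = "#009E73", versicolor = "#E69F00", virginica = "#56B4E9")
  ) +
  scale_color_manual(
    values = c(setosa = "#009E73", versicolor = "#E69F00", virginica = "#56B4E9")
  ) +
  scale_x_continuous(name = "sepal length")
```

Setting `alpha = 0` in the same call would reduce the plot to the outline-only version, which is a convenient way to compare the two styles on your own data.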
Line drawings also arise in the context of scatter plots, when different point types are drawn as open circles, triangles, crosses, etc. As an example, consider Figure \@ref(fig:mpg-linespoints). The figure contains a lot of visual noise, and the different point types do not strongly separate from each other. Drawing the same figure with solidly colored shapes addresses this issue (Figure \@ref(fig:mpg-filledpoints)).
(ref:mpg-linespoints) City fuel economy versus engine displacement, for cars with front-wheel drive (FWD), rear-wheel drive (RWD), and all-wheel drive (4WD). The different point styles, all black-and-white line-drawn symbols, create substantial visual noise and make it difficult to read the figure.
```{r mpg-linespoints, fig.width=5.5, fig.asp=.7416, fig.cap='(ref:mpg-linespoints)'}
mpg_linespoints <- ggplot(mpg, aes(y=cty, x=displ, shape=drv)) +
geom_point(size = 2,
position = position_jitter(width = 0.01 * diff(range(mpg$displ)),
height = 0.01 * diff(range(mpg$cty)),
seed = 7384)) +
ylab("fuel economy (mpg)") +
xlab("displacement (l)") +
scale_shape_manual(values=c(4, 1, 2),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
theme(legend.position = c(.7, .8))
stamp_bad(mpg_linespoints)
```
(ref:mpg-filledpoints) City fuel economy versus engine displacement. By using both different colors and different solid shapes for the different drive-train variants, this figure clearly separates the drive-train variants while remaining reproducible in gray scale if needed.
```{r mpg-filledpoints, fig.width=5.5, fig.asp=.7416, fig.cap='(ref:mpg-filledpoints)'}
mpg_filledpoints <- ggplot(mpg, aes(y = cty, x = displ, color = drv, fill = drv, shape = drv)) +
geom_point(size = 3,
position = position_jitter(width = 0.01 * diff(range(mpg$displ)),
height = 0.01 * diff(range(mpg$cty)),
seed = 7384)) +
ylab("fuel economy (mpg)") +
xlab("displacement (l)") +
scale_color_manual(values = c("#202020", "#E69F00", "#56B4E9"),
name = "drive train",
breaks = c("f", "r", "4"),
labels = c("FWD", "RWD", "4WD")) +
scale_fill_manual(values = c("#20202080", "#E69F0080", "#56B4E980"),
name = "drive train",
breaks = c("f", "r", "4"),
labels = c("FWD", "RWD", "4WD")) +
scale_shape_manual(values = c(22, 21, 25),
name = "drive train",
breaks = c("f", "r", "4"),
labels = c("FWD", "RWD", "4WD")) +
theme_half_open() +
theme(legend.position = c(.7, .8))
stamp_phantom(mpg_filledpoints)
```
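The key ingredient in the filled version is the choice of point shapes. In R, shapes 21 through 25 have both an outline and an interior, so they respond to the `color` and `fill` aesthetics separately, whereas shapes such as 1, 2, and 4 are pure line drawings. A minimal sketch, assuming only ggplot2 and its bundled `mpg` data set, and omitting the jitter and legend placement of the figure above:

```r
library(ggplot2)

# Shapes 21-25 are fillable: `color` sets the outline, `fill` the interior
p_points <- ggplot(mpg, aes(x = displ, y = cty, color = drv, fill = drv, shape = drv)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c("f" = 22, "r" = 21, "4" = 25)) +
  scale_color_manual(values = c("f" = "#202020", "r" = "#E69F00", "4" = "#56B4E9")) +
  # the "80" suffix makes the interiors about 50% transparent
  scale_fill_manual(values = c("f" = "#20202080", "r" = "#E69F0080", "4" = "#56B4E980")) +
  xlab("displacement (l)") +
  ylab("fuel economy (mpg)")
```

Because the three scales are keyed by the same `drv` values, each drive train gets one consistent combination of shape, outline color, and fill.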
I strongly prefer solid points over open points, because the solid points have much more visual presence. The argument that I sometimes hear in favor of open points is that they help with overplotting, since the empty areas in the middle of each point allow us to see other points that may be lying underneath. In my opinion, the benefit from being able to see overplotted points does not, in general, outweigh the detriment from the added visual noise of open symbols. There are other approaches for dealing with overplotting; see Chapter \@ref(overlapping-points) for some suggestions.
Finally, let's consider boxplots. Boxplots are commonly drawn with empty boxes, as in Figure \@ref(fig:lincoln-weather-box-empty). I prefer a light shading for the box, as in Figure \@ref(fig:lincoln-weather-box-filled). The shading separates the box more clearly from the plot background, and it helps in particular when we're showing many boxplots right next to each other, as is the case in Figures \@ref(fig:lincoln-weather-box-empty) and \@ref(fig:lincoln-weather-box-filled). In Figure \@ref(fig:lincoln-weather-box-empty), the large number of boxes and lines can again create the illusion of background areas outside of boxes being actually on the inside of some other shape, just as we saw in Figure \@ref(fig:titanic-ages-lines). This problem is eliminated in Figure \@ref(fig:lincoln-weather-box-filled). I have sometimes heard the critique that shading the inside of the box gives too much weight to the center 50% of the data, but I don't buy that argument. It is inherent to the boxplot, shaded box or not, to give more weight to the center 50% of the data than to the rest. If you don't want this emphasis, then don't use a boxplot. Instead, use a violin plot, jittered points, or a sina plot (Chapter \@ref(boxplots-violins)).
(ref:lincoln-weather-box-empty) Distributions of daily mean temperatures in Lincoln, Nebraska, in 2016. Boxes are drawn in the traditional way, without shading.
```{r lincoln-weather-box-empty, fig.cap='(ref:lincoln-weather-box-empty)'}
# abbreviate the month names and reverse their order
lincoln_df <- lincoln_weather %>%
  mutate(
    month_short = fct_recode(
      Month,
      Jan = "January",
      Feb = "February",
      Mar = "March",
      Apr = "April",
      May = "May",
      Jun = "June",
      Jul = "July",
      Aug = "August",
      Sep = "September",
      Oct = "October",
      Nov = "November",
      Dec = "December"
    ),
    month_short = fct_rev(month_short)
  )
lincoln_box_empty <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)) +
geom_boxplot() + xlab("Month")
lincoln_box_empty
```
(ref:lincoln-weather-box-filled) Distributions of daily mean temperatures in Lincoln, Nebraska, in 2016. By giving the boxes a light gray shading, we can make them stand out better against the background.
```{r lincoln-weather-box-filled, fig.cap='(ref:lincoln-weather-box-filled)'}
lincoln_box_filled <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)) +
geom_boxplot(fill = 'grey90') + xlab("Month")
lincoln_box_filled
```
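To try the shaded-box variant without the weather data used above, the following sketch uses the `airquality` data set that ships with R as a stand-in (daily temperatures in New York, May through September 1973). The data frame `aq` and its `month` column are names introduced here for illustration only. The sketch also includes a violin plot as one of the alternatives for readers who do not want the emphasis on the center 50% of the data.

```r
library(ggplot2)

# Stand-in data: label the numeric months 5-9 with their abbreviations
aq <- transform(
  airquality,
  month = factor(month.abb[Month], levels = month.abb[5:9])
)

# Boxplot with a light gray fill instead of empty boxes
p_box <- ggplot(aq, aes(x = month, y = Temp)) +
  geom_boxplot(fill = "grey90") +
  labs(x = "month", y = "temperature (°F)")

# Alternative without the boxplot's emphasis on the central 50% of the data
p_violin <- ggplot(aq, aes(x = month, y = Temp)) +
  geom_violin(fill = "grey90") +
  labs(x = "month", y = "temperature (°F)")
```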