```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(readr)
library(lubridate)
```
# Maximizing the data signal in visualizations {#background-grids}
With the rising popularity of the R package ggplot2, which uses a gray background grid as default, graphs with this style have become widespread. With apologies to ggplot2 author Hadley Wickham, for whom I have the utmost respect, I don't find this style particularly attractive. In general, I find that the gray background detracts from the actual data. As an example, consider Figure \@ref(fig:price-plot-ggplot-default), which shows the stock price of four major tech companies, indexed to their value in June 2012. The grid is too busy, and the gray background in the legend is distracting.
(ref:price-plot-ggplot-default) Stock price over time for four major tech companies. The stock price for each company has been normalized to equal 100 in June 2012.
```{r price-plot-ggplot-default, fig.cap='(ref:price-plot-ggplot-default)'}
price_plot <- ggplot(tech_stocks, aes(x=date, y=price_indexed, color=ticker)) +
  geom_line() +
  scale_color_manual(values=c("#000000", "#E69F00", "#56B4E9", "#009E73"),
                     name="",
                     breaks=c("FB", "GOOG", "MSFT", "AAPL"),
                     labels=c("Facebook", "Alphabet", "Microsoft", "Apple")) +
  scale_x_date(name="year",
               limits=c(ymd("2012-06-01"), ymd("2017-05-31")),
               expand=c(0,0)) +
  scale_y_continuous(name="stock price, indexed",
                     limits = c(0, 560),
                     expand=c(0,0))
stamp_ugly(price_plot + theme_gray(12))
```
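The indexing used in this figure can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `stocks`, its tickers, and its prices are made up, and the column names `price`, `index_price`, and `price_indexed` simply mirror the ones used in this chapter's plotting code. How the `tech_stocks` dataset itself was prepared is not shown here.

```r
# Hypothetical toy data: two tickers, three dates each (all values made up).
stocks <- data.frame(
  ticker = rep(c("AAA", "BBB"), each = 3),
  date   = rep(as.Date(c("2012-06-01", "2013-06-03", "2014-06-02")), times = 2),
  price  = c(20, 30, 50, 400, 440, 520)
)

# Price of each ticker on the reference date (assumed here: the earliest date).
ref_date <- min(stocks$date)
ref <- stocks[stocks$date == ref_date, c("ticker", "price")]
stocks$index_price <- ref$price[match(stocks$ticker, ref$ticker)]

# Rescale every price so that the reference date equals 100 for each ticker.
stocks$price_indexed <- 100 * stocks$price / stocks$index_price
```

After indexing, both toy tickers start at 100, so their relative growth can be compared on a shared *y* axis even though their raw prices differ by more than an order of magnitude.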
We could try to remove the grid altogether, but I think that is a worse option (Figure \@ref(fig:price-plot-no-grid)). Now the curves seem to just float in space, and it's difficult to see where they go. In addition, since all prices are indexed to 100 in June 2012, at a minimum this value should be marked in the plot. Thus, one option would be to add a thin horizontal line at *y* = 100 (Figure \@ref(fig:price-plot-refline)).
(ref:price-plot-no-grid) Indexed stock price over time for four major tech companies.
```{r price-plot-no-grid, fig.cap='(ref:price-plot-no-grid)'}
stamp_bad(price_plot + theme_half_open())
```
(ref:price-plot-refline) Indexed stock price over time for four major tech companies.
```{r price-plot-refline, fig.cap='(ref:price-plot-refline)'}
price_plot2 <- ggplot(tech_stocks, aes(x=date, y=price_indexed, color=ticker)) +
  geom_hline(yintercept = 100, size = 0.5, color="grey70") +
  geom_line() +
  scale_color_manual(values=c("#000000", "#E69F00", "#56B4E9", "#009E73"),
                     name="",
                     breaks=c("FB", "GOOG", "MSFT", "AAPL"),
                     labels=c("Facebook", "Alphabet", "Microsoft", "Apple")) +
  scale_x_date(name="year",
               limits=c(ymd("2012-06-01"), ymd("2017-05-31")),
               expand=c(0,0)) +
  scale_y_continuous(name="stock price, indexed",
                     limits = c(0, 560),
                     expand=c(0,0)) +
  theme_half_open()
stamp_phantom(price_plot2)
```
Alternatively, we can use just a minimal grid. In particular, for a plot where we are primarily interested in the change in *y* values, vertical grid lines are not needed. Moreover, grid lines positioned at only the major axis ticks will often be sufficient. Finally, the axis line can be omitted or made very thin (Figure \@ref(fig:price-plot-hgrid)).
(ref:price-plot-hgrid) Indexed stock price over time for four major tech companies.
```{r price-plot-hgrid, fig.cap='(ref:price-plot-hgrid)'}
stamp_phantom(price_plot + theme_minimal_hgrid())
```
For such a minimal grid, we generally draw the lines orthogonally to the direction along which the numbers of interest vary. Therefore, if instead of plotting the stock price over time we plot the five-year increase as horizontal bars, then we will want to use vertical lines instead (Figure \@ref(fig:price-increase)).
(ref:price-increase) Percent increase in stock price from June 2012 to June 2017, for four major tech companies.
```{r price-increase, fig.cap='(ref:price-increase)'}
perc_increase <- filter(tech_stocks, date == ymd("2017-06-01")) %>%
  mutate(perc=100*(price-index_price)/index_price,
         label=paste(as.character(round(perc)), "%", sep="")) %>%
  arrange(perc)
perc_increase$ticker <- factor(perc_increase$ticker, levels=perc_increase$ticker)
perc_plot <- ggplot(perc_increase, aes(x=ticker, y=perc)) +
  geom_col(fill="#56B4E9") +
  geom_text(aes(label=label), color="white", hjust=1.1, size=5) +
  scale_y_continuous(name="percent increase",
                     limits=c(0, 499),
                     expand=c(0, 0)) +
  scale_x_discrete(name="",
                   breaks=c("FB", "GOOG", "MSFT", "AAPL"),
                   labels=c("Facebook", "Alphabet", "Microsoft", "Apple")) +
  coord_flip() +
  theme_minimal_vgrid() +
  theme(axis.title.y=element_blank()) # remove the unnecessary space generated by an empty label
stamp_phantom(perc_plot)
```
```{block type='rmdtip', echo=TRUE}
Grid lines that run perpendicular to the key variable of interest tend to be the most useful.
```
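One way to apply this rule with plain ggplot2 theme elements is sketched below. The helper `minimal_grid()` and its argument `values_along` are hypothetical names introduced only for this illustration; they are not part of ggplot2 or of the themes used for the figures in this chapter. The bar data are made up. The sketch also assumes a ggplot2 version (3.3.0 or later) in which `geom_col()` detects horizontal bars from the aesthetics.

```r
library(ggplot2)

# Hypothetical helper: keep only the major grid lines that run perpendicular
# to the direction along which the values of interest vary.
minimal_grid <- function(values_along = c("y", "x")) {
  values_along <- match.arg(values_along)
  # Values that vary along y are read off horizontal lines, which ggplot2
  # controls via panel.grid.major.y, so we blank the lines of the other axis.
  blank_major <- if (values_along == "y") "panel.grid.major.x" else "panel.grid.major.y"
  hide <- list(element_blank(), element_blank())
  names(hide) <- c(blank_major, "panel.grid.minor")
  theme_minimal() + do.call(theme, hide)
}

# Made-up percent increases drawn as horizontal bars: the values vary along x,
# so only the vertical grid lines are kept.
increase <- data.frame(
  company = factor(c("A", "B", "C"), levels = c("A", "B", "C")),
  perc    = c(120, 260, 410)
)
p <- ggplot(increase, aes(x = perc, y = company)) +
  geom_col(fill = "#56B4E9") +
  minimal_grid(values_along = "x")
```

For a time series such as the indexed stock prices, the same helper would be called with `values_along = "y"`, which leaves only the horizontal lines.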
Background grids along both axis directions can make sense, however, for scatter plots where there is no primary axis of interest. This presentation frequently looks best without axis lines. Figure \@ref(fig:Aus-athletes-grid) provides an example.
(ref:Aus-athletes-grid) Percent bodyfat versus height in professional male Australian athletes. (Each point represents one athlete.)
```{r Aus-athletes-grid, fig.width = 7, fig.cap='(ref:Aus-athletes-grid)'}
male_Aus <- filter(Aus_athletes, sex=="m") %>%
  filter(sport %in% c("basketball", "field", "swimming", "track (400m)",
                      "track (sprint)", "water polo")) %>%
  mutate(sport = case_when(sport == "track (400m)" ~ "track",
                           sport == "track (sprint)" ~ "track",
                           TRUE ~ sport))
male_Aus$sport <- factor(male_Aus$sport,
                         levels = c("field", "water polo", "basketball", "swimming", "track"))
ggplot(male_Aus, aes(x=height, y=pcBfat, color=sport, fill=sport, shape=sport)) +
  geom_point(size = 2.5) +
  scale_shape_manual(values = 21:25) +
  scale_color_OkabeIto(order=c(2, 1, 3, 4, 5), darken = 0.3) +
  scale_fill_OkabeIto(order=c(2, 1, 3, 4, 5), darken = 0.1, alpha = 0.7) +
  ylab("% bodyfat") +
  theme_minimal_grid()
```
For figures where the relevant comparison is the *x* = *y* line, I prefer to draw a diagonal line rather than a grid. For example, consider Figure \@ref(fig:echave-et-al), which compares two sets of correlations for 209 protein structures. By drawing the diagonal line, we can see immediately which correlations are systematically stronger. The same observation is much harder to make when the figure has a background grid instead (Figure \@ref(fig:echave-et-al-grid)). Thus, even though Figure \@ref(fig:echave-et-al-grid) looks pleasing, I label it as bad.
(ref:echave-et-al) Correlations between evolutionary conservation and structural features of sites in 209 proteins. Along the *y* axis, we plot the correlation between evolutionary conservation (measured as evolutionary rate) at individual sites in a protein and the relative solvent accessibility (RSA) of those sites in the protein structure. Along the *x* axis, we plot the correlation between rate and weighted contact number (WCN), a measure for the density of contacts of a site in the protein structure. Each point represents one protein. We see that evolutionary conservation and structural features are highly correlated in some proteins and not very much in others. We also see that WCN, on average, yields somewhat stronger correlations than RSA does. Adapted from @Echave-et-al-2016.
```{r echave-et-al, fig.cap='(ref:echave-et-al)'}
cor_table <- read_csv("datasets/Echave_et_al_2016_NRG_correlations.csv")
# cor.WCN is plotted negated; the x axis labels restore the original sign
nrg_plot <- ggplot(cor_table, aes(y=cor.RSA, x=-cor.WCN)) +
  geom_abline(color="#999999") +
  geom_point(shape=21, color="black", fill="#0072B2", size=2, stroke=0.5) +
  scale_x_continuous(limits = c(0.25, 0.78), breaks = c(.3, .5, .7), labels = c("-0.3", "-0.5", "-0.7")) +
  scale_y_continuous(limits = c(0.25, 0.78), breaks = c(.3, .5, .7)) +
  coord_fixed() +
  xlab("rate–WCN correlation") +
  ylab("rate–RSA correlation") +
  labs(title = "") + # add an empty title as a simple trick to move the plot downwards
  theme_half_open()
stamp_phantom(plot_grid(nrg_plot, NULL, nrow=1, rel_widths=c(1, 0.2)))
```
(ref:echave-et-al-grid) Correlations between evolutionary conservation and structural features of sites in 209 proteins. By plotting this dataset against a background grid, the systematic shift of all points away from the *x* = *y* line is obscured.
```{r echave-et-al-grid, fig.cap='(ref:echave-et-al-grid)'}
nrg_plot2 <- ggplot(cor_table, aes(y=cor.RSA, x=-cor.WCN)) +
  #geom_abline(color="#999999") +
  geom_point(shape=21, color="black", fill="#0072B2", size=2, stroke=0.5) +
  scale_x_continuous(limits = c(0.25, 0.78), breaks = c(.3, .5, .7), labels = c("-0.3", "-0.5", "-0.7")) +
  scale_y_continuous(limits = c(0.25, 0.78), breaks = c(.3, .5, .7)) +
  coord_fixed() +
  xlab("rate–WCN correlation") +
  ylab("rate–RSA correlation") +
  labs(title = "") + # add an empty title as a simple trick to move the plot downwards
  theme_half_open() +
  background_grid(size.major=0.5)
stamp_bad(plot_grid(nrg_plot2, NULL, nrow=1, rel_widths=c(1, 0.2)))
```
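The diagonal-line approach can be reduced to the following minimal sketch. The data here are simulated stand-ins, not the correlations from the protein dataset, and the names `cors` and `share_below` are introduced only for this illustration. With no arguments for slope or intercept, `geom_abline()` draws a line with slope 1 and intercept 0, which is the *x* = *y* line.

```r
library(ggplot2)

# Simulated stand-in data: 209 paired values in which the x quantity tends to
# be slightly larger than the y quantity (all values are made up).
set.seed(1)
x <- runif(209, min = 0.3, max = 0.75)
cors <- data.frame(x = x, y = x - 0.05 + rnorm(209, sd = 0.04))

# The default geom_abline() is the x = y line; coord_fixed() gives both axes
# the same scale so that the line is drawn at 45 degrees.
p <- ggplot(cors, aes(x = x, y = y)) +
  geom_abline(color = "#999999") +
  geom_point(shape = 21, color = "black", fill = "#0072B2") +
  coord_fixed()

# Numeric counterpart of the visual impression: share of points below the line.
share_below <- mean(cors$y < cors$x)
```

A share well above one half corresponds to what the diagonal makes visible at a glance: most points fall on one side of the line, so one quantity is systematically larger than the other.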
In summary, there is no simple choice of background grid that always works. I encourage you to think carefully about which specific grid or guidelines are most informative for the plot you are making, and to only show those. In general, less is more. Too many or overly thick and dark grid lines can distract your reader's attention away from the data you want to show.