forked from jtr13/EDAV
# Chart: Histogram {#histo}
![](images/banners/banner_histogram.png)
## Overview
This section covers how to make histograms.
## tl;dr
Gimme a full-fledged example!
Here's an application of histograms that looks at how the beaks of Galapagos finches changed due to external factors:
```{r tldr-finch-example, echo=FALSE}
library(Sleuth3) # data
library(ggplot2) # plotting
# load data
finches <- Sleuth3::case0201
# finch histograms by year with overlaid density curves
# after_stat(density) replaces the deprecated ..density.. (needs ggplot2 >= 3.3.0)
ggplot(finches, aes(x = Depth, y = after_stat(density))) +
  # plotting
  geom_histogram(bins = 20, color = "#80593D", fill = "#9FC29F", boundary = 0) +
  geom_density(color = "#3D6480") +
  facet_wrap(~Year) +
  # formatting
  ggtitle("Severe Drought Led to Finches with Bigger Chompers",
          subtitle = "Beak Depth Density of Galapagos Finches by Year") +
  labs(x = "Beak Depth (mm)", caption = "Source: Sleuth3::case0201") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68"))
```
And here's the code:
```{r tldr-finch-code, eval=FALSE}
library(Sleuth3) # data
library(ggplot2) # plotting
# load data
finches <- Sleuth3::case0201
# finch histograms by year with overlaid density curves
# after_stat(density) replaces the deprecated ..density.. (needs ggplot2 >= 3.3.0)
ggplot(finches, aes(x = Depth, y = after_stat(density))) +
  # plotting
  geom_histogram(bins = 20, color = "#80593D", fill = "#9FC29F", boundary = 0) +
  geom_density(color = "#3D6480") +
  facet_wrap(~Year) +
  # formatting
  ggtitle("Severe Drought Led to Finches with Bigger Chompers",
          subtitle = "Beak Depth Density of Galapagos Finches by Year") +
  labs(x = "Beak Depth (mm)", caption = "Source: Sleuth3::case0201") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68"))
```
For more info on this dataset, type `?Sleuth3::case0201` into the console.
## Simple examples
Whoa whoa whoa! Much simpler please!
Let's use a very simple dataset:
```{r simple-example-hist-data}
# store data
x <- c(50, 51, 53, 55, 56, 60, 65, 65, 68)
```
### Histogram using base R
```{r base-r-hist}
# plot data
hist(x, col = "lightblue", main = "Base R Histogram of x")
```
The advantage of the base R histogram is that it is easy to set up. In truth, all you need to plot the data `x` is `hist(x)`, but we included a little color and a title to make it more presentable.
Full documentation on `hist()` can be found [here](https://www.rdocumentation.org/packages/graphics/versions/3.5.0/topics/hist){target="_blank"}.
### Histogram using ggplot2
```{r ggplot-hist}
# import ggplot
library(ggplot2)
# must store data as dataframe
df <- data.frame(x)
# plot data
ggplot(df, aes(x)) +
  geom_histogram(color = "grey", fill = "lightBlue",
                 binwidth = 5, center = 52.5) +
  ggtitle("ggplot2 histogram of x")
```
The ggplot2 version is a little more complicated on the surface, but you get more power and control as a result. **Note**: as shown above, ggplot2 expects a data frame, so if you get an error saying that "R doesn't know what to do", like this:
![ggplot dataframe error](images/ggplot_df_error.png)
make sure you are passing a data frame.
## Types of histograms
Use a histogram to show the distribution of *one continuous variable*. The y-scale can be represented in a variety of ways to express different results:
### Frequency or count
y = number of values that fall in each bin
### Relative frequency histogram
y = number of values that fall in each bin / total number of values
### Cumulative frequency histogram
y = total number of values <= (or <) right boundary of bin
### Density
y = relative frequency / binwidth
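
To make the four definitions concrete, here is a minimal base R sketch that computes each y-scale for the toy vector `x` from the simple examples. It lets `hist()` choose the breaks and only reads back its output; the variable names (`rel_freq`, `cum_freq`, `y_scales`, and so on) are made up for illustration.

```{r y-scale-sketch}
# same toy data as in the simple examples
x <- c(50, 51, 53, 55, 56, 60, 65, 65, 68)

# bin the data without drawing anything; hist() picks the breaks
h <- hist(x, plot = FALSE)

count    <- h$counts                   # frequency: values per bin
rel_freq <- count / sum(count)         # relative frequency
cum_freq <- cumsum(count)              # cumulative frequency
dens     <- rel_freq / diff(h$breaks)  # density: relative frequency / binwidth

# one row per bin, one column per y-scale
y_scales <- data.frame(
  left = head(h$breaks, -1),
  right = tail(h$breaks, -1),
  count, rel_freq, cum_freq, density = dens
)
y_scales
```

The relative frequencies sum to 1, and so do the bar areas (density times bin width) in the density version, which is why a density curve can be drawn on top of a density histogram.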
## Parameters
### Bin boundaries
Be mindful of the boundaries of the bins and whether a point will fall into the left or right bin if it is on a boundary.
```{r bin-boundaries}
# format layout (keep the old settings so they can be restored)
op <- par(mfrow = c(1, 2), las = 1)
# right closed
hist(x, col = "lightblue", ylim = c(0, 4),
     xlab = "right closed ex. (55, 60]", font.lab = 2)
# right open
hist(x, col = "lightblue", right = FALSE, ylim = c(0, 4),
     xlab = "right open ex. [55, 60)", font.lab = 2)
# restore the previous graphics settings
par(op)
```
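
The effect is easier to see in the counts than in the bars. The sketch below is only an illustration: it fixes the breaks explicitly (an assumption, so that both calls use the same bins) and compares the per-bin counts for right-closed and right-open bins.

```{r bin-boundary-counts}
x <- c(50, 51, 53, 55, 56, 60, 65, 65, 68)

# assumed breaks, set explicitly so both calls share the same bins
breaks <- seq(50, 70, by = 5)

# counts with right-closed bins, e.g. (55, 60]
right_closed <- hist(x, breaks = breaks, plot = FALSE)$counts
# counts with right-open bins, e.g. [55, 60)
right_open <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

# one row per convention, one column per bin
boundary_counts <- rbind(right_closed, right_open)
colnames(boundary_counts) <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")
boundary_counts
```

The values 55, 60, and 65 sit exactly on a boundary, so they are the ones that move to a neighboring bin when the convention changes; the total count stays the same.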
### Bin number
The default bin number of 30 in ggplot2 is not always ideal, so consider altering it if things are looking strange. You can specify the width explicitly with `binwidth` or provide the desired number of bins with `bins`.
```{r}
# default...note the console message about the default bin number
ggplot(finches, aes(x = Depth)) +
  geom_histogram() +
  ggtitle("Default with message about bin number")
```
Here are examples of changing the bins using the two ways described above:
```{r fixed-histograms-binwidth}
# using binwidth
p1 <- ggplot(finches, aes(x = Depth)) +
  geom_histogram(binwidth = 0.5, boundary = 6) +
  ggtitle("Changed binwidth value")
# using bins
p2 <- ggplot(finches, aes(x = Depth)) +
  geom_histogram(bins = 48, boundary = 6) +
  ggtitle("Changed bins value")
# format plot layout
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
```
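
The two arguments describe the same thing from different ends: over a fixed span, the bin width is the span divided by the number of bins. Here is a rough base R sketch of that arithmetic, using an assumed span of 50 to 70 and made-up variable names. ggplot2 chooses its own span and alignment, so its bins will not necessarily match these.

```{r bins-vs-binwidth-sketch}
# assumed span to be covered by the bins
span <- c(50, 70)

# option 1: fix the width and let the number of bins follow
binwidth <- 5
breaks_from_width <- seq(span[1], span[2], by = binwidth)
n_bins <- length(breaks_from_width) - 1

# option 2: fix the number of bins and let the width follow
bins <- 4
breaks_from_bins <- seq(span[1], span[2], length.out = bins + 1)
width_from_bins <- diff(span) / bins

# both routes give the same breaks when binwidth * bins equals the span
all.equal(breaks_from_width, breaks_from_bins)
```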
### Bin alignment
Make sure the axes reflect the true boundaries of the histogram. You can use `boundary` to specify the endpoint of any bin or `center` to specify the center of any bin; `ggplot2` will calculate where to place the rest of the bins. Also notice in the plots below that when the boundary is specified, the number of bins drops by one. This is because, by default, the bins are centered and extend beyond the range of the data.
```{r}
df <- data.frame(x)
# default alignment
ggplot(df, aes(x)) +
  geom_histogram(binwidth = 5,
                 fill = "lightBlue", col = "black") +
  ggtitle("Default Bin Alignment")
```
```{r alignment-fix}
# specify alignment with boundary
p3 <- ggplot(df, aes(x)) +
  geom_histogram(binwidth = 5, boundary = 60,
                 fill = "lightBlue", col = "black") +
  ggtitle("Bin Alignment Using boundary")
# specify alignment with center
p4 <- ggplot(df, aes(x)) +
  geom_histogram(binwidth = 5, center = 67.5,
                 fill = "lightBlue", col = "black") +
  ggtitle("Bin Alignment Using center")
# format layout
library(gridExtra)
grid.arrange(p3, p4, ncol = 2)
```
**Note**: Don't use both `boundary` *and* `center` for bin alignment. Just pick one.
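
To see why the `boundary = 60` and `center = 67.5` plots come out the same, it can help to work out the bin edges by hand. The sketch below uses a hypothetical helper, `bin_edges()`, which is not part of ggplot2. It only mimics the arithmetic, assuming that a `center` is equivalent to a boundary half a bin width away and that the default alignment centers bins on multiples of the bin width.

```{r bin-alignment-sketch}
x <- c(50, 51, 53, 55, 56, 60, 65, 65, 68)

# hypothetical helper (not a ggplot2 function): bin edges that cover x,
# spaced binwidth apart and lined up so that `boundary` is one of the edges
bin_edges <- function(x, binwidth, boundary = binwidth / 2) {
  lo <- boundary + floor((min(x) - boundary) / binwidth) * binwidth
  hi <- boundary + ceiling((max(x) - boundary) / binwidth) * binwidth
  seq(lo, hi, by = binwidth)
}

# assumed default: bins centered on multiples of binwidth
default_edges <- bin_edges(x, binwidth = 5)
# boundary = 60
boundary_edges <- bin_edges(x, binwidth = 5, boundary = 60)
# center = 67.5, rewritten as a boundary half a bin width below it
center_edges <- bin_edges(x, binwidth = 5, boundary = 67.5 - 5 / 2)

length(default_edges) - 1    # number of bins, default alignment
length(boundary_edges) - 1   # number of bins, aligned at 60
all.equal(boundary_edges, center_edges)
```

Under those assumptions the default alignment needs five bins (edges at 47.5, 52.5, ..., 72.5), while both aligned versions need four (edges at 50, 55, ..., 70). That matches the one-bin difference noted earlier in this section.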
## Theory
* For more info about histograms and continuous variables, check out [Chapter 3](http://www.gradaanwr.net/content/03-examining-continuous-variables/){target="_blank"} of the textbook.
## External resources
- [DataCamp ggplot2 Histograms Exercise](https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-4-geometries?ex=5){target="_blank"}: Simple interactive example of histograms with ggplot2
- [DataCamp Histogram with Basic R](https://www.datacamp.com/community/tutorials/make-histogram-basic-r){target="_blank"}: "Tutorial for new R users whom need an accessible and easy-to-understand resource on how to create their own histogram with basic R." 'Nuff said.
- [DataCamp Histogram with ggplot2](https://www.datacamp.com/community/tutorials/make-histogram-ggplot2){target="_blank"}: Great article on making histograms with ggplot2.
- [hist documentation](https://www.rdocumentation.org/packages/graphics/versions/3.5.0/topics/hist){target="_blank"}: base R histogram documentation page.
- [ggplot2 cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf){target="_blank"}: Always good to have close by.