forked from jtr13/EDAV
-
Notifications
You must be signed in to change notification settings - Fork 0
/
iris.Rmd
executable file
·330 lines (226 loc) · 13.4 KB
/
iris.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
# (PART) Walkthroughs (Red) {-}
# Walkthrough: Iris Example {#iris}
![](images/banners/banner_iris.png)
## Overview
This example goes through some work with the `iris` dataset to get to a finished scatterplot that is ready to present.
### tl;dr
Here's what we end up with:
```{r goal}
library(ggplot2)
base_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") +
xlab("Sepal Length (cm)") +
ylab("Sepal Width (cm)") +
ggtitle("Sepal Dimensions in Different Species of Iris Flowers")
base_plot + theme_minimal()
```
Wondering how we got there? Read on.
### Packages
* [`ggplot2`](https://www.rdocumentation.org/packages/ggplot2){target="_blank"}
* [`dplyr`](https://www.rdocumentation.org/packages/dplyr){target="_blank"}
* [`stats`](https://www.rdocumentation.org/packages/stats){target="_blank"}
* [`Base`](https://www.rdocumentation.org/packages/base){target="_blank"}
* [`datasets`](https://www.rdocumentation.org/packages/datasets){target="_blank"}
* [(`gridExtra`)](https://www.rdocumentation.org/packages/gridExtra){target="_blank"}
### Techniques
* Keyboard Shortcuts
* Viewing Data Structure/Dimensions/etc.
* Accessing Documentation
* Plotting with `ggplot2`
* Layered Nature of `ggplot2`/Grammar of Graphics
* Mapping aesthetics in `ggplot2`
* Overlapping Data: alpha and jitter
* Presenting Graphics
* Themes
***
## Quick note on doing it the lazy way
Shortcuts are your best friend to get work done faster. And they are easy to find.
In the toolbar:
`Tools > Keyboard Shortcuts Help` OR `⌥⇧K`
Some good ones:
* Insert assignment operator (`<-`): Alt/Option+-
* Insert pipe (`%>%`): Ctrl/Cmd+Shift+M
* Comment Code: Ctrl/Cmd+Shift+C
* Run current line/selection: Ctrl/Cmd+Enter
* Re-run previous region: Ctrl/Cmd+Shift+P
Be on the lookout for things you do often and try to see if there is a faster way to do them.
Additionally, the RStudio IDE can be a little daunting, but it is full of useful tools that you can read about in [this cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf){target="_blank"} or go through with this DataCamp course: [Part 1](https://www.datacamp.com/courses/working-with-the-rstudio-ide-part-1){target="_blank"}, [Part 2](https://www.datacamp.com/courses/working-with-the-rstudio-ide-part-2){target="_blank"}.
Okay, now let's get to it...
***
## Viewing data
Let's start with loading the package so we can get the data as a dataframe.
```{r import_data}
library(datasets)
class(iris)
```
This is not a huge dataset, but it is helpful to get into the habit of treating datasets as large no matter what. Because of this, make sure you inspect the size and structure of your dataset before going and printing it to the console.
Here we can see that we have 150 observations across 5 different variables.
```{r dim_view}
dim(iris)
```
There are a bunch of ways to get information on your dataset. Here are a few:
```{r other_views, message=FALSE}
str(iris)
summary(iris)
# This one requires dplyr, but it's worth it :)
library(dplyr)
glimpse(iris)
```
Plotting the data by calling `iris` to the console will print the whole thing. Go ahead and try it in this case, but this is not recommended for larger datasets. Instead, use `head()` in the console or `View()`.
If you want to learn more about these commands, or anything for that matter, just type `?<command>` into the console. `?head`, for example, will reveal that there is an additional argument to `head` called `n` for the number of lines printed, which defaults to 6. Also, you may notice there is something called `tail`. I wonder what that does? <i class="far fa-smile"></i>
## Plotting data
Let's plot something!
```{r blank_plot}
# Something's missing
library(ggplot2)
ggplot(iris)
```
Where is it? Maybe if we add some aesthetics. I remember that was an important word that came up somewhere:
```{r blank_plot_aes}
# Still not working...
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))
```
Still nothing. Remember, you have to add a `geom` for something to show up.
```{r first_ggplot}
# There we go!
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
```
Yay! Something showed up! Notice where we put the `data`, inside of `ggplot()`. ggplot is built on layers. Here we put it in the main call to `ggplot`. The `data` argument is also available in `geom_point()`, but in that case it would only apply to that layer. Here, we are saying, for all layers, unless specified, make the `data` be `iris`.
Now let's add a color mapping by `Species`:
```{r ggplot_colored}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species))
```
Usually it is helpful to store the main portion of the plot in a variable and add on the layers. The code below achieves the same output as above:
```{r ggplot_color_and_variables, eval = FALSE}
sepal_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))
sepal_plot +
geom_point(aes(color = Species))
```
## Markdown etiquette
I'm seeing that my R Markdown file is getting a little messy. Working with markdown and chunks can get out of hand, but there are some helpful tricks. First, consider naming your chunks as you go. If you combine this with headers, your work will be much more organized. Specifically, the little line at the bottom of the editor becomes much more useful.
From this:
![Pic](images/03-iris-jumpbar.png)
To this:
![Pic](images/03-iris-jumpbar-formatted.png)
Just add a name to the start of each chunk:
`{r <cool-code-chunk-name>, <chunk_option> = TRUE}`
Now you can see what the chunks were about as well as get a sense of where you are in the document. Just don't forget, it is a space after the `r` and commas for the other chunk options you may have like `eval` or `echo`.
## Overlapping data
Eagle-eyed viewers may notice that we seem to be a few points short. We should be seeing 150 points, but we only see 117 (yes, I counted). Where are those 33 missing points? They are actually hiding behind other points. This dataset rounds to the nearest tenth of a centimeter, which is what is giving us those regular placings of the points. How did I know the data was in centimeters? Running `?iris` in the console of course! [Ah, you ask a silly question, you get a silly answer](https://youtu.be/UIKGV2cTgqA?t=3m10s){target="_blank"}.
```{r overlapping_data_repeated}
# This plot hides some of the points
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species))
```
What's the culprit? The `color` aesthetic. The color by default is opaque and will hide any points that are behind it. As a rule, it is always beneficial to reduce the opacity a little no matter what to avoid this problem. To do this, change the `alpha` value to something other than it's default `1`, like `0.5`.
```{r overlap_alpha}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species, alpha = 0.5))
```
Okay...a couple things with this.
### First: the legend
**First**, did you notice the new addition to the legend? That looks silly! Why did that show up? Well, when we added the `alpha` into `aes()`, we got a new legend. Let's look at what we are doing with `geom_point()`. Specifically, this is saying how we should map the `color` and `alpha`:
`geom_point(mapping = aes(color = Species, alpha = 0.5))`
So, we are mapping these given *aesthetics*, `color` and `alpha`, to certain values. `ggplot` knows that usually the aesthetic mapping will vary since you are probably passing in data that varies, so it will create a legend for each mapping. However, we don't need a legend for the `alpha`: we explicitly set it to be 0.5. To fix this, we can pull `alpha` out of `aes` and instead treat it like an *attribute*:
```{r overlap_alpha_fix1}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), alpha = 0.5)
```
No more legend. So, in ggplot, there is a difference between where an aesthetic is placed. It is also called MAPPING an aesthetic (making it vary with data inside `aes`) or SETTING an aesthetic (make it a constant attribute across all datapoints outside of `aes`).
### Second: jittering
**Secondly**, did this alpha trick *really* help us? Are we able to see anything in the plot in an easier way? Not really. Since the points perfectly overlap, the opacity difference doesn't help us much. Usually, opacity will work, but here the data is so regular that we don't gain anything in the perception department.
We can fix this by introducing some *jitter* to the datapoints. Jitter adds a little random noise and moves the datapoints so that they don't fully overlap:
```{r overlap-jitter-fix1}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), alpha = 0.5, position = "jitter")
```
Consider your motives when using jittering. You are by definition altering the data, but it may be beneficial in some situations.
***
### *Aside*: example where alpha blending works
We are dealing with a case where jittering works best to see the data, while changing the `alpha` doesn't help us much. Here's a quick example where opacity using `alpha` might be more directly helpful.
```{r opacity-useful-aside, message=FALSE}
# lib for arranging plots side by side
library(gridExtra)
# make some normally distributed data
x_points <- rnorm(n = 10000, mean = 0, sd = 2)
y_points <- rnorm(n = 10000, mean = 6, sd = 2)
df <- data.frame(x_points, y_points)
# plot with/without changed alpha
plt1 <- ggplot(df, aes(x_points, y_points)) +
geom_point() +
ggtitle("Before (alpha = 1)")
plt2 <- ggplot(df, aes(x_points, y_points)) +
geom_point(alpha = 0.1) +
ggtitle("After (alpha = 0.1)")
# arrange plots
gridExtra::grid.arrange(plt1, plt2,
ncol = 2, nrow = 1)
```
Here it is much easier to see where the dataset is concentrated.
***
## Formatting for presentation
Let's say we have finished this plot and we are ready to present it to other people:
```{r present-plot, echo = FALSE}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), alpha = 0.5, position = "jitter")
```
We should clean it up a bit so it can stand on its own.
## Alter appearance
First, let's make the `x`/`y` labels a little cleaner and more descriptive:
```{r clean-xy-labels}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), alpha = 0.5, position = "jitter") +
xlab("Sepal Length (cm)") +
ylab("Sepal Width (cm)")
```
Next, add a title that encapsulates the plot:
```{r title_added}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), alpha = 0.5, position = "jitter") +
xlab("Sepal Length (cm)") +
ylab("Sepal Width (cm)") +
ggtitle("Sepal Dimensions in Different Species of Iris Flowers")
```
And make the points a little bigger:
```{r size_fixed}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") +
xlab("Sepal Length (cm)") +
ylab("Sepal Width (cm)") +
ggtitle("Sepal Dimensions in Different Species of Iris Flowers")
```
Now it's looking presentable.
## Consider themes
It may be better for your situation to change the theme of the plot (the background, axes, etc.; the "accessories" of the plot). Explore what different themes can offer and pick one that is right for you.
```{r themes}
base_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") +
xlab("Sepal Length (cm)") +
ylab("Sepal Width (cm)") +
ggtitle("Sepal Dimensions in Different Species of Iris Flowers")
base_plot
base_plot + theme_light()
base_plot + theme_minimal()
base_plot + theme_classic()
base_plot + theme_void()
```
I'm going to go with `theme_minimal()` this time.
***
So here we are! We got a lovely scatterplot ready to show the world!
```{r final-result, echo = FALSE}
base_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") +
xlab("Sepal Length (cm)") +
ylab("Sepal Width (cm)") +
ggtitle("Sepal Dimensions in Different Species of Iris Flowers")
base_plot + theme_minimal()
```
## Going deeper
We have just touched the surface of ggplot and dipped our toes into grammar of graphics. If you want to go deeper, I highly recommend the DataCamp courses on *Data Visualization with ggplot2* with [Rick Scavetta](https://www.datacamp.com/instructors/rickscavetta){target="_blank"}. There are three parts and they are quite dense, but the [first part](https://www.datacamp.com/courses/data-visualization-with-ggplot2-1){target="_blank"} is definitely worth checking out.
## Helpful links
[RStudio ggplot2 Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf){target="_blank"}
[DataCamp: Mapping aesthetics to things in ggplot](https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-3-aesthetics?ex=1){target="_blank"}
[R Markdown Reference Guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf){target="_blank"}
[R for Data Science](http://r4ds.had.co.nz/){target="_blank"}