# Data in R {#data_r}
Until now, you've created fairly simple data in R and stored it as a [vector](#funcs). However, most (if not all) of you will have much more complicated datasets from your various experiments and surveys that go well beyond what a vector can handle. Learning how R deals with different types of data and data structures, how to import your data into R and how to manipulate and summarise your data are some of the most important skills you will need to master.
In this Chapter we'll go over the main data types in R and focus on some of the most common data structures. We will also cover how to import data into R from an external file, how to manipulate (wrangle) and summarise data and finally how to export data from R to an external file.
## Data types
Understanding the different types of data and how R deals with these data is important. The temptation is to glaze over and skip these technical details, but beware, this can come back to bite you somewhere unpleasant if you don't pay attention. We've already seen an [example](#r_objs) of this when we tried (and failed) to add two character objects together using the `+` operator.
R has six basic types of data: numeric, integer, logical, complex and character. The keen-eyed among you will notice we've only listed five data types here; the final data type is raw, which we won't cover as it's not useful 99.99% of the time. We also won't cover complex numbers as we don't have the [imagination][complex_num]!
\
- **Numeric** data are numbers that contain a decimal. Actually they can also be whole numbers but we'll gloss over that.
- **Integers** are whole numbers (those numbers without a decimal point).
- **Logical** data take on the value of either `TRUE` or `FALSE`. There's also another special type of logical called `NA` to represent missing values.
- **Character** data are used to represent string values. You can think of character strings as something like a word (or multiple words). A special type of character string is a *factor*, which is a string but with additional attributes (like levels or an order). We'll cover factors later.
\
R is (usually) able to automatically distinguish between different classes of data by their nature and the context in which they're used although you should bear in mind that R can't actually read your mind and you may have to explicitly tell R how you want to treat a data type. You can find out the type (or class) of any object using the `class()`\index{class()} function.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
num <- 2.2
class(num)
char <- "hello"
class(char)
logi <- TRUE
class(logi)
```
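The distinction between numeric and integer data described in the list above can be checked in the same way. Here's a small sketch (the object names `whole_num`, `int_num` and `missing_val` are our own and chosen purely for illustration) showing that a whole number is still stored as numeric unless you ask for an integer with the `L` suffix, and that a bare `NA` is a logical.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
whole_num <- 2       # a whole number, but stored as numeric by default
class(whole_num)
int_num <- 2L        # the L suffix asks R to store an integer
class(int_num)
missing_val <- NA    # a bare NA is a logical value
class(missing_val)
```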
Alternatively, you can ask if an object is a specific class using a logical test. The `is.[classOfData]()` family of functions will return either a `TRUE` or a `FALSE`.\index{is.numeric()} \index{is.character()} \index{is.logical()}
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
is.numeric(num)
is.character(num)
is.character(char)
is.logical(logi)
```
It can sometimes be useful to be able to change the class of a variable using the `as.[className]()` family of coercion functions, although you need to be careful when doing this as you might receive some unexpected results (see what happens below when we try to convert a character string to a numeric). \index{as.character()} \index{as.numeric()} \index{as.logical()} \index{as.factor()} \index{as.complex()}
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
# coerce numeric to character
class(num)
num_char <- as.character(num)
num_char
class(num_char)
# coerce character to numeric!
class(char)
char_num <- as.numeric(char)
```
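To make the 'unexpected results' a little more concrete, here's a hedged sketch with a few made-up values (the object names are ours). Strings that look like numbers convert cleanly, strings that don't become `NA`, and converting a numeric to an integer drops the decimal part rather than rounding.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
# a string that looks like a number converts without complaint
ok_num <- as.numeric("3.14")
ok_num
# a string that doesn't look like a number becomes NA
# (the 'NAs introduced by coercion' warning is silenced here)
bad_num <- suppressWarnings(as.numeric("hello"))
bad_num
# as.integer() truncates towards zero, it does not round
trunc_int <- as.integer(2.9)
trunc_int
# only strings such as "TRUE", "true", "T" or "FALSE" convert to logical
as.logical(c("TRUE", "yes"))
```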
Here's a summary table of some of the logical test and coercion functions available to you.\index{is.numeric()} \index{is.factor()}
| Type | Logical test | Coercing |
|:--------------:|:-------------------:|:-----------------:|
| Character | `is.character` | `as.character` |
| Numeric | `is.numeric` | `as.numeric` |
| Logical | `is.logical` | `as.logical` |
| Factor | `is.factor` | `as.factor` |
| Complex | `is.complex` | `as.complex` |
## Data structures
Now that you've been introduced to some of the most important classes of data in R, let’s have a look at some of the main structures that we have for storing these data.
### Scalars and vectors {#scal_vecs}
Perhaps the simplest type of data structure is the vector. You've already been introduced to vectors in [Chapter 2](#funcs) although some of the vectors you created only contained a single value. Vectors that have a single value (length 1) are called scalars. Vectors can contain numbers, characters, factors or logicals, but the key thing to remember is that all the elements inside a vector must be of the same class. In other words, vectors can contain either numbers, characters or logicals but not mixtures of these types of data. There is one important exception to this: you can include `NA` (remember this is a special type of logical) to denote missing data in vectors with other data types.
\
```{r data_struc, echo=FALSE, out.width="40%", fig.align="center"}
knitr::include_graphics(path = "images/scal_vec.png")
```
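If you do try to mix data types in a vector, R won't complain but will quietly convert everything to a single class. The sketch below uses a few made-up values (and our own object names) to show this, along with the `NA` exception mentioned above.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
# mixing numbers, strings and logicals gives a character vector
mixed_vec <- c(1, "a", TRUE)
mixed_vec
class(mixed_vec)
# mixing numbers and logicals gives a numeric vector (TRUE = 1, FALSE = 0)
num_logi <- c(1, TRUE, FALSE)
num_logi
# an NA can sit in a numeric vector without changing its class
with_na <- c(1.5, NA, 3)
class(with_na)
is.na(with_na)
```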
### Matrices and arrays {#mat_array}
Another useful data structure used in many disciplines such as population ecology, theoretical and applied statistics is the matrix. A matrix is simply a vector that has additional attributes called dimensions. Arrays are just multidimensional matrices. Again, matrices and arrays must contain elements all of the same data class.
\
```{r data_struc2, echo=FALSE, out.width="50%", fig.align="center"}
knitr::include_graphics(path = "images/mat_array.png")
```
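To see that a matrix really is just a vector with a dimensions attribute, we can take an ordinary vector and give it dimensions ourselves. This is only a sketch to illustrate the idea (the object name `vec_mat` is ours) and is not how you would usually create a matrix.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
vec_mat <- 1:6             # an ordinary vector
is.matrix(vec_mat)
dim(vec_mat) <- c(2, 3)    # add dimensions: 2 rows and 3 columns
is.matrix(vec_mat)
vec_mat                    # the values have been filled column-wise
```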
\
A convenient way to create a matrix or an array is to use the `matrix()`\index{matrix()} and `array()`\index{array()} functions respectively. Below, we will create a matrix from the sequence 1 to 16 in four rows (`nrow = 4`) and fill the matrix row-wise (`byrow = TRUE`) rather than the default column-wise. When using the `array()` function we define the dimensions using the `dim =` argument, in our case 2 rows, 4 columns in 2 different matrices.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
my_mat <- matrix(1:16, nrow = 4, byrow = TRUE)
my_mat
my_array <- array(1:16, dim = c(2, 4, 2))
my_array
```
Sometimes it's also useful to define row and column names for your matrix but this is not a requirement. To do this use the `rownames()`\index{rownames()} and `colnames()`\index{colnames()} functions.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
rownames(my_mat) <- c("A", "B", "C", "D")
colnames(my_mat) <- c("a", "b", "c", "d")
my_mat
```
Once you've created your matrices you can do useful stuff with them and as you'd expect, R has numerous built in functions to perform matrix operations. Some of the most common are given below. For example, to transpose a matrix we use the transposition function `t()`\index{t()}.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
my_mat_t <- t(my_mat)
my_mat_t
```
To extract the diagonal elements of a matrix and store them as a vector we can use the `diag()`\index{diag()} function.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
my_mat_diag <- diag(my_mat)
my_mat_diag
```
The usual matrix addition, multiplication etc can be performed. Note the use of the `%*%` operator to perform matrix multiplication.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
mat.1 <- matrix(c(2, 0, 1, 1), nrow = 2) # notice that the matrix has been filled
mat.1 # column-wise by default
mat.2 <- matrix(c(1, 1, 0, 2), nrow = 2)
mat.2
mat.1 + mat.2 # matrix addition
mat.1 * mat.2 # element by element products
mat.1 %*% mat.2 # matrix multiplication
```
### Lists {#lists}
The next data structure we will quickly take a look at is a list. Whilst vectors and matrices are constrained to contain data of the same type, lists are able to store mixtures of data types. In fact we can even store other data structures such as vectors and arrays within a list or even have a list of a list. This makes for a very flexible data structure which is ideal for storing irregular or non-rectangular data (see [Chapter 7](#prog_r) for an example).
To create a list we can use the `list()`\index{list()} function. Note how each of the three list elements are of different classes (character, logical, and numeric) and are of different lengths.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
list_1 <- list(c("black", "yellow", "orange"),
c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
matrix(1:6, nrow = 3))
list_1
```
Elements of the list can be named during the construction of the list
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
list_2 <- list(colours = c("black", "yellow", "orange"),
evaluation = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
time = matrix(1:6, nrow = 3))
list_2
```
or after the list has been created using the `names()`\index{names()} function.
```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
names(list_1) <- c("colours", "evaluation", "time")
list_1
```
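We mentioned above that a list can even contain another list. Here's a minimal sketch of this with some made-up content (the object and element names are ours) where the `str()` function gives a compact overview of what's inside and `sapply()` reports the class of each top level element.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
nested_list <- list(site = "A",
                    counts = c(3, 7, 2),
                    extra = list(flag = TRUE, note = "checked"))
length(nested_list)          # three top level elements
sapply(nested_list, class)   # the class of each element
str(nested_list)             # a compact overview of the whole structure
```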
### Data frames {#df}
```{block2, vid-text8, type='rmdvideo'}
Take a look at this [video][dataf-vid] for a quick introduction to data frame objects in R
```
\
By far the most commonly used data structure to store data in is the data frame. A data frame is a powerful two-dimensional object made up of rows and columns which looks superficially very similar to a matrix. However, whilst matrices are restricted to containing data all of the same type, data frames can contain a mixture of different types of data. Typically, in a data frame each row corresponds to an individual observation and each column corresponds to a different measured or recorded variable. This setup may be familiar to those of you who use LibreOffice Calc or Microsoft Excel to manage and store your data. Perhaps a useful way to think about data frames is that they are essentially made up of a bunch of vectors (columns) with each vector containing its own data type but the data type can be different between vectors.
As an example, the data frame below contains the results of an experiment to determine the effect of removing the tip of petunia plants (*Petunia sp.*) grown at 3 levels of nitrogen on various measures of growth (note: data shown below are a subset of the full dataset). The data frame has 8 variables (columns) and each row represents an individual plant. The variables `treat` and `nitrogen` are factors ([categorical][cat-var] variables). The `treat` variable has 2 levels (`tip` and `notip`) and the `nitrogen` level variable has 3 levels (`low`, `medium` and `high`). The variables `height`, `weight`, `leafarea` and `shootarea` are numeric and the variable `flowers` is an integer representing the number of flowers. Although the variable `block` has numeric values, these do not really have any order and could also be treated as a factor (i.e. they could also have been called A and B).
\
```{r import-data-html, eval=knitr::is_html_output(), echo=FALSE, collapse=TRUE}
flowers <- read.table('data/flower.txt', header = TRUE, stringsAsFactors = TRUE)
knitr::kable(rbind(head(flowers), tail(flowers)), row.names = FALSE)
```
```{r import-data-latex, eval=knitr::is_latex_output(), echo=F, collapse=T, warning=F, message=F}
library(kableExtra)
flowers <- read.table('data/flower.txt', header = TRUE)
knitr::kable(rbind(head(flowers), tail(flowers)), row.names = FALSE, "latex", booktabs = T) %>%
kable_styling(latex_options = "striped")
```
\
There are a couple of important things to bear in mind about data frames. These types of objects are known as rectangular data (or tidy data) as each column must have the same number of observations. Also, any missing data should be recorded as an `NA` just as we did with our vectors.
We can construct a data frame from existing data objects such as vectors using the `data.frame()`\index{data.frame()} function. As an example, let's create three vectors `p.height`, `p.weight` and `p.names` and include all of these vectors in a data frame object called `dataf`.
```{r dataf, echo=TRUE, collapse=TRUE}
p.height <- c(180, 155, 160, 167, 181)
p.weight <- c(65, 50, 52, 58, 70)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
dataf <- data.frame(height = p.height, weight = p.weight, names = p.names)
dataf
```
You'll notice that each of the columns is named with the variable name we supplied when we used the `data.frame()` function. It also looks like the first column of the data frame is a series of numbers from one to five. Actually, this is not really a column but the name of each row. We can check this out by getting R to return the dimensions of the `dataf` object using the `dim()`\index{dim()} function. We see that there are 5 rows and 3 columns.
```{r dataf2, echo=TRUE, collapse=TRUE}
dim(dataf) # 5 rows and 3 columns
```
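Returning briefly to the two points we made earlier about rectangular data and missing values, here's a hedged sketch using a tiny made-up data frame (the object names and values are ours, not taken from the petunia data). Each column keeps its own class, missing values are recorded as `NA`, and R will refuse to build a data frame from columns with mismatched lengths.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
plant_df <- data.frame(treat = c("tip", "notip", "tip"),
                       height = c(7.5, NA, 10.7),
                       flowers = c(1L, 3L, NA))
sapply(plant_df, class)      # each column has its own class
colSums(is.na(plant_df))     # number of missing values in each column
# columns of length 3 and 2 can't be combined, so this fails
bad_df <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad_df, "try-error")
```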
Another really useful function which we use all the time is `str()`\index{str()} which will return a compact summary of the structure of the data frame object (or any object for that matter).
```{r dataf3, echo=TRUE, collapse=TRUE}
str(dataf)
```
The `str()` function gives us the data frame dimensions and also reminds us that `dataf` is a `data.frame` type object. It also lists all of the variables (columns) contained in the data frame, tells us what type of data the variables contain and prints out the first five values. We often copy this summary and place it in our R scripts with comments at the beginning of each line so we can easily refer back to it whilst writing our code. We showed you how to comment blocks in RStudio [here](#proj_doc).
Also notice that R has automatically decided that our `p.names` variable should be a character (`chr`) class variable when we first created the data frame. Whether this is a good idea or not will depend on how you want to use this variable in later analysis. If we decide that this wasn't such a good idea we can change the default behaviour of the `data.frame()` function by including the argument `stringsAsFactors = TRUE`. Now our strings are automatically converted to factors.
```{r dataf4, echo=TRUE, collapse=TRUE}
p.height <- c(180, 155, 160, 167, 181)
p.weight <- c(65, 50, 52, 58, 70)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
dataf <- data.frame(height = p.height, weight = p.weight, names = p.names,
stringsAsFactors = TRUE)
str(dataf)
```
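If you'd rather control how a character variable becomes a factor, one option is to convert the vector yourself before building the data frame. The sketch below uses a made-up vector (the names `grp`, `grp_default` and `grp_ordered` are ours). By default the levels are sorted alphabetically, whereas the `levels =` argument of the `factor()` function lets you set the order explicitly.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
grp <- c("low", "high", "low", "medium")
# default conversion: levels are sorted alphabetically
grp_default <- as.factor(grp)
levels(grp_default)
# explicit conversion: we choose the order of the levels
grp_ordered <- factor(grp, levels = c("low", "medium", "high"))
levels(grp_ordered)
# behind the scenes each value is stored as an integer code
as.integer(grp_ordered)
```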
\
## Importing data
Although creating data frames from existing data structures is extremely useful, by far the most common approach is to create a data frame by importing data from an external file. To do this, you'll need to have your data correctly formatted and saved in a file format that R is able to recognise. Fortunately for us, R is able to recognise a wide variety of file formats, although in reality you'll probably end up only using two or three regularly.
\
```{block2, vid-text9, type='rmdvideo'}
Take a look at this [video][import-vid] for a quick introduction to importing data in R
```
### Saving files to import
The easiest method of creating a data file to import into R is to enter your data into a spreadsheet using either Microsoft Excel or LibreOffice Calc and save the spreadsheet as a tab delimited file. We prefer LibreOffice Calc as it's open source, platform independent and free but MS Excel is OK too (but see [here][excel_gotcha] for some gotchas). Here's the data from the petunia experiment we discussed previously displayed in LibreOffice. If you want to follow along you can download the data file (*'flower.xls'*) from the companion website [here][flow-data].
\
```{r LO-calc0, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/libre_off.png")
```
\
For those of you unfamiliar with the tab delimited file format it simply means that data in different columns are separated with a 'tab' character (yes, the same one as on your keyboard) and is usually saved as a file with a '.txt' extension (you might also see `.tsv` which is short for tab separated values).
To save a spreadsheet as a tab delimited file in LibreOffice Calc select `File` -> `Save as ...` from the main menu. You will need to specify the location you want to save your file in the 'Save in folder' option and the name of the file in the 'Name' option. In the drop down menu located above the 'Save' button change the default 'All formats' to 'Text CSV (.csv)'.
\
```{r LO-calc, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/libre_off1.png")
```
\
Click the Save button and then select the 'Use Text CSV Format' option. In the next pop-up window select `{Tab}` from the drop down menu in the 'Field delimiter' option. Click on OK to save the file.
\
```{r LO-calc2, echo=FALSE, out.width="40%", fig.align="center"}
knitr::include_graphics(path = "images/libre_off2.png")
```
\
The resulting file will annoyingly have a '.csv' extension even though we've saved it as a tab delimited file. Either live with it or rename the file with a '.txt' extension instead.
In MS Excel, select `File` -> `Save as ...` from the main menu and navigate to the folder where you want to save the file. Enter the file name (keep it fairly short, [no spaces](#file_names)!) in the 'Save as:' dialogue box. In the 'File format:' dialogue box click on the down arrow to open the drop down menu and select 'Text (Tab delimited)' as your file type. Select OK to save the file.
\
```{r ms-excel, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/ms_excel1.png")
```
\
There are a couple of things to bear in mind when saving files to import into R which will make your life easier in the long run. Keep your column headings (if you have them) short and informative. Also avoid spaces in your column headings by replacing them with an underscore or a dot (i.e. replace `shoot height` with `shoot_height` or `shoot.height`) and avoid using special characters (e.g. `leaf area (mm^2)`). Remember, if you have missing data in your data frame (empty cells) you should use an `NA` to represent these missing values. This will keep the data frame tidy.
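To get a feel for why awkward column headings are best avoided, the base R `make.names()` function shows how R would tidy up a heading to make it a valid variable name (the import functions we meet below do something similar by default). This is just an illustrative sketch using the two headings mentioned above.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
# spaces and special characters are each replaced with a dot
make.names("shoot height")
make.names("leaf area (mm^2)")
# a heading that is already valid is left alone
make.names("shoot_height")
```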
### Import functions {#import_fnc}
Once you've saved your data file in a suitable format we can now read this file into R. The workhorse function for importing data into R is the `read.table()`\index{read.table()} function (we discuss some alternatives later in the chapter). The `read.table()` function is a very flexible function with a shed load of arguments (see `?read.table`) but it's quite simple to use. Let's import a tab delimited file called `flower.txt` which contains the data we saw previously in this [Chapter](#df) and assign it to an object called `flowers`. The file is located in a `data` directory which itself is located in our [root directory](#dir_struct). The first row of the data contains the variable (column) names. We can use the `read.table()` function to import this file as follows.
```{r df1, echo=TRUE, collapse=TRUE}
flowers <- read.table(file = 'data/flower.txt', header = TRUE, sep = "\t",
stringsAsFactors = TRUE)
```
There are a few things to note about the above command. First, the file path and the filename (including the file extension) need to be enclosed in either single or double quotes (i.e. the `data/flower.txt` bit) as the `read.table()` function expects this to be a character string. If your working directory is already set to the directory which contains the file, you don’t need to include the entire file path, just the filename. In the example above, the file path is separated with a single forward slash `/`. This will work regardless of the operating system you are using and we recommend you stick with this. However, Windows users may be more familiar with the single backslash notation and if you want to keep using this you will need to include them as double backslashes. Note though that the double backslash notation will **not** work on computers using macOS or Linux operating systems.
```{r df2, echo=TRUE, eval=FALSE}
flowers <- read.table(file = 'C:\\Documents\\Prog1\\data\\flower.txt',
header = TRUE, sep = "\t", stringsAsFactors = TRUE)
```
The `header = TRUE` argument specifies that the first row of your data contains the variable names (i.e. `nitrogen`, `block` etc). If this is not the case you can specify `header = FALSE` (actually, this is the default value so you can omit this argument entirely). The `sep = "\t"` argument tells R that the file delimiter is a tab (`\t`).
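If you'd like to experiment with these arguments without touching your own data, here's a self-contained sketch. It writes a tiny made-up tab delimited file to R's temporary directory and then imports it again. The file name `example_tab.txt`, the object names and the values are all our own inventions for illustration. The `file.path()` function builds the path using forward slashes for us.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
# build a path to a file in R's temporary directory
tmp_file <- file.path(tempdir(), "example_tab.txt")
# write a header row and three rows of data separated by tabs
writeLines(c("treat\theight\tflowers",
             "tip\t7.5\t1",
             "notip\t10.7\t3",
             "tip\t11.2\t5"), con = tmp_file)
# import the file: the first row holds the names, columns are tab separated
ex_import <- read.table(file = tmp_file, header = TRUE, sep = "\t")
str(ex_import)
```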
After importing our data into R it doesn't appear that R has done much, at least nothing appears in the R Console! To see the contents of the data frame we could just type the name of the object as we have done previously. **BUT** before you do that, think about why you're doing this. If your data frame is anything other than tiny, all you're going to do is fill up your Console with data. It's not like you can easily check whether there are any errors or that your data has been imported correctly. A much better solution is to use our old friend the `str()` function to return a compact and informative summary of your data frame.
```{r df3, echo=TRUE, collapse=TRUE}
str(flowers)
```
Here we see that `flowers` is a 'data.frame' object which contains 96 rows and 8 variables (columns). Each of the variables are listed along with their data class and the first 10 values. As we mentioned previously in this Chapter, it can be quite convenient to copy and paste this into your R script as a comment block for later reference.
Notice also that your character string variables (`treat` and `nitrogen`) have been imported as factors because we used the argument `stringsAsFactors = TRUE`. If this is not what you want you can prevent this by using the `stringsAsFactors = FALSE` argument or, from R version 4.0.0, you can just leave out this argument as `stringsAsFactors = FALSE` is the default.
```{r df4, echo=TRUE, collapse=TRUE}
flowers <- read.table(file = 'data/flower.txt', header = TRUE,
sep = "\t", stringsAsFactors = FALSE)
str(flowers)
```
Other useful arguments include `dec =` and `na.strings =`. The `dec =` argument allows you to change the default character (`.`) used for a decimal point. This is useful if you're in a country where decimal places are usually represented by a comma (i.e. `dec = ","`). The `na.strings =` argument allows you to import data where missing values are represented with a symbol other than `NA`. This can be quite common if you are importing data from other statistical software such as Minitab which represents missing values as a `*` (`na.strings = "*"`).
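Here's a hedged sketch of both arguments in action. We write a small made-up file to a temporary location in which columns are separated by semicolons, decimals are written with a comma and missing values are recorded as `*` (the file contents and object names are ours, purely for illustration).

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
tmp_eu <- tempfile(fileext = ".txt")
writeLines(c("plot;height;weight",
             "1;7,5;*",
             "2;*;10,2",
             "3;11,2;8,9"), con = tmp_eu)
# tell R about the separator, the decimal character and the missing value symbol
ex_eu <- read.table(file = tmp_eu, header = TRUE, sep = ";",
                    dec = ",", na.strings = "*")
str(ex_eu)   # height and weight are numeric with NA for missing values
```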
If we just wanted to see the names of our variables (columns) in the data frame we can use the `names()` function which will return a character vector of the variable names.
```{r df4.1, echo=TRUE, collapse=TRUE}
names(flowers)
```
R has a number of variants of the `read.table()` function that you can use to import a variety of file formats. Actually, these variants just use the `read.table()` function but include different combinations of arguments by default to help import different file types. The most useful of these are the `read.csv()`\index{read.csv()}, `read.csv2()`\index{read.csv2()} and `read.delim()`\index{read.delim()} functions. The `read.csv()` function is used to import comma separated value (.csv) files and assumes that the data in columns are separated by a comma (it sets `sep = ","` by default). It also assumes that the first row of the data contains the variable names by default (it sets `header = TRUE` by default). The `read.csv2()` function assumes data are separated by semicolons and that a comma is used instead of a decimal point (as in many European countries). The `read.delim()` function is used to import tab delimited data and also assumes that the first row of the data contains the variable names by default.
```{r df5, echo=TRUE, eval=FALSE}
# import .csv file
flowers <- read.csv(file = 'data/flower.csv')
# import .csv file with dec = "," and sep = ";"
flowers <- read.csv2(file = 'data/flower.csv')
# import tab delim file with sep = "\t"
flowers <- read.delim(file = 'data/flower.txt')
```
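To convince yourself that these variants are just `read.table()` with different defaults, you can import the same file both ways and compare the results. The sketch below does this with a small made-up tab delimited file written to a temporary location (the object names are ours). For a simple file like this one the two data frames should be the same.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
tmp_tab <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue",
             "a\t1.5",
             "b\t2.5"), con = tmp_tab)
# spell out the arguments with read.table()
via_table <- read.table(file = tmp_tab, header = TRUE, sep = "\t")
# let read.delim() supply header = TRUE and sep = "\t" for us
via_delim <- read.delim(file = tmp_tab)
all.equal(via_table, via_delim)
```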
You can even import spreadsheet files from MS Excel or other statistics software directly into R but our advice is that this should generally be avoided if possible as it just adds a layer of uncertainty between you and your data. In our opinion it's almost always better to export your spreadsheets as tab or comma delimited files and then import them into R using the `read.table()` function. If you're hell bent on directly importing data from other software you will need to install the `foreign` package which has functions for importing Minitab, SPSS, Stata and SAS files or the `xlsx` package to import Excel spreadsheets.
### Common import frustrations
It's quite common to get a bunch of really frustrating error messages when you first start importing data into R. Perhaps the most common is
```{r, eval=FALSE}
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'flower.txt': No such file or directory
```
This error message is telling you that R cannot find the file you are trying to import. It usually rears its head for one of a couple of reasons (or all of them!). The first is that you've made a mistake in the spelling of either the filename or file path. Another common mistake is that you have forgotten to include the file extension in the filename (i.e. `.txt`). Lastly, the file is not where you say it is or you've used an incorrect file path. Using RStudio [Projects](#rsprojs) and having a logical [directory structure](#dir_struct) goes a long way to avoiding these types of errors.
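A quick way to track down this kind of problem is to ask R where it's currently looking and whether it can see the file at all. The sketch below is one possible approach rather than a definitive recipe, and the file name `no_such_file.txt` is a deliberately made-up name that we don't expect to exist.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
# a hypothetical file that we don't expect to exist
wanted_file <- file.path(tempdir(), "no_such_file.txt")
# where is R currently looking for files?
getwd()
# can R see the file?
file.exists(wanted_file)
# only try to import the file if it's really there
if (file.exists(wanted_file)) {
  my_data <- read.table(file = wanted_file, header = TRUE, sep = "\t")
} else {
  message("Can't find the file, check the spelling, extension and path")
}
```

Using `list.files()` on the directory you think contains the file is another handy check as it will show you exactly how the filename is spelled.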
Another really common mistake is to forget to include the `header = TRUE` argument when the first row of the data contains variable names. For example, if we omit this argument when we import our `flower.txt` file everything looks OK at first (no error message at least)
```{r df6, echo=TRUE, collapse=TRUE}
flowers_bad <- read.table(file = 'data/flower.txt', sep = "\t")
```
but when we take a look at our data frame using `str()`
```{r df7, echo=TRUE, collapse=TRUE}
str(flowers_bad)
```
We can see an obvious problem: all of our variables have been imported as character strings (or as factors if you're using a version of R older than 4.0.0) and our variables are named `V1`, `V2`, `V3` ... `V8`. The problem happens because we haven't told the `read.table()` function that the first row contains the variable names and so it treats them as data. As soon as we have a single character string in any of our data vectors, R treats the vectors as character type data (remember all elements in a vector must contain the [same type of data](#scal_vecs)).
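The same problem is easy to reproduce on a small scale. In this sketch we write a tiny made-up file with two numeric columns to a temporary location (the contents and object names are ours) and import it with and without the `header = TRUE` argument.

```{r, echo=TRUE, eval=TRUE, collapse=TRUE}
tmp_head <- tempfile(fileext = ".txt")
writeLines(c("height\tweight",
             "7.5\t7.62",
             "10.7\t12.14"), con = tmp_head)
# without header = TRUE the column names are treated as data
no_header <- read.table(file = tmp_head, sep = "\t")
names(no_header)
nrow(no_header)                # one row too many
sapply(no_header, is.numeric)  # no longer numeric
# with header = TRUE everything is imported as intended
with_header <- read.table(file = tmp_head, header = TRUE, sep = "\t")
names(with_header)
sapply(with_header, is.numeric)
```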
### Other import options {#import_other}
There are numerous other functions to import data from a variety of sources and formats. Most of these functions are contained in packages that you will need to install before using them. We list a couple of the more useful packages and functions below.
The `fread()`\index{fread()} function from the `data.table` package \index{data.table package} is great for importing large data files quickly and efficiently (much faster than the `read.table()` function). One of the great things about the `fread()` function is that it will automatically detect many of the arguments you would normally need to specify (like `sep =` etc). One thing you might need to consider is that the `fread()` function will return a `data.table` object by default, not a `data.frame` as would be the case with the `read.table()` function. This is usually not a problem and you can change this default behaviour by using the argument `data.table = FALSE` when you use `fread()` if you prefer a `data.frame` object. To learn more about the differences between `data.table` and `data.frame` objects see [here][data-table].
```{r df8, echo=TRUE, eval=FALSE}
library(data.table)
all_data <- fread(file = 'data/flower.txt')
```
Various functions from the `readr`\index{readr package} package are also very efficient at reading in large data files. The `readr` package is part of the '[tidyverse][tidyverse]' collection of packages and provides many equivalent functions to base R for importing data. The `readr` functions are used in a similar way to the `read.table()` or `read.csv()` functions and many of the arguments are the same (see `?readr::read_table` for more details). There are however some differences. For example, when using the `read_table()`\index{read\_table()} function the `header = TRUE` argument is replaced by `col_names = TRUE` and the function returns a `tibble` class object which is the tidyverse equivalent of a `data.frame` object (see [here][tibbles] for differences). \index{read\_csv()} \index{read\_delim()} \index{read\_tsv()}
```{r df9, echo=TRUE, eval=FALSE}
library(readr)
# import white space delimited files
all_data <- read_table(file = 'data/flower.txt', col_names = TRUE)
# import comma delimited files
all_data <- read_csv(file = 'data/flower.txt')
# import tab delimited files
all_data <- read_delim(file = 'data/flower.txt', delim = "\t")
# or use
all_data <- read_tsv(file = 'data/flower.txt')
```
If your data file is ginormous, then the `ff`\index{ff package} and `bigmemory`\index{bigmemory package} packages may be useful as they both contain import functions that are able to store large data in a memory efficient manner. You can find out more about these functions [here][ff] and [here][bigmem].
## Wrangling data frames
Now that you're able to successfully import your data from an external file into R, our next task is to do something useful with it. Working with data is a fundamental skill which you'll need to develop and get comfortable with as you'll likely do a lot of it during any project. The good news is that R is especially good at manipulating, summarising and visualising data. Manipulating data (often known as data wrangling or munging) in R can at first seem a little daunting for the new user, but if you follow a few simple logical rules then you'll quickly get the hang of it, especially with some practice.
\
```{block2, vid-text10, type='rmdvideo'}
See this [video][dataw-vid] for a general overview on how to use positional and logical indexes to extract data from a data frame object in R.
```
\
Let's remind ourselves of the structure of the `flowers` data frame we imported in the previous section.
```{r dw1, echo=TRUE, collapse=TRUE}
flowers <- read.table(file = 'data/flower.txt', header = TRUE, sep = "\t")
str(flowers)
```
To access the data in any of the variables (columns) in our data frame we can use the `$` notation. For example, to access the `height` variable in our `flowers` data frame we can use `flowers$height`. This tells R that the `height` variable is contained within the data frame `flowers`.
```{r dw2, echo=TRUE, collapse=TRUE}
flowers$height
```
This will return a vector of the `height` data. If we want we can assign this vector to another object and do stuff with it, like calculate a mean or get a summary of the variable using the `summary()`\index{summary()} function.
```{r dw3, echo=TRUE, collapse=TRUE}
f_height <- flowers$height
mean(f_height)
summary(f_height)
```
Or if we don't want to create an additional object we can use functions 'on-the-fly' to only display the value in the console.
```{r dw3.1, echo=TRUE, collapse=TRUE}
mean(flowers$height)
summary(flowers$height)
```
Just as we did with [vectors](#vectors), we also can access data in data frames using the square bracket `[ ]` notation. However, instead of just using a single index, we now need to use two indexes, one to specify the rows and one for the columns. To do this, we can use the notation `my_data[rows, columns]` where `rows` and `columns` are indexes and `my_data` is the name of the data frame. Again, just like with our vectors our indexes can be positional or the result of a logical test.
### Positional indexes
To use positional indexes we simply have to write the position of the rows and columns we want to extract inside the `[ ]`. For example, if for some reason we wanted to extract the first value (1^st^ row) of the `height` variable (4^th^ column) we could use
```{r dw4, echo=TRUE, collapse=TRUE}
flowers[1, 4]
# this would give you the same
flowers$height[1]
```
We can also extract values from multiple rows or columns by specifying these indexes as vectors inside the `[ ]`. To extract the first 10 rows and the first 4 columns we simply supply a vector containing a sequence from 1 to 10 for the rows index (`1:10`) and a vector from 1 to 4 for the column index (`1:4`).
```{r dw5, echo=TRUE, collapse=TRUE}
flowers[1:10, 1:4]
```
Or, for non-sequential rows and columns, we can supply vectors of positions using the `c()` function. For example, to extract the 1^st^, 5^th^, 12^th^ and 30^th^ rows from the 1^st^, 3^rd^, 6^th^ and 8^th^ columns we can use
```{r dw6, echo=TRUE, collapse=TRUE}
flowers[c(1, 5, 12, 30), c(1, 3, 6, 8)]
```
All we are doing in the two examples above is creating vectors of positions for the rows and columns that we want to extract. We have done this by using the skills we developed in [Chapter 2](#funcs) when we generated vectors using the `c()` function or using the `:` notation.
But what if we want to extract either all of the rows or all of the columns? It would be extremely tedious to have to generate vectors for all rows or for all columns. Thankfully R has a shortcut. If you don't specify either a row or column index in the `[ ]` then R interprets it to mean you want all rows or all columns. For example, to extract the first 8 rows and all of the columns in the `flowers` data frame
```{r dw7, echo=TRUE, collapse=TRUE}
flowers[1:8, ]
```
or all of the rows and the first 3 columns. If you're reading the web version of this book, scroll down in the output panel to see all of the data. Note, if you're reading the pdf version of the book some of the output has been truncated to save some space.
```{r dw8, echo=TRUE, eval=FALSE}
flowers[, 1:3]
```
```{r dw8-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
flowers[, 1:3]
```
```{r dw8-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
rbind(head(flowers[, 1:3], n= 10), tail(flowers[, 1:3], n= 10))
```
We can even use negative positional indexes to exclude certain rows and columns. As an example, let's extract all of the rows except the first 85 rows and all columns except the 4^th^, 7^th^ and 8^th^ columns. Notice we need to use `-()` when we generate our row positional vectors. If we had just used `-1:85` this would actually generate a regular sequence from -1 to 85 which is not what we want (we can of course use `-1:-85`).
```{r dw9, echo=TRUE, collapse=TRUE}
flowers[-(1:85), -c(4, 7, 8)]
```
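To see why the brackets matter, here is a small sketch using the sequence 1 to 5 rather than 1 to 85 (the `ex_` object names are just for illustration). The unary minus is applied before the `:` operator, so without brackets only the first number is negated.

```{r dw9-sketch, echo=TRUE, collapse=TRUE}
ex_wrong <- -1:5      # minus applied first: a sequence from -1 up to 5
ex_right <- -(1:5)    # sequence 1 to 5 generated first, then negated
ex_also  <- -1:-5     # a sequence from -1 down to -5

ex_wrong
ex_right
ex_also
```

Only the last two are suitable as 'exclude these positions' indexes, as you can't mix positive and negative positions in the same index.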
In addition to using a positional index for extracting particular columns (variables) we can also name the variables directly when using the square bracket `[ ]` notation. For example, let's extract the first 5 rows and the variables `treat`, `nitrogen` and `leafarea`. Instead of using `flowers[1:5, c(1, 2, 6)]` we can instead use
```{r dw10, echo=TRUE, collapse=TRUE}
flowers[1:5, c("treat", "nitrogen", "leafarea")]
```
We often use this method in preference to the positional index for selecting columns as it will still give us what we want even if we've changed the order of the columns in our data frame for some reason.
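As a quick sketch of this point, the code below uses a tiny made-up data frame (not the `flowers` data, and the `ex_` names are just for illustration) and reorders its columns. The positional index now returns a different variable whereas the name still returns the one we asked for.

```{r dw10-sketch, echo=TRUE, collapse=TRUE}
# a small illustrative data frame
ex_plants <- data.frame(treat = c("tip", "notip"),
                        nitrogen = c("low", "high"),
                        height = c(7.5, 10.7))

# the same data with the columns in a different order
ex_reord <- ex_plants[, c("height", "treat", "nitrogen")]

ex_plants[, 1]         # position 1 is treat in the original
ex_reord[, 1]          # but position 1 is height after reordering
ex_reord[, "treat"]    # the name still returns treat
```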
### Logical indexes
Just as we did with vectors, we can also extract data from our data frame based on a logical test. We can use all of the logical operators that we used for our vector examples so if these have slipped your mind maybe [pop back](#Logical-index) and refresh your memory. Let's extract all rows where `height` is greater than 12 and extract all columns by default (remember, if you don't include a column index after the comma it means all columns).
```{r dw11, echo=TRUE, collapse=TRUE}
big_flowers <- flowers[flowers$height > 12, ]
big_flowers
```
Notice in the code above that we need to use the `flowers$height` notation for the logical test. If we just named the `height` variable without the name of the data frame we would receive an error telling us R couldn't find the variable `height`. The reason for this is that the `height` variable only exists inside the `flowers` data frame so you need to tell R exactly where it is.
```{r dw12, echo=TRUE, eval=FALSE}
big_flowers <- flowers[height > 12, ]
# Error in `[.data.frame`(flowers, height > 12, ) :
#   object 'height' not found
```
So how does this work? The logical test is `flowers$height > 12` and R will only extract those rows that satisfy this logical condition. If we look at the output of just the logical condition you can see this returns a vector containing `TRUE` if `height` is greater than 12 and `FALSE` if `height` is not greater than 12.
```{r dw13, echo=TRUE, collapse=TRUE}
flowers$height > 12
```
So our row index is a vector containing either `TRUE` or `FALSE` values and only those rows that are `TRUE` are selected.
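If it helps, here's a small sketch of the same idea using a made-up five row data frame (the `ex_` names are purely illustrative). Storing the result of the logical test lets us count how many rows pass the test with `sum()` and find their positions with `which()` before using it as a row index.

```{r dw13-sketch, echo=TRUE, collapse=TRUE}
ex_ht <- data.frame(id = 1:5, height = c(7.5, 13.1, 12.0, 14.2, 9.8))

ex_test <- ex_ht$height > 12
ex_test             # one TRUE or FALSE for each row
sum(ex_test)        # number of rows that satisfy the test
which(ex_test)      # positions of the rows that satisfy the test
ex_ht[ex_test, ]    # only the rows where the test is TRUE
```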
Other commonly used operators are shown below.
```{r dw14, echo=TRUE, eval=FALSE}
flowers[flowers$height >= 6, ]   # values greater than or equal to 6
flowers[flowers$height <= 6, ] # values less than or equal to 6
flowers[flowers$height == 8, ] # values equal to 8
flowers[flowers$height != 8, ] # values not equal to 8
```
We can also extract rows based on the value of a character string or factor level. Let's extract all rows where the `nitrogen` level is equal to `high` (again we will output all columns). Notice that the double equals `==` sign must be used for a logical test and that the character string must be enclosed in either single or double quotes (i.e. `"high"`).
```{r dw15, echo=TRUE,eval=FALSE}
nit_high <- flowers[flowers$nitrogen == "high", ]
nit_high
```
```{r dw15-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
nit_high <- flowers[flowers$nitrogen == "high", ]
nit_high
```
```{r dw15-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
nit_high <- flowers[flowers$nitrogen == "high", ]
rbind(head(nit_high, n = 10), tail(nit_high, n = 10))
```
Or we can extract all rows where `nitrogen` level is not equal to `medium` (using `!=`) and only return columns 1 to 4.
```{r dw16, echo=TRUE, eval=FALSE}
nit_not_medium <- flowers[flowers$nitrogen != "medium", 1:4]
nit_not_medium
```
```{r dw16-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
nit_not_medium <- flowers[flowers$nitrogen != "medium", 1:4]
nit_not_medium
```
```{r dw16-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
nit_not_medium <- flowers[flowers$nitrogen != "medium", 1:4]
rbind(head(nit_not_medium, n = 10), tail(nit_not_medium, n = 10))
```
We can increase the complexity of our logical tests by combining them with [Boolean expressions][boolean] just as we did for vector objects. For example, to extract all rows where `height` is greater or equal to `6` AND `nitrogen` is equal to `medium` AND `treat` is equal to `notip` we combine a series of logical expressions with the `&` symbol.
```{r dw17, echo=TRUE, collapse=TRUE}
med_notip_height6 <- flowers[flowers$height >= 6 & flowers$nitrogen == "medium" &
                        flowers$treat == "notip", ]
med_notip_height6
```
To extract rows based on an 'OR' Boolean expression we can use the `|` symbol. Let's extract all rows where `height` is greater than 12.3 OR less than 2.2.
```{r dw17.1, echo=TRUE, collapse=TRUE}
height2.2_12.3 <- flowers[flowers$height > 12.3 | flowers$height < 2.2, ]
height2.2_12.3
```
An alternative method of selecting parts of a data frame based on a logical expression is to use the `subset()`\index{subset()} function instead of the `[ ]`. The advantage of using `subset()` is that you no longer need to use the `$` notation when specifying variables inside the data frame as the first argument to the function is the name of the data frame to be subsetted. The disadvantage is that `subset()` is less flexible than the `[ ]` notation.
```{r dw18, echo=TRUE, collapse=TRUE}
tip_med_2 <- subset(flowers, treat == "tip" & nitrogen == "medium" & block == 2)
tip_med_2
```
And if you only want certain columns you can use the `select =` argument.
```{r dw19, echo=TRUE, collapse=TRUE}
tipplants <- subset(flowers, treat == "tip" & nitrogen == "medium" & block == 2,
select = c("treat", "nitrogen", "leafarea"))
tipplants
```
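To make the comparison with the `[ ]` notation concrete, here is a minimal sketch on a small made-up data frame (the data and `ex_` names are illustrative only). Both approaches should select the same rows and columns, the `subset()` version just involves less typing.

```{r dw19-sketch, echo=TRUE, collapse=TRUE}
ex_sub <- data.frame(treat = c("tip", "tip", "notip", "tip"),
                     nitrogen = c("medium", "low", "medium", "medium"),
                     leafarea = c(11.7, 14.1, 9.8, 16.3))

# using subset(): variables are named without the $ notation
ex_via_subset <- subset(ex_sub, treat == "tip" & nitrogen == "medium",
                        select = c("treat", "leafarea"))

# using [ ]: each variable needs the data frame name
ex_via_bracket <- ex_sub[ex_sub$treat == "tip" & ex_sub$nitrogen == "medium",
                         c("treat", "leafarea")]

ex_via_subset
all.equal(ex_via_subset, ex_via_bracket)
```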
### Ordering data frames
Remember when we used the `order()`\index{order()} function to order one vector based on the order of another vector (way back in [Chapter 2](#vec_ord))? This comes in very handy if you want to reorder rows in your data frame. For example, we can order all of the rows in the `flowers` data frame by ascending value of `height` and output all columns by default. If you're reading this section of the book on the web you can scroll down in the output panels to see the entire ordered data frame. If you're reading the pdf version of the book, note that some of the output from the code chunks has been truncated to save some space.
```{r dw20, echo=TRUE, eval=FALSE}
height_ord <- flowers[order(flowers$height), ]
height_ord
```
```{r dw20-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
height_ord <- flowers[order(flowers$height), ]
height_ord
```
```{r dw20-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
height_ord <- flowers[order(flowers$height), ]
head(height_ord, n = 15)
```
We can also order by descending order of a variable (e.g. `leafarea`) using the `decreasing = TRUE` argument.
```{r dw21, echo=TRUE, eval=FALSE}
leafarea_ord <- flowers[order(flowers$leafarea, decreasing = TRUE), ]
leafarea_ord
```
```{r dw21-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
leafarea_ord <- flowers[order(flowers$leafarea, decreasing = TRUE), ]
leafarea_ord
```
```{r dw21-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
leafarea_ord <- flowers[order(flowers$leafarea, decreasing = TRUE), ]
head(leafarea_ord, n = 15)
```
We can even order data frames based on multiple variables. For example, to order the data frame `flowers` in ascending order of both `block` and `height`.
```{r dw22, echo=TRUE, eval=FALSE}
block_height_ord <- flowers[order(flowers$block, flowers$height), ]
block_height_ord
```
```{r dw22-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
block_height_ord <- flowers[order(flowers$block, flowers$height), ]
block_height_ord
```
```{r dw22-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
block_height_ord <- flowers[order(flowers$block, flowers$height), ]
head(block_height_ord, n = 20)
```
What if we wanted to order `flowers` by ascending order of `block` but descending order of `height`? We can use a simple trick by adding a `-` symbol before the `flowers$height` variable when we use the `order()` function. This will essentially turn all of the `height` values negative which will result in reversing the order. Note that this trick will only work with numeric variables.
```{r dw23, echo=TRUE, eval=FALSE}
block_revheight_ord <- flowers[order(flowers$block, -flowers$height), ]
block_revheight_ord
```
```{r dw23-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
block_revheight_ord <- flowers[order(flowers$block, -flowers$height), ]
block_revheight_ord
```
```{r dw23-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
block_revheight_ord <- flowers[order(flowers$block, -flowers$height), ]
rbind(head(block_revheight_ord, n = 15), tail(block_revheight_ord, n = 15))
```
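The effect of the `-` trick is easier to see on a handful of values. The sketch below uses a made-up four row data frame (the `ex_` names are illustrative) and compares the row positions returned by `order()` with and without the minus sign.

```{r dw23-sketch, echo=TRUE, collapse=TRUE}
ex_ord <- data.frame(block = c(2, 1, 1, 2),
                     height = c(5.1, 9.3, 12.4, 7.7))

ex_asc <- order(ex_ord$block, ex_ord$height)     # ascending block, ascending height
ex_desc <- order(ex_ord$block, -ex_ord$height)   # ascending block, descending height

ex_asc
ex_desc
ex_ord[ex_desc, ]
```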
If we wanted to do the same thing with a factor (or character) variable like `nitrogen` we would need to use the function `xtfrm()`\index{xtfrm} for this variable inside our `order()` function.
```{r dw24, echo=TRUE, eval=FALSE}
revnit_height_ord <- flowers[order(-xtfrm(flowers$nitrogen), flowers$height), ]
revnit_height_ord
```
```{r dw24-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
revnit_height_ord <- flowers[order(-xtfrm(flowers$nitrogen), flowers$height), ]
revnit_height_ord
```
```{r dw24-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
revnit_height_ord <- flowers[order(-xtfrm(flowers$nitrogen), flowers$height), ]
rbind(head(revnit_height_ord, n = 15), tail(revnit_height_ord, n = 15))
```
Notice that the `nitrogen` variable has been reverse ordered alphabetically and `height` has been ordered by increasing values within each level of `nitrogen`.
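If you're wondering what `xtfrm()` is actually doing, the sketch below applies it to a small made-up factor (the `ex_` names are illustrative). For a factor it returns the integer code of each value's level, which gives us something numeric that we can negate inside `order()`.

```{r dw24-sketch, echo=TRUE, collapse=TRUE}
ex_nit <- factor(c("low", "high", "medium", "high"))

levels(ex_nit)                          # default levels are alphabetical
xtfrm(ex_nit)                           # the numeric code for each value
ex_rev <- order(-xtfrm(ex_nit))         # positions giving reverse level order
ex_nit[ex_rev]
```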
If we wanted to order the data frame by `nitrogen` but this time order it from `low` -> `medium` -> `high` instead of the default alphabetically (`high`, `low`, `medium`), we need to first change the order of our levels of the `nitrogen` factor in our data frame using the `factor()`\index{factor()} function. Once we've done this we can then use the `order()` function as usual. Note, if you're reading the pdf version of this book, the output has been truncated to save space.
```{r dw24a, echo=TRUE, eval=FALSE}
flowers$nitrogen <- factor(flowers$nitrogen,
levels = c("low", "medium", "high"))
nit_ord <- flowers[order(flowers$nitrogen),]
nit_ord
```
```{r dw24a-html, echo=FALSE, collapse=TRUE, eval=knitr::is_html_output(), warning=FALSE, attr.output='style="max-height: 500px;"'}
flowers$nitrogen <- factor(flowers$nitrogen,
levels = c("low", "medium", "high"))
nit_ord <- flowers[order(flowers$nitrogen),]
nit_ord
```
```{r dw24a-latex, echo=FALSE, collapse=TRUE, eval=knitr::is_latex_output(), warning=FALSE}
flowers$nitrogen <- factor(flowers$nitrogen,
levels = c("low", "medium", "high"))
nit_ord <- flowers[order(flowers$nitrogen),]
rbind(head(nit_ord, n = 15), tail(nit_ord, n = 15))
```
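Here is a minimal sketch of what changing the levels does, using a short made-up character vector rather than the `flowers` data (the `ex_` names are illustrative). The values themselves are untouched, only the order of the levels changes, and it's this order that sorting functions follow.

```{r dw24a-sketch, echo=TRUE, collapse=TRUE}
ex_lev <- c("medium", "high", "low", "medium")

ex_default <- factor(ex_lev)                                     # alphabetical levels
ex_custom <- factor(ex_lev, levels = c("low", "medium", "high"))  # our own order

levels(ex_default)
levels(ex_custom)
sort(ex_default)
sort(ex_custom)
```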
### Adding columns and rows
Sometimes it's useful to be able to add extra rows and columns of data to our data frames. There are multiple ways to achieve this (as there always is in R!) depending on your circumstances. To simply append additional rows to an existing data frame we can use the `rbind()`\index{rbind()} function and to append columns the `cbind()`\index{cbind()} function. Let's create a couple of test data frames to see this in action using our old friend the `data.frame()` function.
```{r dw25, echo=TRUE, collapse=TRUE}
# rbind for rows
df1 <- data.frame(id = 1:4, height = c(120, 150, 132, 122),
weight = c(44, 56, 49, 45))
df1
df2 <- data.frame(id = 5:6, height = c(119, 110),
weight = c(39, 35))
df2
df3 <- data.frame(id = 1:4, height = c(120, 150, 132, 122),
weight = c(44, 56, 49, 45))
df3
df4 <- data.frame(location = c("UK", "CZ", "CZ", "UK"))
df4
```
We can use the `rbind()` function to append the rows of data in `df2` to the rows in `df1` and assign the new data frame to `df_rcomb`.
```{r dw25.1, echo=TRUE, collapse=TRUE}
df_rcomb <- rbind(df1, df2)
df_rcomb
```
And we can use the `cbind()` function to append the column in `df4` to the `df3` data frame and assign the result to `df_ccomb`.
```{r dw25.2, echo=TRUE, collapse=TRUE}
df_ccomb <- cbind(df3, df4)
df_ccomb
```
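One detail worth knowing is that when you use `rbind()` with data frames the columns are matched by name rather than by position. The sketch below uses two small made-up data frames (the `ex_` names are illustrative) that have the same columns in a different order.

```{r dw25-sketch, echo=TRUE, collapse=TRUE}
ex_a <- data.frame(id = 1:2, height = c(120, 150))
ex_b <- data.frame(height = c(119, 110), id = 3:4)   # same columns, different order

ex_ab <- rbind(ex_a, ex_b)    # column order is taken from the first data frame
ex_ab
```

The `cbind()` function, on the other hand, simply places the columns side by side, so it's up to you to make sure the rows of both data frames are in the same order.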
Another situation when adding a new column to a data frame is useful is when you want to perform some kind of transformation on an existing variable. For example, say we wanted to apply a log~10~ transformation on the height variable in the `df_rcomb` data frame we created above. We could just create a separate variable to contain these values but it's good practice to create this variable as a new column inside our existing data frame so we keep all of our data together. Let's call this new variable `height_log10`.
```{r dw26, echo=TRUE, collapse=TRUE}
# log10 transformation
df_rcomb$height_log10 <- log10(df_rcomb$height)
df_rcomb
```
This situation also crops up when we want to convert an existing variable in a data frame from one data class to another data class. For example, the `id` variable in the `df_rcomb` data frame is stored as integer (numeric) type data (use the `str()` or `class()` functions to check for yourself). If we wanted to convert the `id` variable to a factor to use later in our analysis we can create a new variable called `Fid` in our data frame and use the `factor()`\index{factor()} function to convert the `id` variable.
```{r dw27, echo=TRUE, collapse=TRUE}
# convert to a factor
df_rcomb$Fid <- factor(df_rcomb$id)
df_rcomb
str(df_rcomb)
```
### Merging data frames
Instead of just appending either rows or columns to a data frame we can also merge two data frames together. Let's say we have one data frame that contains taxonomic information on some common UK rocky shore invertebrates (called `taxa`) and another data frame that contains information on where they are usually found on the rocky shore (called `zone`). We can merge these two data frames together to produce a single data frame with both taxonomic and location information. Let's first create both of these data frames (in reality you would probably just import your different datasets).
```{r dw28, echo=TRUE, collapse=TRUE}
taxa <- data.frame(GENUS = c("Patella", "Littorina", "Halichondria", "Semibalanus"),
                   species = c("vulgata", "littorea", "panicea", "balanoides"),
                   family = c("Patellidae", "Littorinidae", "Halichondriidae", "Archaeobalanidae"))
taxa
zone <- data.frame(genus = c("Laminaria", "Halichondria", "Xanthoria", "Littorina",
                             "Semibalanus", "Fucus"),
                   species = c("digitata", "panicea", "parietina", "littorea",
                               "balanoides", "serratus"),
                   zone = c("v_low", "low", "v_high", "low_mid", "high", "low_mid"))
zone
```
Because both of our data frames contain at least one variable in common (`species` in our case) we can simply use the `merge()`\index{merge()} function to create a new data frame called `taxa_zone`.
```{r dw29, echo=TRUE, collapse=TRUE}
taxa_zone <- merge(x = taxa, y = zone)
taxa_zone
```
Notice that the merged data frame contains only the rows that have `species` information in **both** data frames. There are also two columns called `GENUS` and `genus` because the `merge()` function treats these as two different variables that originate from the two data frames.
If we want to include all data from both data frames then we will need to use the `all = TRUE` argument. The missing values will be included as `NA`.
```{r dw30, echo=TRUE, collapse=TRUE}
taxa_zone <- merge(x = taxa, y = zone, all = TRUE)
taxa_zone
```
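As a rough sketch of how these options compare, the code below merges two tiny made-up data frames (the data and `ex_` names are illustrative only). It also uses the `all.x = TRUE` argument, which isn't covered above, to keep all of the rows from the first data frame only.

```{r dw30-sketch, echo=TRUE, collapse=TRUE}
ex_taxa <- data.frame(species = c("vulgata", "balanoides"),
                      family = c("Patellidae", "Archaeobalanidae"))
ex_zone <- data.frame(species = c("balanoides", "serratus"),
                      zone = c("high", "low_mid"))

# the column(s) merge() will use by default
intersect(names(ex_taxa), names(ex_zone))

ex_inner <- merge(ex_taxa, ex_zone)                 # rows found in both
ex_left <- merge(ex_taxa, ex_zone, all.x = TRUE)    # all rows from ex_taxa
ex_full <- merge(ex_taxa, ex_zone, all = TRUE)      # all rows from both

ex_inner
ex_left
ex_full
```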
If the variable names that you want to base the merge on are different in each data frame (for example `GENUS` and `genus`) you can specify the names in the first data frame (known as `x`) and the second data frame (known as `y`) using the `by.x =` and `by.y =` arguments.
```{r dw31, echo=TRUE, collapse=TRUE}
taxa_zone <- merge(x = taxa, y = zone, by.x = "GENUS", by.y = "genus", all = TRUE)
taxa_zone
```
Or using multiple variable names.
```{r dw31.1, echo=TRUE, collapse=TRUE}
taxa_zone <- merge(x = taxa, y = zone, by.x = c("species", "GENUS"),
by.y = c("species", "genus"), all = TRUE)
taxa_zone
```
### Reshaping data frames {#reshape}
Reshaping data into different formats is a common task. With rectangular type data (data frames have the same number of rows in each column) there are two main data frame shapes that you will come across: the 'long' format (sometimes called stacked) and the 'wide' format. An example of a long format data frame is given below. We can see that each row is a single observation from an individual subject and each subject can have multiple rows. This results in a single column of our `measurement`.
```{r dw32, echo=TRUE, collapse=TRUE}
long_data <- data.frame(
subject = rep(c("A", "B", "C", "D"), each = 3),
  sex = rep(c("M", "F", "F", "M"), each = 3),
condition = rep(c("control", "cond1", "cond2"), times = 4),
measurement = c(12.9, 14.2, 8.7, 5.2, 12.6, 10.1, 8.9,
12.1, 14.2, 10.5, 12.9, 11.9))
long_data
```
\
We can also format the same data in the wide format as shown below. In this format we have multiple observations from each subject in a single row with measurements in different columns (`control`, `cond1` and `cond2`). This is a common format when you have repeated measurements from sampling units.
```{r dw34, echo=TRUE, collapse=TRUE}
wide_data <- data.frame(subject = c("A", "B", "C", "D"),
sex = c("M", "F", "F", "M"),
control = c(12.9, 5.2, 8.9, 10.5),
cond1 = c(14.2, 12.6, 12.1, 12.9),
cond2 = c(8.7, 10.1, 14.2, 11.9))
wide_data
```
\
Whilst there's no inherent problem with either of these formats we will sometimes need to convert between the two because some functions will require a specific format for them to work. The most common format is the long format.
There are many ways to convert between these two formats but we'll use the `melt()`\index{melt()} and `dcast()`\index{dcast()} functions from the `reshape2`\index{reshape2 package} package (you will need to install this package first). The `melt()` function is used to convert from wide to long formats. The first argument for the `melt()` function is the data frame we want to melt (in our case `wide_data`). The `id.vars = c("subject", "sex")` argument is a vector of the identifier variables you want to keep as they are (their values are repeated alongside each stacked measurement), the `measure.vars = c("control", "cond1", "cond2")` argument identifies the columns of the measurements in different conditions that you want to stack, the `variable.name = "condition"` argument specifies what you want to call the stacked column of your different conditions in your output data frame and `value.name = "measurement"` is the name of the column of your stacked measurements in your output data frame.
```{r dw36, echo=TRUE, collapse=TRUE}
library(reshape2)
wide_data # remind ourselves what the wide format looks like
# convert wide to long
my_long_df <- melt(data = wide_data, id.vars = c("subject", "sex"),
                   measure.vars = c("control", "cond1", "cond2"),
variable.name = "condition", value.name = "measurement")
my_long_df
```
The `dcast()` function is used to convert from a long format data frame to a wide format data frame. The first argument is again the data frame we want to cast (`long_data` for this example). The second argument is in formula syntax. The `subject + sex` bit of the formula means that we want to keep these columns separate, and the `~ condition` part is the column that contains the labels that we want to split into new columns in our new data frame. The `value.var = "measurement"` argument is the column that contains the measured data.
```{r dw37, echo=TRUE, collapse=TRUE}
long_data # remind ourselves what the long format looks like
# convert long to wide
my_wide_df <- dcast(data = long_data, subject + sex ~ condition,
value.var = "measurement")
my_wide_df
```
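If you'd rather not install an extra package, one of the 'many ways' mentioned above is the base R `reshape()` function. The code below is only a sketch on a cut down, made-up version of the wide data (the `ex_` names are illustrative) and is not the approach used in the rest of this book. The `reshape()` arguments are a little less intuitive than those of `melt()`.

```{r dw37-sketch, echo=TRUE, collapse=TRUE}
ex_wide <- data.frame(subject = c("A", "B"),
                      sex = c("M", "F"),
                      control = c(12.9, 5.2),
                      cond1 = c(14.2, 12.6))

# wide to long using base R
ex_long <- reshape(ex_wide, direction = "long",
                   varying = c("control", "cond1"),   # columns to stack
                   v.names = "measurement",           # name of the stacked column
                   timevar = "condition",             # name of the label column
                   times = c("control", "cond1"),     # labels for each stacked column
                   idvar = "subject")                 # identifies each subject
ex_long
```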
## Summarising data frames
Now that we're able to manipulate and extract data from our data frames, our next task is to start exploring and getting to know our data. In this section we'll start producing tables of useful summary statistics of the variables in our data frame and in the next two chapters we'll cover visualising our data with base R graphics and using the `ggplot2` package.
A really useful starting point is to produce some simple summary statistics of all of the variables in our `flowers` data frame using the `summary()`\index{summary()} function.
```{r sum1, echo=TRUE, collapse=TRUE}
summary(flowers)
```
For numeric variables (e.g. `height`, `weight` etc) the mean, minimum, maximum, median, first (lower) quartile and third (upper) quartile are presented. For factor variables (e.g. `treat` and `nitrogen`) the number of observations in each of the factor levels is given. If a variable contains missing data then the number of `NA` values is also reported.
If we wanted to summarise a smaller subset of variables in our data frame we can use our indexing skills in combination with the `summary()` function. For example, to summarise only the `height`, `weight`, `leafarea` and `shootarea` variables we can include the appropriate column indexes when using the `[ ]`. Notice we include all rows by not specifying a row index.
```{r sum2, echo=TRUE, collapse=TRUE}
summary(flowers[, 4:7])
# or equivalently
# summary(flowers[, c("height", "weight", "leafarea", "shootarea")])
```
And to summarise a single variable.
```{r sum3, echo=TRUE, collapse=TRUE}
summary(flowers$leafarea)
# or equivalently
# summary(flowers[, 6])
```
As you've seen above, the `summary()` function reports the number of observations in each level of our factor variables. Another useful function for generating tables of counts is the `table()`\index{table()} function. The `table()` function can be used to build contingency tables of different combinations of factor levels. For example, to count the number of observations for each level of `nitrogen`
```{r sum4, echo=TRUE, collapse=TRUE}
table(flowers$nitrogen)
```
We can extend this further by producing a table of counts for each combination of `nitrogen` and `treat` factor levels.
```{r sum5, echo=TRUE, collapse=TRUE}
table(flowers$nitrogen, flowers$treat)
```
A more flexible version of the `table()` function is the `xtabs()`\index{xtabs()} function. The `xtabs()` function uses a formula notation (`~`) to build contingency tables with the cross-classifying variables separated by a `+` symbol on the right hand side of the formula. `xtabs()` also has a useful `data =` argument so you don't have to include the data frame name when specifying each variable.
```{r sum6, echo=TRUE, collapse=TRUE}
xtabs(~ nitrogen + treat, data = flowers)
```
We can even build more complicated contingency tables using more variables. Note, in the example below the `xtabs()` function has quietly coerced our `block` variable to a factor.
```{r sum7, echo=TRUE, collapse=TRUE}
xtabs(~ nitrogen + treat + block, data = flowers)
```
And for a nicer formatted table we can nest the `xtabs()` function inside the `ftable()`\index{ftable()} function to 'flatten' the table.
```{r sum8, echo=TRUE, collapse=TRUE}
ftable(xtabs(~ nitrogen + treat + block, data = flowers))
```
We can also summarise our data for each level of a factor variable. Let's say we want to calculate the mean value of `height` for each of our `low`, `medium` and `high` levels of `nitrogen`. To do this we will use the `mean()`\index{mean()} function and apply this to the `height` variable for each level of `nitrogen` using the `tapply()`\index{tapply()} function.
```{r sum9, echo=TRUE, collapse=TRUE}
tapply(flowers$height, flowers$nitrogen, mean)
```
The `tapply()` function is not just restricted to calculating mean values, you can use it to apply many of the functions that come with R or even functions you've written yourself (see [Chapter 7](#prog_r) for more details). For example, we can apply the `sd()`\index{sd()} function to calculate the standard deviation for each level of `nitrogen` or even the `summary()`\index{summary()} function.
```{r sum10, echo=TRUE, collapse=TRUE}
tapply(flowers$height, flowers$nitrogen, sd)
tapply(flowers$height, flowers$nitrogen, summary)
```
Note, if the variable you want to summarise contains missing values (`NA`) you will also need to include an argument specifying how you want the function to deal with the `NA` values. We saw an example of this in [Chapter 2](#na_vals) where the `mean()`\index{mean()} function returned an `NA` when we had missing data. To include the `na.rm = TRUE` argument we simply add this as another argument when using `tapply()`.
```{r sum11, echo=TRUE, collapse=TRUE}
tapply(flowers$height, flowers$nitrogen, mean, na.rm = TRUE)
```
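To see the difference this argument makes, here is a small sketch with a made-up data frame that contains a single missing `height` value (the data and `ex_` names are illustrative only).

```{r sum11-sketch, echo=TRUE, collapse=TRUE}
ex_na <- data.frame(nitrogen = c("low", "low", "high", "high"),
                    height = c(7.5, NA, 10.7, 11.3))

# without na.rm the mean for the low group is NA
ex_mean_na <- tapply(ex_na$height, ex_na$nitrogen, mean)
ex_mean_na

# with na.rm = TRUE the missing value is dropped before the mean is calculated
ex_mean_rm <- tapply(ex_na$height, ex_na$nitrogen, mean, na.rm = TRUE)
ex_mean_rm
```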
We can also use `tapply()` to apply functions to more than one factor. The only thing to remember is that the factors need to be supplied to the `tapply()` function in the form of a list using the `list()`\index{list()} function. To calculate the mean `height` for each combination of `nitrogen` and `treat` factor levels we can use the `list(flowers$nitrogen, flowers$treat)` notation.
```{r sum12, echo=TRUE, collapse=TRUE}
tapply(flowers$height, list(flowers$nitrogen, flowers$treat), mean)
```
And if you get a little fed up with having to write `flowers$` for every variable you can nest the `tapply()` function inside the `with()`\index{with()} function. The `with()` function allows R to evaluate an R expression with respect to a named data object (in this case `flowers`).
```{r sum13, echo=TRUE, collapse=TRUE}
with(flowers, tapply(height, list(nitrogen, treat), mean))
```
The `with()` function also works with many other functions and can save you a lot of typing!
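For instance, here is a short sketch using `with()` and a few functions we've already met, applied to a small made-up data frame (the `ex_` names are illustrative).

```{r sum13-sketch, echo=TRUE, collapse=TRUE}
ex_w <- data.frame(treat = c("tip", "notip", "tip", "notip"),
                   height = c(7.5, 10.7, 9.1, 12.3))

with(ex_w, mean(height))                       # instead of mean(ex_w$height)
with(ex_w, table(treat))                       # instead of table(ex_w$treat)
ex_tip_ht <- with(ex_w, height[treat == "tip"])  # indexing works too
ex_tip_ht
```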
Another really useful function for summarising data is the `aggregate()`\index{aggregate()} function. The `aggregate()` function works in a very similar way to `tapply()` but is a bit more flexible.
For example, to calculate the mean of the variables `height`, `weight`, `leafarea` and `shootarea` for each level of `nitrogen`.
```{r sum14, echo=TRUE, collapse=TRUE}
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen), FUN = mean)
```
In the code above we have indexed the columns we want to summarise in the `flowers` data frame using `flowers[, 4:7]`. The `by =` argument specifies a list of factors (`list(nitrogen = flowers$nitrogen)`) and the `FUN =` argument names the function to apply (`mean` in this example).
Similar to the `tapply()` function we can include more than one factor to apply a function to. Here we calculate the mean values for each combination of `nitrogen` and `treat`.
```{r sum15, echo=TRUE, collapse=TRUE}
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen,
treat = flowers$treat), FUN = mean)
```
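The `aggregate()` function can also be used with a formula and a `data =` argument, which saves having to index the data frame and build a list for the `by =` argument. The sketch below shows this on a small made-up data frame (the data and `ex_` names are illustrative only). The variables to summarise go on the left of the `~` (wrapped in `cbind()` if there is more than one) and the grouping variables on the right.

```{r sum15-sketch, echo=TRUE, collapse=TRUE}
ex_agg <- data.frame(nitrogen = c("low", "low", "high", "high"),
                     height = c(7.5, 8.5, 10.0, 12.0),
                     weight = c(5.0, 7.0, 9.0, 11.0))

# mean height for each level of nitrogen
ex_one <- aggregate(height ~ nitrogen, data = ex_agg, FUN = mean)
ex_one

# mean height and weight for each level of nitrogen
ex_two <- aggregate(cbind(height, weight) ~ nitrogen, data = ex_agg, FUN = mean)
ex_two
```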