-
Notifications
You must be signed in to change notification settings - Fork 1
/
05_readr.Rmd
154 lines (120 loc) · 4.55 KB
/
05_readr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
title: "05_readr"
output: html_document
---
class: middle, center, inverse
layout: false
# 4.3 `readr`:<br><br>Read Rectangular Text Data
???
- not data in the form of texts, but as stored in a text file (txt, csv, excel file)
---
background-image: url(https://raw.githubusercontent.com/tidyverse/readr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true
---
## 4.3 `readr`: Read Rectangular Text Data
`readr` provides read and write functions for multiple different file formats:
- `read_delim()`: general delimited files
- `read_csv()`: comma separated files
- `read_csv2()`: semicolon separated files
- `read_tsv()`: tab separated files
- `read_fwf()`: fixed width files
- `read_table()`: white-space separated files
- `read_log()`: web log files
Conveniently, the `write_*()` functions work analog. In addition, use the `readxl` package for Excel files, the `haven` package for Stata files, the `googlesheets4` package for Google Sheets or the `rvest` package for HTML files.
.footnote[
*Note: In most European countries Microsoft Excel is using `;` as the common delimiter, which can be accounted for by leveraging the `read_csv2()` function.*
]
???
- `read_delim()` as a generalization of the other functions
- `rvest` as the go-to package in the context of web scraping with `R`
---
## 4.3 `readr`: Read Rectangular Text Data
Let's try it out by reading in the penguins data. For the purpose of illustrating the `readr` package, the `penguins` data is written to a csv-file a priori using `write_csv(penguins, file = "./data/penguins.csv")`.
.panelset[
.panel[.panel-name[Base Case]
```{r}
data <- read_csv(file = "./data/penguins.csv")
```
]
.panel[.panel-name[Select Columns]
```{r}
data <- read_csv(file = "./data/penguins.csv", col_select = c(species, island))
```
]
.panel[.panel-name[Name Columns]
```{r}
data <- read_csv(file = "./data/penguins.csv", col_names = paste("Var", 1:8, sep = "_"))
```
]
.panel[.panel-name[Skip Rows]
```{r}
data <- read_csv(file = "./data/penguins.csv", skip = 5)
```
]
]
.pull-right[.pull-right[
.footnote[
<i>Note: The output of any `read_*()` function is a `tibble` object.</i>
]]]
---
## 4.3 `readr`: Read Rectangular Text Data
`readr` prints the column specifications after importing. By default, it tries to infer the column type (e.g., `int`, `dbl`, `chr`, `fct`, `date`, `lgl`) from the first 1,000 rows and parses the columns accordingly.
Try to make column specifications explicit! You likely get more familiar with your data and see warnings if something changes unexpectedly.
.panelset[
.panel[.panel-name[Option 1]
```{r, eval=F}
read_csv(
file = "./data/penguins.csv",
col_types = cols(
species = col_character(),
year = col_datetime(format = "%Y"),
island = col_skip())
)
```
]
.panel[.panel-name[Option 2]
```{r, eval=F}
read_csv(
file = "./data/penguins.csv",
col_types = "_f?di" # skip, factor, guess, double, integer, ...
)
```
]
]
Parsing only the first 1,000 rows is efficient but can lead to erroneous guesses:
```{r, eval=F}
read_csv(file = "./data/penguins.csv", guess_max = 2000)
```
.footnote[
*Note: Find more information and functions on the `readr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf).*
]
???
- Hint: sometimes you may have trouble when reading in text data (type character): special signs such as ö, ä or ü may be strangely encoded as cryptic symbols -> in those cases you must control for the encoding of your data in the read_csv function (e.g., UTF-8)
---
## 4.3 `readr`: Read Rectangular Text Data
.pull-left[
Eventually, you would want to cease using `.xlsx` and `.csv` files as they are not capable of reliably storing your metadata (e.g., data types).
```{r, echo=F, fig.align='center', out.height='60%', out.width='60%'}
knitr::include_graphics("./img/excel.jpg")
```
]
--
.pull-right[
`write_rds()` and `read_rds()` provide a nice alternative for [serializing](https://en.wikipedia.org/wiki/Serialization) your `R` objects (e.g., `tibbles`, models) and storing them as `.rds` files.
```{r}
penguins %>%
write_rds(file = "./data/penguins.rds")
```
```{r}
penguins <- read_rds(file = "./data/penguins.rds")
```
<br>
Note that
- `write_rds()` can only be used to save one object at a time,
- a loaded `.rds` file must be stored into a new variable, i.e. given a new name,
- `read_rds()` preserves data types!
]
???
- serialization: the process of translating a data structure or object state into a format that can be stored, transmitted and reconstructed later (possibly in a different computer environment).