-
Notifications
You must be signed in to change notification settings - Fork 42
/
08-importing-data.qmd
220 lines (136 loc) · 7.39 KB
/
08-importing-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
# Importing data
## R base functions
We include example data files for practice in the __dslabs__ package. They are stored here:
```{r}
dir <- system.file("extdata", package = "dslabs")
```
Take a look:
```{r}
list.files(dir)
```
Copy one of them to your working directory:
```{r}
file_path <- file.path(dir, "murders.csv")
file.copy(file_path, "murders.csv")
```
The `file.path` function combines characters to form a complete path, ensuring compatibility with the respective operating system. Linux and Mac use forward slashes `/`, while Windows uses backslashes `\`, to separate directories. This function is useful because often you want to define paths using a variable.
The `file.copy` function copies a file and returns `TRUE` if succesful. If the file exists it will not copy.
What kind of file is it? Although the suffix usually tells us what type of file it is, there is no guarantee that these always match.
```{r}
readLines("murders.csv", n = 3)
```
It is comma delimited and has a header. You can import it like this:
```{r}
dat <- read.csv("murders.csv")
```
There are other importing function in base R: `read.table`, `read.csv` and `read.delim`, for example.
## The readr and readxl packages
Tidyverse has improved versions of functions for importing data.
### readr
The __readr__ package includes functions for reading data stored in text file spreadsheets into R. __readr__ is part of the __tidyverse__ package, but you can load it directly using:
```{r, warning=FALSE, message=FALSE}
library(readr)
```
The following functions are available to read-in spreadsheets:
| Function | Format | Typical suffix |
|-----------|--------------------------------------------------|----------------|
| read_table| white space separated values | txt |
| read_csv | comma separated values | csv |
| read_csv2 | semicolon separated values | csv |
| read_tsv | tab delimited separated values | tsv |
| read_delim | general text file format, must define delimiter | txt |
the __readr__ equivalent of `readLines` is `read_lines`:
```{r}
read_lines("murders.csv", n_max = 3)
```
From the .csv suffix and the peek at the file, we know to use `read_csv`:
```{r}
dat <- read_csv("murders.csv")
```
Note that we receive a message letting us know what data types were used for each column. Also note that `dat` is a `tibble`, not just a data frame. This is because `read_csv` is a __tidyverse__ parser.
A powerful added feature of `read_csv` is the `col_type` arguments that let's you specify the data type of each column before reading. This can help with parsing dates or not letting an error like a letter in a column of numbers turn everything into a character.
### readxl
Many spreadsheets are saved in Microsoft Excel format. For this we use parsers in the __readxl__ package:
```{r}
library(readxl)
```
The package provides functions to read-in Microsoft Excel formats:
| Function | Format | Typical suffix |
|-----------|--------------------------------------------------|----------------|
| read_excel | auto detect the format | xls, xlsx|
| read_xls | original format | xls |
| read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as _sheets_. The functions listed above read the first sheet by default, but we can also read the others. The `excel_sheets` function gives us the names of all the sheets in an Excel file. These names can then be passed to the `sheet` argument in the three functions above to read sheets other than the first.
## Downloading files
A common place for data to reside is on the internet. When these data are in files, we can download them and then import them or even read them directly from the web.
```{r}
url <-
"https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
```
The `read_csv` file can read these files directly:
```{r, message = FALSE}
dat <- read_csv(url)
```
You can also download the file first using `download.file` or the Unix commands `curl` or `wget`.
```{r}
tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
file.remove(tmp_filename)
```
## Encoding
RStudio assumes the Unicode encoding. A common pitfall in data analysis is assuming a file is Unicode when, in fact, it is something else.
To understand encoding, remember that everything on a computer needs to eventually be converted to 0s and 1s. ASCII is an _encoding_ that maps characters to numbers. ASCII uses 7 bits (0s and 1s) which results in $2^7 = 128$ unique items, enough to encode all the characters on an English language keyboard. However, other languages use characters not included in this encoding. For example, the é in México is not encoded by ASCII. For this reason, a new encoding, using more than 7 bits, was defined: Unicode. When using Unicode, one can chose between 8, 16, and 32 bits abbreviated UTF-8, UTF-16, and UTF-32 respectively. RStudio defaults to UTF-8 encoding. ASCII is a subset of UTF-8.
Try reading in this file:
```{r}
url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/calificaciones.csv"
readLines(url, n = 2)
```
When you see these weird characters the problem is almost always that you are assuming the wrong encoding. You need to be a hacker to figure out, __readr__ has a function that tries:
```{r}
guess_encoding(url)
```
The first guess makes sense as Spanish is often saved using `Latin-1` encoding, also known as `ISO-8859` encoding because it was the first to include accents and other characters used in Spanish. Once we figure this out we can read in the file correctly:
```{r}
read_csv(url, locale = locale(encoding = "ISO-8859-1", decimal_mark = ","))
```
```{r}
#| echo: false
file.remove("murders.csv")
```
## Exercises
(@) Use the `read_csv` function to read each of the csv files that the following code saves in the `files` object. Hint: use the `pattern` in `list.files` to keep only the csv files.
```{r}
library(readr)
path <- system.file("extdata", package = "dslabs")
files <- list.files(path, pattern = ".csv")
res <- lapply(files, function(fn)
read_csv(file.path(path, fn), show_col_types = FALSE))
```
(@) Note that you get a warning. To see which one you can run it one-by-one in a loop:
```{r}
for (i in seq_along(files)) {
print(files[i])
read_csv(file.path(path, files[i]), show_col_types = FALSE)
}
```
`olive.csv` gives us a `New names` warning. This is because the first line of the file is missing the header for the first column.
```{r}
read_lines(file.path(path, "olive.csv"), n_max = 2)
```
Read the help file for `read_csv` to figure out how to read in the file without reading this header. If you skip the header, you should not get this warning. Save the result to an object called `dat`.
```{r}
read_csv(file.path(path, "olive.csv"), col_names = FALSE, skip = 1)
```
(@) A problem with the previous approach is that we don't know what the columns represent. Type `names(dat)` to see that the names are not informative.
Use the `read_lines ` with argument `n_max=1` to read just the first line.
```{r}
read_lines(file.path(path, "olive.csv"), n_max = 1)
```
Notice that you can use this to assign names to the data frame.
```{r}
colnames <- read_lines(file.path(path, "olive.csv"), n_max = 1)
colnames <- strsplit(colnames, ",") |> unlist()
colnames[1] <- "row_number"
names(dat) <- colnames
```