-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-importing.Rmd
executable file
·281 lines (210 loc) · 7.27 KB
/
01-importing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
---
title: 'Week 03: Examples for Data Importing'
---
# Load packages
```{r, warning=F, message=F}
library(tidyverse)
library(readr)
library(readxl)
```
# Let's focus on readr package first
# Check the current directory
```{r}
getwd()
```
```{r}
cth_data <- read_csv("data/CTH_raw_na.csv")
```
# Visit Environmental Performance Index at https://epi.yale.edu/
# The Environmental Performance Index (EPI) ranks 180 countries on 32 performance indicators across 11 issue
# categories covering environmental health and ecosystem vitality.
# Specifically, environmental performance indicators (EPIs) examine environmental issues such as pollution, biodiversity, climate, # energy, erosion, ecosystem services, environmental education, and many others.
#####This year, Denmark ranks #1 in the EPI#####
## https://epi.yale.edu/downloads/epipolicymakersummaryr11.pdf
# Go to "Downloads > EPI2020Raw data"
# https://epi.yale.edu/downloads/epi2020rawdata20200604v02.zip
# Download this file to your Desktop and unzip this file.
# Focus on CTH_raw_na file and move this file to data folder under your working directory
```{r}
# move the file to working directory
#file.copy("/Users/gulinan/Desktop/epi2020rawdata20200604v02/CTH_raw_na.csv", "data/CTH_raw_na.csv")
# When your working directory is the path provided by
# getwd() function, but you want to access the data folder and
# list all the files in it:
list.files(file.path(getwd(), "data"))
```
# CTH_raw_na is a csv file
# Have a look at the first three lines of the CTH data
```{r}
# read_lines is a function from readr package
# ?read_lines
# read_lines() reads up to n_max lines from a file.
read_lines("data/CTH_raw_na.csv", n_max=3)
```
# The first line is header
# read_* functions all assume that col_names=T by default.
# read_* functions always skip empty rows through ski_empty_rows=T by default
```{r}
# more info
?read_csv
```
```{r}
# Read the data into R console
cth_data <- read_csv("data/CTH_raw_na.csv")
```
# When everything is OK, you will see your data under Environement section
```{r}
# Investigate data carefully
View(cth_data)
```
```{r}
# Check the types of variables: character, numeric, integer
str(cth_data)
```
# Note: NA stands for missing values in the data
```{r}
# Check the class of the data
class(cth_data)
```
```{r}
# Check the dimensions
dim(cth_data)
```
```{r}
# Check the column names
colnames(cth_data)
```
```{r}
# Not comfortable with column names?
colnames(cth_data)[-seq(1:3)] <- paste(seq(1950,2014), sep = "")
```
```{r}
# Check the column names one more time!
colnames(cth_data)
```
# Since in readr function path is defined as:
# "Either a path to a file, a connection",
# we can also import csv files into R as well.
# Let's import a csv file on the internet into R now
# Visit Google' Covid-19 mobility report
# at https://www.google.com/covid19/mobility/
# Global data is available to be downloaded as a csv file
# at the following url:
# https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=c050b74b9ee831a7
```{r}
# First specify the url address of the data
url <- "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=c050b74b9ee831a7"
# Then read it into R (It takes a bit time)
google_mobility <- read_csv(url)
```
```{r}
# Investigate it as you wish
View(google_mobility)
```
```{r}
# or download this file into your local computer
# download.file functions is under utils library
download.file(url, "data/google_mobility.csv")
```
# Let's focus on readxl package
# readxl does not come with tidyverse.
# For that reason, install readxl package and then load it
```{r}
# install.packages("readxl")
library(readxl)
```
# Visit at the web site of Monitoring the situation of
# children and women in Europe and Central Asia
# at http://transmonee.org/
# Go to http://transmonee.org/database/ and
# Download Excel file named
# "Population at the beginning of the year by sex and selected age groups"
# into to your working directory and rename it as Population-1989-2019
# since the file name is very long.
```{r}
# List the sheet names
excel_sheets("data/Population-1989-2019.xlsx")
```
# read_excel() reads both xls and xlsx files and detects the format from the extension.
```{r}
pop <- read_excel("data/Population-1989-2019.xlsx", range= cell_rows(5:40))
```
```{r}
View(pop)
```
# The excel file is very crowded. It consists of many tables and notes in text format
# Let's focus on Total population on January 1 which is located between 5th and
# 40th rows. Note that the 5th row is a heading.
```{r}
# By default, it prints the first sheet
pop_red <- read_excel("data/Population-1989-2019.xlsx", range = cell_rows(5:40))
View(pop_red)
```
```{r}
colnames(pop)[1:2] <- c("Country", "Number")
```
```{r}
colnames(pop)[-(1:2)] <- paste("y", 1989:2019, sep="")
```
```{r}
View(pop)
```
```{r}
colnames(pop_red)[1:2] <- c("Country", "Number")
View(pop_red)
```
# Now it seems a bit OK! Since new data set still involves
# a few empty rows!..We can get rid of these rows
# when we are familiar with dplyr package.
# Maybe, it is better to get more help from
# https://readxl.tidyverse.org/articles/sheet-geometry.html
# https://readxl.tidyverse.org/articles/cell-and-column-types.html
# Sometimes, the Excel file may involve multiple sheets
# Go to http://transmonee.org/database/download/ and
# Download the Excel file of TransMonEE full database for 2019 and
# save the file into your working directory. Do this only once.
```{r}
url <- "http://transmonee.org/wp-content/uploads/2016/05/TM-2019-EN-June.xlsx"
download.file(url, "data/TM-2019-EN-June.xlsx")
```
# Check the number of sheets
```{r}
excel_sheets("data/TM-2019-EN-June.xlsx")
```
# It consists of 6 sheets
```{r}
full_data <- read_excel("data/TM-2019-EN-June.xlsx")
View(full_data)
```
# By default, it prints the first sheet
```{r}
juvenile_data <- read_excel("data/TM-2019-EN-June.xlsx", sheet = "5. Juvenile Justice & Crime")
View(juvenile_data)
```
# More work has to be done to retrieve this data!
# Any idea and suggestion may contribute to the R Community!
################## Study at home ###########################################################################
# Iterating over multiple files or multiple
# sheets can be done via purr package.
# https://readxl.tidyverse.org/articles/articles/readxl-workflows.html
# If you have any solutions, share it on Twitter with #mat381e #rstats and #unicef hashtags!..
# https://rviews.rstudio.com/2019/10/09/building-interactive-world-maps-in-shiny/
# Sometimes, your data may be in a Google SpreadSheet
# Such as the gapminder data at
# https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077
# More info can be found at
# https://moderndive.com/2-viz.html
# Then you can use googlesheets4 package to dowload
# this data into your working directory
```{r}
#install.packages("googlesheets4")
library(googlesheets4)
url <- "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
gapmind <- read_sheet(url)
```
# On the use of this package,
# it is better to read this
# https://www.tidyverse.org/google_privacy_policy/
# Or you can just download the Google Spreadsheet as a csv or xls file and then
# read it via read_csv or read_excel function.
# You can write out your files as well.