01-importing.Rmd


---
title: 'Week 03: Examples for Data Importing'
---

# Load packages
```{r, warning=F, message=F}
library(tidyverse)
library(readr)
library(readxl)
```

# Let's focus on readr package first
# Check the current directory

```{r}
getwd()
```


```{r}
cth_data <- read_csv("data/CTH_raw_na.csv")
```


# Visit Environmental Performance Index at https://epi.yale.edu/

# The Environmental Performance Index (EPI) ranks 180 countries on 32 performance indicators across 11 issue  
# categories covering environmental health and ecosystem vitality. 
# Specifically, environmental performance indicators (EPIs) examine environmental issues such as pollution, biodiversity, climate, # energy, erosion, ecosystem services, environmental education, and many others. 

#####This year, Denmark ranks #1 in the EPI#####
## https://epi.yale.edu/downloads/epipolicymakersummaryr11.pdf

# Go to "Downloads > EPI2020Raw data"
# https://epi.yale.edu/downloads/epi2020rawdata20200604v02.zip
# Download this file to your Desktop and unzip this file. 
# Focus on CTH_raw_na file and move this file to data folder under your working directory

```{r}
# move the file to working directory
#file.copy("/Users/gulinan/Desktop/epi2020rawdata20200604v02/CTH_raw_na.csv", "data/CTH_raw_na.csv")
# When your working directory is the path provided by
# getwd() function, but you want to access the data folder and 
# list all the files in it:
list.files(file.path(getwd(), "data"))
```

# CTH_raw_na is a csv file 
# Have a look at the first three lines of the CTH data

```{r}
# read_lines is a function from readr package
# ?read_lines
# read_lines() reads up to n_max lines from a file.
read_lines("data/CTH_raw_na.csv", n_max=3)
```

# The first line is header
# read_* functions all assume that col_names=T by default.
# read_* functions always skip empty rows through ski_empty_rows=T by default


```{r}
# more info
?read_csv
```

```{r}
# Read the data into R console
cth_data <- read_csv("data/CTH_raw_na.csv")
```

# When everything is OK, you will see your data under Environement section

```{r}
# Investigate data carefully
View(cth_data)
```

```{r}
# Check the types of variables: character, numeric, integer
str(cth_data)
```

# Note: NA stands for missing values in the data 

```{r}
# Check the class of the data
class(cth_data)
```

```{r}
# Check the dimensions
dim(cth_data)
```

```{r}
# Check the column names
colnames(cth_data)
```

```{r}
# Not comfortable with column names?
colnames(cth_data)[-seq(1:3)] <- paste(seq(1950,2014), sep = "")
```

```{r}
# Check the column names one more time!
colnames(cth_data)
```


# Since in readr function path is defined as:
# "Either a path to a file, a connection",
# we can also import csv files into R as well.
# Let's import a csv file on the internet into R now

# Visit Google' Covid-19 mobility report
# at https://www.google.com/covid19/mobility/
# Global data is available to be downloaded as a csv file
# at the following url:
# https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=c050b74b9ee831a7

```{r}
# First specify the url address of the data
url <- "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=c050b74b9ee831a7"
# Then read it into R (It takes a bit time)
google_mobility <- read_csv(url)
```

```{r}
# Investigate it as you wish
View(google_mobility)
```

```{r}
# or download this file into your local computer
# download.file functions is under utils library
download.file(url, "data/google_mobility.csv")
```


# Let's focus on readxl package
# readxl does not come with tidyverse. 
# For that reason, install readxl package and then load it
```{r}
# install.packages("readxl")
library(readxl)
```

# Visit at the web site of Monitoring the situation of 
# children and women in Europe and Central Asia
# at http://transmonee.org/
# Go to http://transmonee.org/database/ and 
# Download Excel file named
# "Population at the beginning of the year by sex and selected age groups"
# into to your working directory and rename it as Population-1989-2019
# since the file name is very long.

```{r}
# List the sheet names
excel_sheets("data/Population-1989-2019.xlsx")
```

# read_excel() reads both xls and xlsx files and detects the format from the extension.

```{r}
pop <- read_excel("data/Population-1989-2019.xlsx", range= cell_rows(5:40))
```

```{r}
View(pop)
```

# The excel file is very crowded. It consists of many tables and notes in text format
# Let's focus on Total population on January 1 which is located between 5th and
# 40th rows. Note that the 5th row is a heading.

```{r}
# By default, it prints the first sheet
pop_red <- read_excel("data/Population-1989-2019.xlsx", range = cell_rows(5:40))
View(pop_red)
```


```{r}
colnames(pop)[1:2] <- c("Country", "Number")
```


```{r}
colnames(pop)[-(1:2)] <- paste("y", 1989:2019, sep="")
```

```{r}
View(pop)
```

```{r}
colnames(pop_red)[1:2] <- c("Country", "Number")
View(pop_red)
```

# Now it seems a bit OK! Since new data set still involves
# a few empty rows!..We can get rid of these rows
# when we are familiar with dplyr package. 

# Maybe, it is better to get more help from 
# https://readxl.tidyverse.org/articles/sheet-geometry.html
# https://readxl.tidyverse.org/articles/cell-and-column-types.html


# Sometimes, the Excel file may involve multiple sheets
# Go to http://transmonee.org/database/download/ and
# Download the Excel file of TransMonEE full database for 2019 and
# save the file into your working directory. Do this only once.
```{r}
url <- "http://transmonee.org/wp-content/uploads/2016/05/TM-2019-EN-June.xlsx"
download.file(url, "data/TM-2019-EN-June.xlsx")
```
# Check the number of sheets
```{r}
excel_sheets("data/TM-2019-EN-June.xlsx")
```
# It consists of 6 sheets
```{r}
full_data <- read_excel("data/TM-2019-EN-June.xlsx")
View(full_data)
```

# By default, it prints the first sheet
```{r}
juvenile_data <- read_excel("data/TM-2019-EN-June.xlsx", sheet = "5. Juvenile Justice & Crime")
View(juvenile_data)
```

# More work has to be done to retrieve this data!
# Any idea and suggestion may contribute to the R Community!


################## Study at home ###########################################################################
# Iterating over multiple files or multiple
# sheets can be done via purr package. 
# https://readxl.tidyverse.org/articles/articles/readxl-workflows.html

# If you have any solutions, share it on Twitter with #mat381e #rstats and #unicef hashtags!..

# https://rviews.rstudio.com/2019/10/09/building-interactive-world-maps-in-shiny/


# Sometimes, your data may be in a Google SpreadSheet
# Such as the gapminder data at 
# https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077
# More info can be found at 
# https://moderndive.com/2-viz.html
# Then you can use googlesheets4 package to dowload 
# this data into your working directory

```{r}
#install.packages("googlesheets4")
library(googlesheets4)
url <- "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
gapmind <- read_sheet(url)
```


# On the use of this package,
# it is better to read this
# https://www.tidyverse.org/google_privacy_policy/

# Or you can just download the Google Spreadsheet as a csv or xls file and then
# read it via read_csv or read_excel function.

# You can write out your files as well.