add in english names to jpnprefs dataset #21

Merged
merged 2 commits into from
Sep 20, 2018
54 changes: 51 additions & 3 deletions data-raw/jpnprefs.R
@@ -14,6 +14,10 @@ library(tidyverse)
# dplyr # 0.7.6
# tidyr # 0.8.1
# purrr # 0.2.5
# (stringr) # 1.3.1

library(polite) # 0.0.0.9004

Owner

Why should we introduce the polite package?
This package is certainly useful, but has not yet been registered with CRAN.

Contributor Author

@Ryo-N7 Sep 20, 2018

Yeah, I suppose it's not strictly necessary to use this package for now, since we're only scraping Wikipedia anyway. I use it as part of my own workflow, but I understand that it isn't needed from a package development and maintenance point of view.

We can just replace it with the regular rvest code instead:

url <- "https://en.wikipedia.org/wiki/Prefectures_of_Japan"
jpn_pref_raw <- read_html(url) %>% 
  html_nodes("table.wikitable:nth-child(49)") %>% 
  html_table() %>% 
  purrr::flatten_df()

url2 <- "https://en.wikipedia.org/wiki/List_of_Japanese_prefectures_by_population"
# read from url2 here, not url, so the population page is the one scraped
jpn_pref2_raw <- read_html(url2) %>% 
  html_nodes("table.wikitable:nth-child(7)") %>% 
  html_table() %>% 
  purrr::flatten_df()
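
The `nth-child()` selectors above depend on where the table sits in the page, which is why the diff had to change one after the Wikipedia page was edited. As a rough illustration only, the sketch below runs the same rvest calls against a small inline HTML string instead of the live page. The HTML snippet and the object names `toy_html`, `toy_page` and `toy_tables` are made up for this example and are not part of the PR.

```r
library(rvest)

# Hypothetical stand-in for a Wikipedia page: the table is the 2nd child of <div>.
toy_html <- '<html><body><div>
  <p>Intro paragraph</p>
  <table class="wikitable">
    <tr><th>Prefecture</th><th>Kanji</th></tr>
    <tr><td>Aomori</td><td>青森県</td></tr>
  </table>
</div></body></html>'

toy_page <- read_html(toy_html)

# Same pattern as the PR: select by class and position, then parse to a table.
toy_tables <- toy_page %>%
  html_nodes("table.wikitable:nth-child(2)") %>%
  html_table()

# html_table() on a node set returns a list with one element per matched table.
toy_tables[[1]]
```

If a paragraph were added before the table, the table would become the third child and `nth-child(2)` would no longer match it, which is the kind of breakage the updated selector in the diff works around.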




# Japanese ----------------------------------------------------------------
@@ -22,7 +26,7 @@ x <-

df <-
x %>%
- html_nodes(css = "#mw-content-text > div > table.wikitable.sortable") %>%
+ html_nodes(css = "table.wikitable:nth-child(104)") %>% # css to correct table as wiki page was edited
html_table(fill = TRUE) %>%
purrr::flatten_df() %>%
select(2, 4, 6, 11) %>%
@@ -92,10 +96,54 @@ jpnprefs %<>%
select(jis_code, prefecture, capital, region, major_island, capital_latitude = latitude, capital_longitude = longitude) %>%
as_tibble()

# ---- English region and island names
url <- "https://en.wikipedia.org/wiki/Prefectures_of_Japan"

session <- bow(url)

jpn_pref_raw <- scrape(session) %>%
html_nodes("table.wikitable:nth-child(49)") %>%
#.[[1]] %>%
html_table() %>%
purrr::flatten_df()

jpn_pref_df <- jpn_pref_raw %>%
janitor::clean_names() %>%

Owner

I don't see a strong reason to use janitor for this step.
Could you switch to dplyr::select(), which selects and renames the variables explicitly?

Contributor Author

@Ryo-N7 Sep 20, 2018

Yeah, sure! I'll do it similar to how you set the names for the Japanese table using set_colnames():

jpn_pref_df <- jpn_pref_raw %>% 
  select(2, 4, 5) %>% 
  set_colnames(c("kanji", "region_en", "major_island_en")) %>% 
  mutate(region_en = region_en %>% iconv(from = "UTF-8", to = "ASCII//TRANSLIT")) 
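
For comparison, here is a minimal sketch of the two renaming styles discussed in this thread, run on a toy table rather than the scraped one. The data frame `toy_raw` and its column names are assumptions made for illustration; the real scraped table has its own layout. The result of `iconv(..., to = "ASCII//TRANSLIT")` depends on the platform's iconv implementation, so no expected output is shown for it.

```r
library(dplyr)
library(magrittr)  # provides set_colnames()

# Hypothetical stand-in for the scraped table (column layout is assumed).
toy_raw <- tibble(
  prefecture = c("Hokkaidō", "Kyōto"),
  kanji      = c("北海道", "京都府"),
  capital    = c("Sapporo", "Kyōto"),
  region     = c("Hokkaidō", "Kansai"),
  island     = c("Hokkaidō", "Honshū")
)

# Style 1: select by position, then set all names at once.
toy_df <- toy_raw %>%
  select(2, 4, 5) %>%
  set_colnames(c("kanji", "region_en", "major_island_en")) %>%
  mutate(region_en = iconv(region_en, from = "UTF-8", to = "ASCII//TRANSLIT"))

# Style 2: select and rename by name in a single select() call.
toy_df2 <- toy_raw %>%
  select(kanji, region_en = region, major_island_en = island)
```

Selecting by name keeps working if the source table gains or reorders columns, while selecting by position silently picks up whatever now sits at those indices.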

select(kanji, region_en = region, major_island_en = major_island) %>%
mutate(region_en = region_en %>% iconv(from = "UTF-8", to = "ASCII//TRANSLIT"))

# ---- English prefecture and capital names
url2 <- "https://en.wikipedia.org/wiki/List_of_Japanese_prefectures_by_population"

session2 <- bow(url2)

jpn_pref2_raw <- scrape(session2) %>%
html_nodes("table.wikitable:nth-child(7)") %>%
#.[[1]] %>%
html_table() %>%
purrr::flatten_df()

jpn_pref2_df <- jpn_pref2_raw %>%
janitor::clean_names() %>%

Owner

Same as above.

Contributor Author

jpn_pref2_df <- jpn_pref2_raw %>% 
  select(3, 2, 4) %>% 
  set_colnames(c("kanji", "prefecture_en", "capital_en")) %>% 
  mutate(prefecture_en = prefecture_en %>% iconv(from = "UTF-8", to = "ASCII//TRANSLIT"),
         capital_en = capital_en %>% iconv(from = "UTF-8", to = "ASCII//TRANSLIT"))

select(kanji = japanese, prefecture_en = prefectures, capital_en = capital) %>%
mutate(prefecture_en = prefecture_en %>% iconv(from = "UTF-8", to = "ASCII//TRANSLIT"),
capital_en = capital_en %>% iconv(from = "UTF-8", to = "ASCII//TRANSLIT"))

# ---- Join with jpnprefs
jpnprefs <- jpnprefs %>%
left_join(jpn_pref_df, by = c("prefecture" = "kanji")) %>%
left_join(jpn_pref2_df, by = c("prefecture" = "kanji")) %>%
select(jis_code, prefecture, capital, region, major_island,
prefecture_en, capital_en, region_en, major_island_en,
capital_latitude, capital_longitude) %>%
as_tibble()

expect_named(jpnprefs,
- c("jis_code", "prefecture", "capital", "region", "major_island", "capital_latitude", "capital_longitude"))
+ c("jis_code", "prefecture", "capital", "region", "major_island",
+ "prefecture_en", "capital_en", "region_en", "major_island_en",
+ "capital_latitude", "capital_longitude"))
expect_equal(dim(jpnprefs),
- c(47, 7))
+ c(47, 11))
expect_s3_class(jpnprefs,
c("data.frame", "tbl_df"))
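
The joins above match the Japanese `prefecture` column against the scraped `kanji` column, so any spelling mismatch between the two sources would silently produce `NA` in the English columns. One possible way to check for that before joining is sketched below on toy data. The objects `base_df` and `en_df` are hypothetical miniatures of the real tables, not code from this PR.

```r
library(dplyr)

# Hypothetical miniature versions of the tables joined in the script.
base_df <- tibble(
  jis_code   = c("01", "02"),
  prefecture = c("北海道", "青森県")
)
en_df <- tibble(
  kanji         = c("北海道", "青森県"),
  prefecture_en = c("Hokkaido", "Aomori")
)

# Rows of base_df with no partner in en_df; these would get NA after left_join().
unmatched <- anti_join(base_df, en_df, by = c("prefecture" = "kanji"))

# Same join shape as in the script: keep every base row, add the English names.
joined <- left_join(base_df, en_df, by = c("prefecture" = "kanji"))
```

An empty `unmatched` table means every prefecture found an English counterpart, which complements the `expect_named()` and `expect_equal()` checks that only look at column names and dimensions.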

Binary file modified data/jpnprefs.rda
Binary file not shown.
Binary file modified inst/extdata/jpnprefs.rds
Binary file not shown.