2/21/2022
I’ve created a Tableau dashboard that gives an exploratory view of top-ranked and popular anime titles, built from data at http://myanimelist.net. The purpose of this dashboard is to help the end user find their next anime to watch. However, the tool is missing the streaming platform that currently hosts each title.
To add that information into the dashboard, I need to locate or scrape the anime titles that are housed on various platforms.
The most popular platforms for viewing anime titles are:
- Crunchyroll
- Funimation
- Netflix
- Hulu
- Amazon Prime
This list is not all-inclusive and is in no particular order.
This code covers downloading the Funimation titles listed on the MyAnimeList company page for Funimation at https://myanimelist.net/anime/producer/102/Funimation
Set your working directory to wherever you’d like by replacing the WORKINGDIRECTORY placeholder below.
# Replace the placeholder with the path to your own folder
wd1 <- "WORKINGDIRECTORY"
setwd(wd1)
The tidyverse contains the rvest package, which will be used to scrape the data, while dplyr and tidyr will be used to manipulate/transform data.
install.packages("tidyverse")
Next, we want to load two libraries:
library(tidyverse)
library(rvest)
Although rvest is a component of the tidyverse, it doesn’t automatically load with the library(tidyverse) call. As a result, you’ll need to load it separately.
Funimation does have a listing of all of their titles on a webpage at https://www.funimation.com/shows/, but as far as I am aware it does not allow scraping.
The MyAnimeList website, however, has a page dedicated to Funimation that also carries attributes relating to ranking and number of users. That is the page we will be scraping: https://myanimelist.net/anime/producer/102/Funimation.
funimation <- read_html('https://myanimelist.net/anime/producer/102/Funimation')
right_side_bucket <- funimation %>%
  html_element(xpath = '/html/body/div[1]/div[2]/div[3]/div[2]/div[2]')
I wasn’t able to identify a single container for the table that is used for all of the titles, so I’ve selected the right side of the screen as my starting point.
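An absolute XPath like the one above is tied to the exact nesting of the page, so it can stop matching if the layout shifts. As a minimal sketch of the difference, the toy markup below selects the same element once by position and once by id. The markup is made up for illustration, and the `content` id in particular is an assumption, not something taken from the real MyAnimeList page.

```r
library(rvest)

# Toy markup (hypothetical, not the real page structure)
page <- minimal_html('
  <div id="wrapper">
    <div class="sidebar"><a href="/news">News</a></div>
    <div id="content"><span class="js-title">Title A</span></div>
  </div>')

# Positional selection: second div inside the first div of the body
by_xpath <- html_element(page, xpath = "/html/body/div[1]/div[2]")

# Selection by id: keeps working if sibling elements are added or reordered
by_css <- html_element(page, "#content")

html_text(by_xpath, trim = TRUE)
html_text(by_css, trim = TRUE)
```

Both calls return the same node here. Whether the real page exposes a stable id or class for the title area is something I would check in the browser's inspector first.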
From there, I’ll first grab every link along with the text inside each link.
# grab links
link <- right_side_bucket %>%
  html_elements('a') %>%
  html_attr('href')
# grab text between link
link_text <- right_side_bucket %>%
  html_elements('a') %>%
  html_text(trim = TRUE)
head(link)
## [1] "https://myanimelist.net/anime/producer/102/Funimation"
## [2] "https://myanimelist.net/anime/producer/102/Funimation/news"
## [3] "https://myanimelist.net/"
## [4] "https://myanimelist.net/company"
## [5] "https://myanimelist.net/anime/producer/102/Funimation"
## [6] "https://myanimelist.net/anime/producer/102/Funimation/news"
head(link_text)
## [1] "Details" "News" "Top" "Companies" "Funimation"
## [6] "More News"
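Both html_attr() and html_text() return one value per matched a element, so the two vectors line up position by position. A link with no visible text still gets an entry, just an empty one. The sketch below shows this on made-up markup. The image-only link is my assumption about how such empty entries can arise, not something verified against the page source.

```r
library(rvest)

# Toy markup (hypothetical): one image-only link followed by two text links
page <- minimal_html('
  <div>
    <a href="https://example.com/anime/1/Title_A"><img src="a.jpg"></a>
    <a href="https://example.com/anime/1/Title_A">Title A</a>
    <a href="https://example.com/anime/2/Title_B">Title B</a>
  </div>')

anchors  <- html_elements(page, "a")
toy_link <- html_attr(anchors, "href")
toy_text <- html_text(anchors, trim = TRUE)

# Same length, and the image-only link yields an empty string
length(toy_link) == length(toy_text)
toy_text
```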
The structure of the website also stores the following within span elements:
- members - the number of members who have watched or are watching this title
- score - the average score (from 1 to 10) from all users
- start_date - the start date of the anime
- title - the title of the anime
I’ll grab those components by selecting the specific classes within the span elements:
# grab members
members <- right_side_bucket %>%
  html_elements('span.js-members') %>%
  html_text(trim = TRUE)
# grab score
score <- right_side_bucket %>%
  html_elements('span.js-score') %>%
  html_text(trim = TRUE)
# grab start date
start_date <- right_side_bucket %>%
  html_elements('span.js-start_date') %>%
  html_text(trim = TRUE)
# grab title
anime_title <- right_side_bucket %>%
  html_elements('span.js-title') %>%
  html_text(trim = TRUE)
length(link)
## [1] 3614
length(link_text)
## [1] 3614
length(members)
## [1] 1200
length(score)
## [1] 1200
length(start_date)
## [1] 1200
length(anime_title)
## [1] 1200
We can see that the two link vectors share one length and the four span vectors share another. To combine the two groups, we first need to clean up the link information. Let’s look at each of the collected vectors below:
knitr::kable(head(link))
knitr::kable(head(link_text))
x |
---|
Details |
News |
Top |
Companies |
Funimation |
More News |
knitr::kable(head(members))
x |
---|
1557631 |
641079 |
103412 |
146845 |
1789268 |
301804 |
knitr::kable(head(score))
x |
---|
8.76 |
8.22 |
7.26 |
8.14 |
8.62 |
7.90 |
knitr::kable(head(start_date))
x |
---|
19980403 |
19980401 |
20020702 |
20040417 |
19991020 |
20041005 |
knitr::kable(head(anime_title))
x |
---|
Cowboy Bebop |
Trigun |
Witch Hunter Robin |
Initial D Fourth Stage |
One Piece |
School Rumble |
Since the links need some work, let’s merge them into one data frame:
link_combo <- data.frame(link = link, link_text = link_text, stringsAsFactors = FALSE)
knitr::kable(head(link_combo, n=20))
It looks like the first 15 links aren’t pertinent to what we’re looking for, so let’s get rid of those:
link_combo <- tail(link_combo, n=3599)
knitr::kable(head(link_combo))
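The fixed count of 3599 depends on how many navigation links the page had on the day it was scraped. As a hedged alternative, rows could be kept by URL shape instead. The sketch below assumes that title links always look like .../anime/<numeric id>/<name>, which matches the examples seen so far but is not guaranteed. The toy rows are made up to mirror those shapes.

```r
# Toy stand-in for link_combo (hypothetical rows, not scraped data)
toy_links <- data.frame(
  link = c("https://myanimelist.net/anime/producer/102/Funimation/news",
           "https://myanimelist.net/company",
           "https://myanimelist.net/anime/1/Cowboy_Bebop",
           "https://myanimelist.net/anime/6/Trigun"),
  link_text = c("News", "Companies", "Cowboy Bebop", "Trigun"),
  stringsAsFactors = FALSE
)

# Keep only rows whose URL contains /anime/ followed directly by a numeric id
is_title_link <- grepl("/anime/[0-9]+/", toy_links$link)
toy_titles <- toy_links[is_title_link, ]

toy_titles$link_text
```

The producer and company links drop out because "/anime/" is not followed by digits in those URLs.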
Now, let’s look at the bottom of the list:
knitr::kable(tail(link_combo))
The last row can be dropped, and we also want to remove the entries that don’t have any associated text.
nrow(link_combo)
## [1] 3599
link_combo <- head(link_combo, n=3598)
lapply(link_combo,function(x) { length(which(is.na(x)))})
## $link
## [1] 0
##
## $link_text
## [1] 0
So none of the columns contain NA values; however, many of the entries are empty strings.
We can run the following line to remove any row that contains an empty string:
link_combo <- link_combo[!apply(link_combo == "", 1, any), ]
nrow(link_combo)
## [1] 1200
knitr::kable(head(link_combo))
row | link | link_text |
---|---|---|
16 | https://myanimelist.net/anime/1/Cowboy_Bebop | Cowboy Bebop |
19 | https://myanimelist.net/anime/6/Trigun | Trigun |
22 | https://myanimelist.net/anime/7/Witch_Hunter_Robin | Witch Hunter Robin |
25 | https://myanimelist.net/anime/18/Initial_D_Fourth_Stage | Initial D Fourth Stage |
28 | https://myanimelist.net/anime/21/One_Piece | One Piece |
31 | https://myanimelist.net/anime/24/School_Rumble | School Rumble |
Note: using the ‘any’ function allows removal of any row that has an empty string. If I wanted to remove rows which had empty strings in all columns, I would simply change ‘any’ to ‘all’.
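To make the difference concrete, here is a small sketch on a made-up data frame. It also counts empty strings per column, which is the empty-string counterpart of the NA count used earlier.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  link = c("u1", "u2", "", "u4"),
  link_text = c("Title A", "", "", "Title D"),
  stringsAsFactors = FALSE
)

# Count empty strings per column
empty_counts <- colSums(toy == "")

# Drop a row if ANY column is empty: rows 2 and 3 are removed
drop_any <- toy[!apply(toy == "", 1, any), ]

# Drop a row only if ALL columns are empty: only row 3 is removed
drop_all <- toy[!apply(toy == "", 1, all), ]

nrow(drop_any)
nrow(drop_all)
```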
Now we have all dataframes and vectors of the same length and we can combine them into one dataframe.
Before I do that, I want to extract the mal_id from the link, because I have another table keyed on this ID that I would like to join at some point.
There are a couple of ways to do this:
- I can use the tidyr package to separate the link column on the delimiter ‘/’, then pick the correct piece as my mal_id
- I can use a regular expression to extract the ID and add it as a new column.
Let’s do the latter, since regular expressions are vital for string operations.
#For details on how to create a regex for extracting a string between two strings: https://regexland.com/all-between-specified-characters/
link_combo$mal_id <- str_extract(link_combo$link, "(?<=anime/).*(?=/)")
knitr::kable(head(link_combo))
row | link | link_text | mal_id |
---|---|---|---|
16 | https://myanimelist.net/anime/1/Cowboy_Bebop | Cowboy Bebop | 1 |
19 | https://myanimelist.net/anime/6/Trigun | Trigun | 6 |
22 | https://myanimelist.net/anime/7/Witch_Hunter_Robin | Witch Hunter Robin | 7 |
25 | https://myanimelist.net/anime/18/Initial_D_Fourth_Stage | Initial D Fourth Stage | 18 |
28 | https://myanimelist.net/anime/21/One_Piece | One Piece | 21 |
31 | https://myanimelist.net/anime/24/School_Rumble | School Rumble | 24 |
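The pattern above uses a greedy `.*`, which works here because each title URL has exactly one slash after the ID. A slightly stricter variant, sketched below in base R, matches digits only and returns NA when the URL has a different shape. It assumes the ID is always numeric, and the helper name `extract_mal_id` is hypothetical. The split-based option from the list is shown alongside for comparison.

```r
urls <- c("https://myanimelist.net/anime/1/Cowboy_Bebop",
          "https://myanimelist.net/anime/21/One_Piece",
          "https://myanimelist.net/anime/producer/102/Funimation")

# Hypothetical helper: capture the digits between "/anime/" and the next "/"
extract_mal_id <- function(url) {
  pattern <- "^.*/anime/([0-9]+)/.*$"
  id <- ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
  as.integer(id)
}

ids_regex <- extract_mal_id(urls)

# Split-based alternative: the ID is the fifth piece when splitting on "/"
# (position-based, so it does not check that the piece is numeric)
pieces <- strsplit(urls, "/", fixed = TRUE)
ids_split <- vapply(pieces, function(p) p[5], character(1))

ids_regex
ids_split
```

Returning an integer rather than a string may also make a later join easier if the other table stores the ID as a number, though that depends on the other table.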
OK, now let’s put them all together:
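# anime_title comes from the span elements and url_name from the link text, so the check below compares two independent sources
funimation_table <- data.frame(mal_id = link_combo$mal_id, anime_title = anime_title, score = score, members = members, start_date = start_date, url = link_combo$link, url_name = link_combo$link_text, stringsAsFactors = FALSE)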
identical(funimation_table[['anime_title']],funimation_table[['url_name']])
## [1] TRUE
Checking that anime_title and url_name are identical is a way to confirm that the two datasets are in the same order.
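If such a check ever came back FALSE, it would help to know where the two vectors disagree. A minimal sketch on made-up vectors (the names `span_titles` and `link_titles` are hypothetical):

```r
span_titles <- c("Cowboy Bebop", "Trigun", "One Piece")
link_titles <- c("Cowboy Bebop", "One Piece", "Trigun")  # deliberately out of order

# identical() gives a single TRUE/FALSE for the whole vector
aligned <- identical(span_titles, link_titles)

# which() pinpoints the positions that disagree
mismatched <- which(span_titles != link_titles)

aligned
mismatched
```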
Let’s see what the table looks like now:
knitr::kable(head(funimation_table))
mal_id | anime_title | score | members | start_date | url | url_name |
---|---|---|---|---|---|---|
1 | Cowboy Bebop | 8.76 | 1557631 | 19980403 | https://myanimelist.net/anime/1/Cowboy_Bebop | Cowboy Bebop |
6 | Trigun | 8.22 | 641079 | 19980401 | https://myanimelist.net/anime/6/Trigun | Trigun |
7 | Witch Hunter Robin | 7.26 | 103412 | 20020702 | https://myanimelist.net/anime/7/Witch_Hunter_Robin | Witch Hunter Robin |
18 | Initial D Fourth Stage | 8.14 | 146845 | 20040417 | https://myanimelist.net/anime/18/Initial_D_Fourth_Stage | Initial D Fourth Stage |
21 | One Piece | 8.62 | 1789268 | 19991020 | https://myanimelist.net/anime/21/One_Piece | One Piece |
24 | School Rumble | 7.90 | 301804 | 20041005 | https://myanimelist.net/anime/24/School_Rumble | School Rumble |
The url_name column is now redundant, so we can remove it and then export the table.
funimation_table <- funimation_table[, 1:(length(funimation_table)-1)]
knitr::kable(head(funimation_table))
mal_id | anime_title | score | members | start_date | url |
---|---|---|---|---|---|
1 | Cowboy Bebop | 8.76 | 1557631 | 19980403 | https://myanimelist.net/anime/1/Cowboy_Bebop |
6 | Trigun | 8.22 | 641079 | 19980401 | https://myanimelist.net/anime/6/Trigun |
7 | Witch Hunter Robin | 7.26 | 103412 | 20020702 | https://myanimelist.net/anime/7/Witch_Hunter_Robin |
18 | Initial D Fourth Stage | 8.14 | 146845 | 20040417 | https://myanimelist.net/anime/18/Initial_D_Fourth_Stage |
21 | One Piece | 8.62 | 1789268 | 19991020 | https://myanimelist.net/anime/21/One_Piece |
24 | School Rumble | 7.90 | 301804 | 20041005 | https://myanimelist.net/anime/24/School_Rumble |
write.csv(funimation_table, 'funimationtitles.csv', row.names = FALSE)
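One thing to keep in mind is that every scraped column is a character string, since html_text() returns text. When the CSV is read back, read.csv() infers types on its own, so the numbers come back as numbers but start_date comes back as an integer rather than a date. The sketch below round-trips a couple of toy rows through a temporary file to show this. The conversion step is my suggestion, not part of the original workflow.

```r
# Toy rows copied from the table above
toy_table <- data.frame(
  mal_id = c("1", "6"),
  anime_title = c("Cowboy Bebop", "Trigun"),
  score = c("8.76", "8.22"),
  members = c("1557631", "641079"),
  start_date = c("19980403", "19980401"),
  stringsAsFactors = FALSE
)

tmp <- tempfile(fileext = ".csv")
write.csv(toy_table, tmp, row.names = FALSE)

# read.csv infers column types, so numeric-looking text comes back as numbers
restored <- read.csv(tmp, stringsAsFactors = FALSE)

# start_date comes back as an integer such as 19980403; parse it into a Date
restored$start_date <- as.Date(as.character(restored$start_date), format = "%Y%m%d")

str(restored)
```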