I feel like this is a rather vague feature request, but hopefully the examples below will help to illustrate my point. I think polite is a great project and I'd like to see it used more widely.
With httr you can ask for the response code from a GET request to a URL, and then choose what action to take if, for example, the code is != 200. polite::scrape uses httr, I believe, but handles the response internally, choosing for example to return NULL on a 404. I'm wondering if it could be made less opinionated.
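To make the httr pattern above concrete, here is a minimal sketch of checking the status code yourself and deciding what to do with it. The helper names `classify_status` and `fetch_with_status`, and the three action labels, are hypothetical and chosen only for illustration; the only httr calls used are `httr::GET` and `httr::status_code`.

```r
# Hypothetical helper: decide what to do with a given HTTP status code.
classify_status <- function(code) {
  if (code == 200L) {
    "parse"   # page is there, go ahead and scrape it
  } else if (code == 404L) {
    "skip"    # page is missing, the caller might record an NA row
  } else {
    "retry"   # anything else, the caller might try again later
  }
}

# Hypothetical wrapper: fetch a URL with httr and report the chosen action.
# This part needs network access and the httr package installed.
fetch_with_status <- function(url) {
  resp <- httr::GET(url)
  code <- httr::status_code(resp)
  list(url = url, status = code, action = classify_status(code))
}
```

The point of the sketch is only that the caller, not the package, picks the behaviour for each status code.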
Here's a scraping script I wrote the other day, using purrr::map_dfr to combine responses into a single tibble. But if one of a list of URLs returns a 404 then the NULL value breaks the whole thing. I can get round this by rewriting the script (ex 2 below), or by using purrr::possibly (ex 3 below) or maybe by just using map with a reduce(bind_rows) ... but it might be good if polite gave the user more freedom internally as to how it should handle missing or invalid URLs rather than necessarily returning NULL.
I hope that makes sense. Here are my examples:
library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)
url_root<-"https://www.ongelukvandaag.nl/archief/"
# create three URLs to test
urls<- paste0(url_root, 10:12, "-01-2015") # second URL returns 404
session<-polite::bow(
url=url_root,
user_agent="Francis Barton fbarton@alwaysdata.net",
delay=3
)
function 1
scrape_page<-function(url) {
page_text<-polite::nod(session, url) %>%
polite::scrape(accept="html", verbose=TRUE)
headings<-page_text %>%
rvest::html_nodes("h2") %>%
rvest::html_text()
dates<-page_text %>%
rvest::html_nodes(".text-muted") %>%
rvest::html_text() %>%
stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")
dplyr::tibble(headings=headings, dates=dates)
}
# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"
function 2 - includes failsafe for 404s/NULL returns
scrape_page_safe<-function(url) {
failsafe_tbl<-dplyr::tibble(headings=NA_character_, dates=NA_character_)
page_text<-polite::nod(session, url) %>%
polite::scrape(accept="html")
if (is.null(page_text)) {
failsafe_tbl
} else {
headings<-page_text %>%
rvest::html_nodes("h2") %>%
rvest::html_text()
dates<-page_text %>%
rvest::html_nodes(".text-muted") %>%
rvest::html_text() %>%
stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")
dplyr::tibble(headings=headings, dates=dates)
}
}
# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#>   headings                                                             dates
#>   <chr>                                                                <chr>
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~
function 3 - uses purrr::possibly with function 1 to handle errors
failsafe_tbl<-dplyr::tibble(headings=NA_character_, dates=NA_character_)
purrr::map_dfr(urls,
possibly( # return a failsafe on error
scrape_page,
otherwise=failsafe_tbl
)
)
#> # A tibble: 8 x 2
#>   headings                                                             dates
#>   <chr>                                                                <chr>
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~
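The remaining alternative mentioned earlier, plain `map` followed by row-binding, might look roughly like the sketch below. It is not run against the real site: `scrape_or_null` is a hypothetical stand-in that returns NULL for the 404 URL and a dummy tibble otherwise, and `purrr::compact` plus `dplyr::bind_rows` is used in place of `reduce(bind_rows)`. Note that the real `scrape_page` would first need an early NULL return for this to work, since it currently errors rather than returning NULL.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for a scraper: returns NULL for the URL that 404s,
# and a one-row dummy tibble for the others (no network access involved).
scrape_or_null <- function(url) {
  if (grepl("11-01-2015", url, fixed = TRUE)) {
    return(NULL)
  }
  dplyr::tibble(
    headings = paste("heading for", url),
    dates = sub(".*/", "", url)  # keep only the part after the last slash
  )
}

urls <- paste0("https://www.ongelukvandaag.nl/archief/", 10:12, "-01-2015")

# map() keeps the NULL in the list, compact() drops it,
# and bind_rows() combines the remaining tibbles into one.
result <- purrr::map(urls, scrape_or_null) %>%
  purrr::compact() %>%
  dplyr::bind_rows()
```

Unlike the two failsafe versions above, this drops the missing page entirely instead of leaving an NA row, which may or may not be what you want.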
Created on 2020-09-30 by the reprex package (v0.3.0)