More control over handling non-200 responses when scraping #32

Open
francisbarton opened this issue Oct 1, 2020 · 0 comments
Labels
enhancement New feature or request

Comments

@francisbarton

I feel like this is a rather vague feature request, but hopefully the example below will help to illustrate my point. I think polite is a great project and I'd like to see it used more widely.

With httr you can ask for the response code from a GET request to a URL, and then choose what action to take if, for example, the code is not 200. polite::scrape uses httr, I believe, but handles the response internally, choosing for example to return NULL on a 404. I'm wondering if it could be made less opinionated.
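To make the httr pattern concrete, here is a minimal sketch. The helper names `describe_status` and `get_or_handle` and the `on_non_200` argument are hypothetical, made up for illustration; they are not part of httr or polite. Only `httr::GET()`, `httr::status_code()` and `httr::content()` are real httr functions.

```r
# Hypothetical helper: map an HTTP status code to a coarse category.
describe_status <- function(status) {
  if (status >= 200 && status < 300) {
    "success"
  } else if (status >= 400 && status < 500) {
    "client_error"
  } else if (status >= 500 && status < 600) {
    "server_error"
  } else {
    "other"
  }
}

# Hypothetical wrapper: fetch a URL with httr and let the caller decide
# what happens when the status code is not 200.
get_or_handle <- function(url, on_non_200 = function(resp) NULL) {
  resp <- httr::GET(url)
  status <- httr::status_code(resp)
  if (status != 200) {
    message("Got ", status, " (", describe_status(status), ") for ", url)
    return(on_non_200(resp))
  }
  # On success, return the response body as text.
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

A caller who prefers a missing value to NULL could then pass `on_non_200 = function(resp) NA_character_`, which is the kind of freedom this request is about.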

Here's a scraping script I wrote the other day, using purrr::map_dfr to combine responses into a single tibble. But if one of the URLs in the list returns a 404, the NULL that comes back breaks the whole thing, because the rvest calls that follow cannot handle it. I can get round this by rewriting the script (ex 2 below), or by using purrr::possibly (ex 3 below), or maybe by just using map with a reduce(bind_rows) ... but it might be good if polite gave the user more freedom as to how missing or invalid URLs are handled, rather than always returning NULL.

I hope that makes sense. Here are my examples:

library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

# create three URLs to test
urls <- paste0(url_root, 10:12, "-01-2015") # second URL returns 404

session <- polite::bow(
  url = url_root,
  user_agent = "Francis Barton fbarton@alwaysdata.net",
  delay = 3
)

function 1

scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html", verbose = TRUE)

  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

  dplyr::tibble(headings = headings, dates = dates)
}

# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"

function 2 - includes failsafe for 404s/NULL returns

scrape_page_safe <- function(url) {
  failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)

  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html")

  if (is.null(page_text)) {
    failsafe_tbl
  } else {
    headings <- page_text %>%
      rvest::html_nodes("h2") %>%
      rvest::html_text()

    dates <- page_text %>%
      rvest::html_nodes(".text-muted") %>%
      rvest::html_text() %>%
      stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

    dplyr::tibble(headings = headings, dates = dates)
  }
}

# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

function 3 - uses purrr::possibly with function 1 to handle errors

failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
purrr::map_dfr(urls,
  possibly(          # return a failsafe on error
    scrape_page,
    otherwise = failsafe_tbl
  )
)
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~
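
A variation on function 3, as a rough sketch: `purrr::safely` keeps the error alongside the result, so failed URLs can be reported separately instead of being padded with NA rows. `fake_scrape` is a hypothetical stand-in for `scrape_page` so that the example runs without network access; it is an assumption for illustration, not part of the script above.

```r
# Hypothetical stand-in for scrape_page(): fails for one input, as a 404 would.
fake_scrape <- function(x) {
  if (x == 2) stop("simulated 404 for input ", x)
  data.frame(id = x, value = x * 10)
}

# safely() returns a function that yields list(result = ..., error = ...)
safe_scrape <- purrr::safely(fake_scrape)
results <- purrr::map(1:3, safe_scrape)

# Split successes from failures; compact() drops the NULL entries.
ok <- purrr::compact(purrr::map(results, "result"))
errors <- purrr::compact(purrr::map(results, "error"))

# Bind the successful results and collect the error messages.
combined <- do.call(rbind, ok)
failed_messages <- purrr::map_chr(errors, conditionMessage)
```

Here `combined` holds only the rows that were scraped successfully, and `failed_messages` records why the others failed.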

Created on 2020-09-30 by the reprex package (v0.3.0)

dmi3kno added the enhancement (New feature or request) label on Aug 2, 2022