
Case-sensitive robots.txt results in incorrect crawl delay #55

Closed
steffilazerte opened this issue Jul 15, 2020 · 2 comments
@steffilazerte (Member)

I'm trying to use the polite package for, well, polite web scraping. One problem I've run into is that polite uses robotstxt's values for crawl delays, but in this example it ends up with a crawl delay of 2000 (taken from the first line whose user agent is *), which doesn't match the delay the robots.txt file actually sets for *.

library(robotstxt)
r <- robotstxt("https://r-bloggers.com")
r$crawl_delay
#>         field useragent value
#> 1 Crawl-delay Googlebot     1
#> 2 Crawl-delay     spbot  2000
#> 3 Crawl-delay   BLEXBot  2000
#> 4 Crawl-delay         *  2000
#> 5 Crawl-delay         *    20
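To illustrate why the duplicate * rows matter, here is a hypothetical sketch (not polite's actual selection logic) of what happens if a consumer takes the first row matching its user agent, using a data frame shaped like the `crawl_delay` output above:

```r
# Rows mirroring the crawl_delay output above; a consumer that takes the
# first "*" match gets 2000 instead of the 20 that the site intends for "*".
delays <- data.frame(
  useragent = c("Googlebot", "spbot", "BLEXBot", "*", "*"),
  value     = c(1, 2000, 2000, 2000, 20)
)
delays$value[delays$useragent == "*"][1]
#> [1] 2000
```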

I think the problem is that one of the user-agent fields in the robots.txt file is written with a capital "A" ("User-Agent"), so it isn't recognized and the preceding group's delay carries over. Is this something that should be fixed by the site, or would it be possible to make the field matching case-insensitive?

https://r-bloggers.com/robots.txt

Showing only part of the file:

User-agent: Googlebot-Mobile
Allow: /

User-agent: Googlebot
Crawl-delay: 1

User-agent: spbot
Crawl-delay: 2000

User-agent: BLEXBot
Crawl-delay: 2000

User-Agent: AhrefsBot 
Crawl-delay: 2000

User-agent: * 
Crawl-delay: 20
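One possible fix, sketched below, is to normalize the case of the field name before matching. This is a minimal illustration assuming the parser sees raw robots.txt lines; `normalize_fields` is a hypothetical helper, not part of robotstxt's real internals:

```r
# Hypothetical sketch: normalize "User-Agent:"/"USER-AGENT:" etc. to the
# canonical "User-agent:" before any field matching, using a
# case-insensitive regex. `lines` stands in for raw robots.txt lines.
normalize_fields <- function(lines) {
  sub("^user-agent:", "User-agent:", lines, ignore.case = TRUE)
}

normalize_fields(c("User-Agent: AhrefsBot", "User-agent: *"))
#> [1] "User-agent: AhrefsBot" "User-agent: *"
```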

Thanks!

@steffilazerte steffilazerte changed the title Extracted crawl delay doesn't match robots.txt Case-sensitive robots.txt results in incorrect crawl delay Jul 15, 2020
@hrbrmstr

FWIW https://developers.google.com/search/reference/robots_txt?hl=en suggests the values should be treated in a case-insensitive manner.

@steffilazerte (Member, Author)

Good reference! Specifically: https://developers.google.com/search/reference/robots_txt?hl=en#file-format

But that still assumes my interpretation of the problem (that it's due to case-sensitivity) is correct, and I'm not sure it is! 😁

@ropensci ropensci deleted a comment from fishinges Aug 13, 2021