
robots.txt files are not being respected correctly #184

Closed
div72 opened this issue May 29, 2024 · 6 comments


div72 commented May 29, 2024

I have a robots.txt file that is currently not being respected by spider-rs.

Upon debugging, there seem to be two issues:

1. URLs are not getting passed to the RobotFileParser correctly. It seems that RobotFileParser is expecting just paths while Website::is_allowed_robots is passing whole URLs:

```rust
pub fn is_allowed_robots(&self, link: &str) -> bool {
    if self.configuration.respect_robots_txt {
        unsafe {
            self.robot_file_parser
                .as_ref()
                .unwrap_unchecked()
                .can_fetch("*", link)
```
Here's GDB output from the can_fetch call inside the function:

```
(gdb) p url_str
$16 = "https://div72.xyz"
(gdb) p default_entry.allowance(url_str)
$17 = true
(gdb) p default_entry.allowance("/")
$18 = false
(gdb) p default_entry.allowance("")
$19 = true
```
2. Website::is_allowed_robots is not passing the user-agent to the RobotFileParser. The "*" argument in the code above should be self.configuration.user_agent instead (see the sketch below).
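For illustration, here's a minimal sketch of both fixes as a hypothetical helper (not spider-rs code; it assumes the url crate, and the names are mine):

```rust
use url::Url;

// Hypothetical helper: reduce a full URL to the path-plus-query form that
// a robots.txt allowance check actually matches against.
fn robots_path(link: &str) -> String {
    match Url::parse(link) {
        Ok(u) => match u.query() {
            Some(q) => format!("{}?{}", u.path(), q),
            None => u.path().to_string(),
        },
        // Already a bare path (or unparsable): pass it through unchanged.
        Err(_) => link.to_string(),
    }
}

fn main() {
    // "https://div72.xyz" should be checked as "/", matching the GDB output above.
    assert_eq!(robots_path("https://div72.xyz"), "/");
    // The call site would then become something like
    //     parser.can_fetch(user_agent, &robots_path(link))
    // with user_agent taken from self.configuration.user_agent instead of "*".
}
```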
@j-mendez (Member)

fixed in 1.95.21, thanks!


ubedan commented Jun 26, 2024

I think it's still not quite right...

I had a robots.txt file that had:

Disallow: /careers/login

And I got a scrape:

...url.com%2Fcareers%2Flogin%3FreturnTo%3D%2Fcareers%2Ffield%2Dsales.html

Which decodes to:

...url.com/careers/login?returnTo=/careers/field-sales.html
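For what it's worth, a minimal sketch of the kind of check I'd expect (a hypothetical helper, assuming the percent-encoding crate; the names are mine):

```rust
use percent_encoding::percent_decode_str;

// Hypothetical check: percent-decode before matching, so an encoded
// "%2Fcareers%2Flogin..." is still caught by "Disallow: /careers/login".
fn violates_disallow(raw_path: &str, rule: &str) -> bool {
    percent_decode_str(raw_path).decode_utf8_lossy().starts_with(rule)
}

fn main() {
    let scraped = "%2Fcareers%2Flogin%3FreturnTo%3D%2Fcareers%2Ffield%2Dsales.html";
    assert!(violates_disallow(scraped, "/careers/login"));
}
```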

Does anyone want me to take a stab at fixing it?

@j-mendez (Member)

Hello, what website and user-agent did you use? This will help us test.


ubedan commented Jun 26, 2024

Oxide.Computer. I was using a slightly modified version of the download.rs example (with respect_robots_txt = true and subdomains = true set), which uses SpiderBot:

```rust
let website_name = "https://oxide.computer";
let mut website: Website = Website::new(website_name);
website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.delay = 3340; // Defaults to 250 ms
website.configuration.user_agent = Some(Box::new("SpiderBot".into()));
```
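For anyone reproducing this, a self-contained version of that setup might look like the sketch below (the tokio attribute, crawl(), and get_links() are my assumptions about the spider API; check against the actual download.rs example):

```rust
use spider::website::Website;

// Sketch of a full reproduction under the assumptions above.
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://oxide.computer");
    website.configuration.respect_robots_txt = true;
    website.configuration.subdomains = true;
    website.configuration.delay = 3340;
    website.configuration.user_agent = Some(Box::new("SpiderBot".into()));

    website.crawl().await;

    // Any /careers/login URL showing up here reproduces the bug.
    for link in website.get_links() {
        println!("{:?}", link);
    }
}
```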

@j-mendez (Member)

@ubedan f32820c fixed in 1.98.3, ty


ubedan commented Jun 26, 2024

Wow! Awesome! So fast!!! Thanks!
