
robots.txt files are not being respected correctly #184

Closed
div72 opened this issue May 29, 2024 · 6 comments


div72 commented May 29, 2024

I have a robots.txt file that is currently not being respected by spider-rs.

Upon debugging, there seem to be two issues:

1. URLs are not getting passed to the RobotFileParser correctly. It seems that RobotFileParser is expecting just paths while Website::is_allowed_robots is passing whole URLs:

```rust
pub fn is_allowed_robots(&self, link: &str) -> bool {
    if self.configuration.respect_robots_txt {
        unsafe {
            self.robot_file_parser
                .as_ref()
                .unwrap_unchecked()
                .can_fetch("*", link)
```
Here's GDB output from the can_fetch call inside the function:

```
(gdb) p url_str
$16 = "https://div72.xyz"
(gdb) p default_entry.allowance(url_str)
$17 = true
(gdb) p default_entry.allowance("/")
$18 = false
(gdb) p default_entry.allowance("")
$19 = true
```
2. Website::is_allowed_robots is not passing the user-agent to the RobotFileParser. The "*" argument in the code above should be self.configuration.user_agent instead (see the sketch below).
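For illustration, here's a minimal sketch of both fixes as a hypothetical helper (not spider-rs code; it assumes the url crate, and the names are mine):

```rust
use url::Url;

// Hypothetical helper: reduce a full URL to the path-plus-query form that
// a robots.txt allowance check actually matches against.
fn robots_path(link: &str) -> String {
    match Url::parse(link) {
        Ok(u) => match u.query() {
            Some(q) => format!("{}?{}", u.path(), q),
            None => u.path().to_string(),
        },
        // Already a bare path (or unparsable): pass it through unchanged.
        Err(_) => link.to_string(),
    }
}

fn main() {
    // "https://div72.xyz" should be checked as "/", matching the GDB output above.
    assert_eq!(robots_path("https://div72.xyz"), "/");
    // The call site would then become something like
    //     parser.can_fetch(user_agent, &robots_path(link))
    // with user_agent taken from self.configuration.user_agent instead of "*".
}
```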
@j-mendez (Member)

fixed in 1.95.21, thanks!


ubedan commented Jun 26, 2024

I think it's still not quite right...

I had a robots.txt file that had:

Disallow: /careers/login

And I got a scrape:

...url.com%2Fcareers%2Flogin%3FreturnTo%3D%2Fcareers%2Ffield%2Dsales.html

Which decodes to:

...url.com/careers/login?returnTo=/careers/field-sales.html
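For what it's worth, a minimal sketch of the kind of check I'd expect (a hypothetical helper, assuming the percent-encoding crate; the names are mine):

```rust
use percent_encoding::percent_decode_str;

// Hypothetical check: percent-decode before matching, so an encoded
// "%2Fcareers%2Flogin..." is still caught by "Disallow: /careers/login".
fn violates_disallow(raw_path: &str, rule: &str) -> bool {
    percent_decode_str(raw_path).decode_utf8_lossy().starts_with(rule)
}

fn main() {
    let scraped = "%2Fcareers%2Flogin%3FreturnTo%3D%2Fcareers%2Ffield%2Dsales.html";
    assert!(violates_disallow(scraped, "/careers/login"));
}
```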

Does anyone want me to take a stab at fixing it?

@j-mendez (Member)

Hello, what website and user-agent did you use? This will help us test.


ubedan commented Jun 26, 2024

Oxide.Computer. I was using a slightly modified version of the download.rs example (with respect_robots_txt = true and subdomains = true set), which uses SpiderBot:

```rust
let website_name = "https://oxide.computer";
let mut website: Website = Website::new(website_name);
website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.delay = 3340; // Defaults to 250 ms
website.configuration.user_agent = Some(Box::new("SpiderBot".into()));
```
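For anyone reproducing this, a self-contained version of that setup might look like the sketch below (the tokio attribute, crawl(), and get_links() are my assumptions about the spider API; check against the actual download.rs example):

```rust
use spider::website::Website;

// Sketch of a full reproduction under the assumptions above.
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://oxide.computer");
    website.configuration.respect_robots_txt = true;
    website.configuration.subdomains = true;
    website.configuration.delay = 3340;
    website.configuration.user_agent = Some(Box::new("SpiderBot".into()));

    website.crawl().await;

    // Any /careers/login URL showing up here reproduces the bug.
    for link in website.get_links() {
        println!("{:?}", link);
    }
}
```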

@j-mendez (Member)

@ubedan f32820c fixed in 1.98.3, ty


ubedan commented Jun 26, 2024

Wow! Awesome! So fast!!! Thanks!
