robots.txt files are not being respected correctly #184
Comments
fixed in
I think it's still not quite right... I had a robots.txt file that had:

    Disallow: /careers/login

And I got a scrape:

    ...url.com%2Fcareers%2Flogin%3FreturnTo%3D%2Fcareers%2Ffield%2Dsales.html

Which decodes to:

    ...url.com/careers/login?returnTo=/careers/field-sales.html

Does everyone want me to take a stab at fixing it?
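To make the mismatch concrete: once the percent-encoded URL is decoded, its path clearly falls under the `Disallow: /careers/login` prefix. Here is a minimal standalone check, not spider's code; the string below is a simplified, hypothetical stand-in for the URL above, and the `percent-encoding` crate is an assumed dependency.

```rust
// Not spider's code: a minimal check that the decoded path matches the
// Disallow rule. The encoded string is a simplified stand-in for the
// scraped URL quoted above.
use percent_encoding::percent_decode_str;

fn main() {
    let encoded = "/careers%2Flogin%3FreturnTo%3D%2Fcareers%2Ffield%2Dsales.html";
    let decoded = percent_decode_str(encoded).decode_utf8_lossy();
    // decoded == "/careers/login?returnTo=/careers/field-sales.html"
    println!("disallowed: {}", decoded.starts_with("/careers/login")); // prints "disallowed: true"
}
```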
Hello, what website did you use and what user-agent? This will help us test, etc.
Oxide.Computer, and I was using a slightly modified version of the download.rs example (with respect_robots_txt = true and subdomains = true), which uses SpiderBot.
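For reference, the tweak described above would look roughly like this. This is a sketch rather than the actual download.rs example: the two configuration field names come straight from the comment, while the rest follows the spider crate's `Website` API as I understand it and may differ between versions.

```rust
// Sketch of the modified download.rs setup described above; not the
// actual example code. respect_robots_txt and subdomains are the fields
// mentioned in the comment; everything else is assumed.
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://oxide.computer");
    website.configuration.respect_robots_txt = true; // honor robots.txt
    website.configuration.subdomains = true;         // crawl subdomains too
    // The stock download.rs example identifies itself as "SpiderBot".
    website.crawl().await;
}
```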
Wow! Awesome! So fast!!! Thanks!
I have a robots.txt that is not being respected by spider-rs currently.

Upon debugging, there seem to be two issues:

1. `RobotFileParser` is not being called correctly. It seems that `RobotFileParser` is expecting just paths, while `Website::is_allowed_robots` is passing whole URLs:

       spider/spider/src/website.rs, lines 401 to 407 in 4b8a604

   Here's GDB output from the `can_fetch` call inside the function:

2. `Website::is_allowed_robots` is not passing the user-agent to the `RobotFileParser`. The `'*'` parameter in the code above should be `self.configuration.user_agent` rather than `'*'`.
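If it helps, a rough sketch of the fix implied by both points might look like the following. This is not the crate's actual code: the `can_fetch(user_agent, path)` signature and the configured user agent are taken from the issue text, and the `url` and `robotparser` crates are assumptions standing in for whatever spider actually uses.

```rust
// Rough sketch of the suggested fix, not spider's real implementation:
// (1) reduce the link to a path before asking the robots parser, and
// (2) pass the configured user agent instead of "*".
use url::Url;

fn is_allowed_robots_sketch(
    robots: &robotparser::RobotFileParser, // assumed parser type and signature
    user_agent: &str,                      // e.g. the crawler's configured user agent
    link: &str,
) -> bool {
    // The parser reportedly matches Disallow rules against paths, not
    // absolute URLs, so strip the scheme and host first.
    let path = match Url::parse(link) {
        Ok(u) => u.path().to_string(),
        Err(_) => link.to_string(),
    };
    // Pass the real user agent so agent-specific rules are honored.
    robots.can_fetch(user_agent, &path)
}
```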