Respect robots.txt #376

stijn-uva · 2020-03-09T12:23:50Z

If I recall correctly, Hyphe currently does not respect robots.txt files. Depending on their ethical framework, however, researchers may need to be able to exclude pages that are explicitly marked as non-crawlable.

Could this be made an option? I don't know exactly how Hyphe interfaces with Scrapy, but perhaps Scrapy's robots.txt middleware could be toggled for this?
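To make the suggestion concrete, here is a minimal sketch of what such a toggle could look like. `ROBOTSTXT_OBEY` is the Scrapy setting that switches its robots.txt downloader middleware on or off. The helper name `build_crawl_settings` and the idea that Hyphe would pass a settings dictionary like this to its crawler are assumptions for illustration, not a description of Hyphe's code.

```python
# Sketch only: ROBOTSTXT_OBEY is a real Scrapy setting; the helper name and
# the way its result would reach Hyphe's crawler are hypothetical.
from typing import Any, Dict, Optional


def build_crawl_settings(
    obey_robots: bool, base: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return a copy of `base` with Scrapy's robots.txt handling toggled."""
    settings = dict(base or {})
    # When True, Scrapy's RobotsTxtMiddleware fetches each site's robots.txt
    # and drops requests that the file disallows.
    settings["ROBOTSTXT_OBEY"] = bool(obey_robots)
    return settings


if __name__ == "__main__":
    # A per-crawl option exposed via the API could map straight onto this flag.
    print(build_crawl_settings(True, {"DOWNLOAD_DELAY": 1}))
```

Keeping the flag per crawl job rather than global would let researchers choose the behaviour that matches their own ethical framework.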

A nice bonus would be if it were possible to also exclude links marked with rel="nofollow", but I see this is already covered in another issue (#86).
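For reference, a rough standard-library sketch of the rel="nofollow" filtering I have in mind. The class and function names are hypothetical, and this is not how Hyphe or Scrapy extract links; it only shows the intended behaviour.

```python
# Sketch only: collects hrefs from <a> tags, skipping any link whose rel
# attribute contains the "nofollow" token. All names here are hypothetical.
from html.parser import HTMLParser
from typing import List


class FollowableLinkParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def followable_links(html: str) -> List[str]:
    """Return hrefs of links that are not marked rel="nofollow"."""
    parser = FollowableLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```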

boogheta · 2020-03-09T12:27:57Z

Yes, that would definitely be a nice feature, and not too complex to implement alongside other advanced crawl features, at least via the API for a start.

boogheta added crawler feature labels Mar 9, 2020

stijn-uva mentioned this issue Sep 30, 2021

Setting to make scrapy ignore/follow robots.txt #421

Closed

boogheta closed this as completed Oct 22, 2021