If I recall correctly, Hyphe currently does not respect robots.txt files. Depending on their ethical framework, however, it may be important for researchers to be able to exclude pages that are explicitly marked as non-crawlable.
Could this be made an option? I don't know exactly how Hyphe interfaces with Scrapy, but perhaps Scrapy's robots.txt middleware could be toggled for this?
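For what it's worth, Scrapy already ships this behaviour behind the standard `ROBOTSTXT_OBEY` setting, which activates its `RobotsTxtMiddleware`. If Hyphe builds its crawler programmatically, a minimal sketch of the toggle might look like the following; the `respect_robots` parameter name is just illustrative, not anything Hyphe actually exposes:

```python
# Sketch: toggling Scrapy's built-in robots.txt handling via crawler settings.
# ROBOTSTXT_OBEY is a standard Scrapy setting; when True, Scrapy's
# RobotsTxtMiddleware fetches each site's robots.txt and drops requests
# to disallowed URLs before they are downloaded.
from scrapy.crawler import CrawlerProcess

def build_process(respect_robots: bool = True) -> CrawlerProcess:
    """Create a crawler process that optionally honours robots.txt.

    `respect_robots` is a hypothetical option name, standing in for
    whatever per-corpus setting Hyphe might expose to researchers.
    """
    return CrawlerProcess(settings={
        "ROBOTSTXT_OBEY": respect_robots,
    })
```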
A nice bonus would be if it were also possible to exclude links marked with `rel="nofollow"`, but I see this is already covered in another issue (#86). A rough sketch of what that could look like is below.
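Just to illustrate (the details belong in #86): since `rel` can carry multiple space-separated tokens, a spider callback would need to check for the `nofollow` token rather than the whole attribute value. A minimal sketch, with a hypothetical spider name and start URL:

```python
import scrapy

class PoliteSpider(scrapy.Spider):
    """Illustrative spider that skips links marked rel="nofollow"."""
    name = "polite"
    start_urls = ["https://example.org/"]

    def parse(self, response):
        for anchor in response.css("a[href]"):
            # rel may hold several tokens, e.g. "nofollow noopener",
            # so split it rather than comparing the raw string.
            rel_tokens = (anchor.attrib.get("rel") or "").split()
            if "nofollow" in rel_tokens:
                continue  # the publisher asked crawlers not to follow this link
            yield response.follow(anchor, callback=self.parse)
```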