Heritrix does not currently support the robots.txt wildcard extension. There is an open feature request for it at #250. I've updated the note to webmasters in the GitHub wiki and the old Confluence wiki to say so. Thanks!
Does Heritrix 3.3 support wildcards in robots.txt Disallow directives? I urge that a clear "yes" or "no" answer be added to the documentation.
From my experimentation, it appears that it does not support wildcards. For example, with the rule
Disallow: /*/output/
in place, Heritrix still crawled URLs like
/docview/5819152/FE3F6F718FE34D90PQ/5819152/5819152/Record/FE3F6F718FE34D90PQ/input/MathML
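To illustrate the difference being asked about, here is a minimal Python sketch. It is not Heritrix's code. It assumes that a crawler without wildcard support compares a Disallow rule as a literal path prefix, and that the wildcard extension treats `*` as "any run of characters" and a trailing `$` as an end anchor. The function names and the sample path are hypothetical.

```python
import re


def prefix_blocked(path: str, rule: str) -> bool:
    """Literal prefix matching: '*' is treated as an ordinary character."""
    return path.startswith(rule)


def wildcard_blocked(path: str, rule: str) -> bool:
    """Wildcard matching: '*' matches any characters, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start of the path, as a robots.txt rule does.
    return re.match(pattern, path) is not None


rule = "/*/output/"
path = "/docview/5819152/output/MathML"  # hypothetical path, for illustration only

print(prefix_blocked(path, rule))    # False: no path literally starts with "/*/"
print(wildcard_blocked(path, rule))  # True: "*" stands in for "docview/5819152"
```

Under the prefix-only assumption, a rule containing `*` can only block paths that literally contain an asterisk at that position, which would explain why such a rule appears to have no effect.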