Many instances have restrictive robots.txt #20

Open

Minoru opened this issue Nov 4, 2021 · 1 comment

Comments
Minoru (Owner) commented Nov 4, 2021

I just implemented support for robots.txt (#4), and I'm seeing a drop in the number of "alive" instances. Apparently Pleroma used to ship a deny-all robots.txt, and these days it's configurable.
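For context, a deny-all robots.txt is just a wildcard user agent with a root disallow rule; something like the generic example below (not a verbatim copy of Pleroma's old default):

```
User-agent: *
Disallow: /
```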

I'm happy that this code works, but I'm unhappy that it hurts the statistics this much.

I think I'll deploy this spider as-is, and then start a conversation on what should be done about this. An argument could be made that, since the spider only accesses a fixed number of well-known locations, it should be exempt from robots.txt. OTOH, it's a robot, so robots.txt clearly applies.
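For illustration, here is a minimal sketch of the kind of check involved, written in Python with the standard-library `urllib.robotparser`; the spider itself may be implemented differently, and the `fediverse-spider` user agent and `may_fetch_nodeinfo` helper are made up for this example:

```python
# Sketch: consult a host's robots.txt before fetching its NodeInfo document.
from urllib.robotparser import RobotFileParser

USER_AGENT = "fediverse-spider"  # hypothetical user agent, not the spider's real one


def may_fetch_nodeinfo(host: str) -> bool:
    """Return True if robots.txt on `host` allows fetching the NodeInfo endpoint."""
    rp = RobotFileParser()
    rp.set_url(f"https://{host}/robots.txt")
    try:
        rp.read()  # fetch and parse the live robots.txt
    except OSError:
        # robots.txt unreachable: conventionally treated as "allowed"
        return True
    return rp.can_fetch(USER_AGENT, f"https://{host}/.well-known/nodeinfo")


if __name__ == "__main__":
    print(may_fetch_nodeinfo("example.social"))
```

With a deny-all robots.txt as shown earlier, `can_fetch` returns False and the instance drops out of the "alive" statistics.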

Minoru (Owner, Author) commented May 10, 2022

My logs indicate that 2477 nodes forbid access to their NodeInfo via robots.txt. That's a sizeable share (roughly 31%), considering there are 7995 instances in my "alive" list at the moment.
