Add property to refuse indexing #75
On WordPress we solved it by simply removing all NodeInfo endpoints: https://github.com/Automattic/wordpress-activitypub/blob/93b2f1ee7d1d740ff9f0821deca0a69664cbf928/includes/class-activitypub.php#L250
Since NodeInfo is a common standard that can identify software types and similar metadata, it may be tempting to remove it entirely. By adding such a property instead, we can turn away the crawlers that support it while keeping NodeInfo's advantages.
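As a sketch of what that could look like: a NodeInfo 2.1 document whose free-form `metadata` object carries a hypothetical `indexing` flag (the key name is made up here; nothing like it exists in the schema yet):

```json
{
  "version": "2.1",
  "software": { "name": "example", "version": "1.0.0" },
  "protocols": ["activitypub"],
  "services": { "inbound": [], "outbound": [] },
  "openRegistrations": false,
  "usage": { "users": {} },
  "metadata": {
    "indexing": false
  }
}
```

A cooperating crawler would fetch the document, see the flag, and skip the instance; a non-cooperating one is unaffected, which is exactly why this can never be a privacy feature.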
I see four broad client usage categories:

- discovery
- indexing
- statistics
- internal (the node uses the data itself)
Can others think of more? What are we targeting here with this property? If we add it, we should be clear in the documentation about the intended use and about the fact that it requires the client's cooperation, so it can never be interpreted as a privacy feature. Whether to expose certain statistics for privacy reasons always remains the responsibility of the implementing server software.
I was primarily thinking of using it for public aggregators (misskey-dev/misskey#11213).
As for the-federation.info, I could see a use case. So what about an optional

```json
"robots": {
  "disallow": [],
  "allow": ["*"]
},
```

... which follows the robots.txt convention for agent definition? That way we stay generic and don't interfere with the proposed usages.
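Presumably, under that convention an admin could then turn away one named client while admitting everyone else; the agent identifier below is invented for illustration:

```json
"robots": {
  "disallow": ["some-aggregator"],
  "allow": ["*"]
}
```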
-> #82
I'm not sure I'm a big fan of adopting the robots.txt language and referencing it here, actually; it might get a little ambiguous whether it's meant to restrict NodeInfo clients only or crawlers of any part of the website in general. Also, "Web Robots" and "crawling" are not well-defined terms anywhere in the standard so far; see the first paragraph of https://github.com/jhass/nodeinfo/blob/main/PROTOCOL.md.

Come to think of it, we don't really specify anywhere that a client should have a specific identifier and communicate it to the server in a particular way, so we should probably extend the protocol in this regard. Perhaps more fitting terms that come to my mind would be things like ...

If we wanted to avoid clients having to pick an identifier, we could also try to define some broad usage categories for the data and allow servers to pick which ones to use (see the sketch below). Of course that wouldn't allow excluding specific clients, or including only specific clients (beyond the server blocking them via firewall rules).
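As a rough sketch of that category idea (the property name and category strings are hypothetical, borrowed from the use cases listed elsewhere in this thread):

```json
"metadata": {
  "allowedUsage": ["discovery", "statistics"]
}
```

A statistics aggregator would then check the list for its own category rather than for its own name, so no per-crawler identifier list ever needs maintaining on the server.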
The approach I'm using in my crawler, which feeds into https://nodes.fediverse.party/ and https://the-federation.info/, is to check robots.txt. A software-specific robots.txt (and any fields based on the same idea) seems an untenable solution to me, because it requires the administrator to keep their "disallow" lists up-to-date with all new crawlers.
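For comparison, the plain robots.txt route needs no schema change at all; an admin could exclude every well-behaved crawler from the NodeInfo endpoints with something like the following (the document path under /nodeinfo/ varies by software, so this is illustrative):

```
# Turn away all well-behaved crawlers from the NodeInfo endpoints
User-agent: *
Disallow: /.well-known/nodeinfo
Disallow: /nodeinfo/
```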
Known use cases: discovery; indexing; statistics; internal (the node uses it itself); ... ?
I would go for opt-out |
It would be nice to have a property that denies indexing to aggregation services (e.g. the Mastodon Server Index), just like

```html
<meta name="robots" content="noindex">
```

in HTML.

Related downstream issue: misskey-dev/misskey#11213