
Add property to refuse indexing #75

Open
kakkokari-gtyih opened this issue Dec 6, 2023 · 10 comments

Comments

kakkokari-gtyih commented Dec 6, 2023

It would be nice to have a property that denies indexing by aggregation services (e.g. Mastodon Server Index), just like <meta name="robots" content="noindex"> in HTML.

Related downstream issue: misskey-dev/misskey#11213
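
For illustration, a cooperating aggregator could honor such a flag before ingesting a server. A minimal sketch in Python, assuming a hypothetical boolean noindex property inside the optional metadata object (the property name is invented here; nothing like it is standardized yet):

import json
import urllib.request

def wants_indexing(nodeinfo_url: str) -> bool:
    # Fetch the NodeInfo document and look for the hypothetical flag.
    with urllib.request.urlopen(nodeinfo_url) as resp:
        doc = json.load(resp)
    # A missing flag means "indexing allowed" (opt-out semantics).
    return not doc.get("metadata", {}).get("noindex", False)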

pfefferle (Contributor) commented

On WordPress we solved it by simply removing all NodeInfo endpoints: https://github.com/Automattic/wordpress-activitypub/blob/93b2f1ee7d1d740ff9f0821deca0a69664cbf928/includes/class-activitypub.php#L250
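
Not the actual WordPress code, but the same idea as a generic sketch: when the admin opts out, the server simply stops answering the NodeInfo discovery endpoint. The Flask app and the expose_nodeinfo setting below are purely illustrative:

from flask import Flask, abort, jsonify

app = Flask(__name__)
SETTINGS = {"expose_nodeinfo": True}  # hypothetical admin setting

@app.route("/.well-known/nodeinfo")
def nodeinfo_discovery():
    if not SETTINGS["expose_nodeinfo"]:
        abort(404)  # behave as if NodeInfo were never implemented
    return jsonify({"links": [{
        "rel": "http://nodeinfo.diaspora.software/ns/schema/2.1",
        "href": "https://example.org/nodeinfo/2.1",
    }]})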

kakkokari-gtyih (Author) commented Dec 6, 2023

Since NodeInfo is a common standard that can identify the software type and the like, removing it entirely (however tempting) would throw that benefit away. By adding such a property, we can refuse the crawlers that support it while keeping NodeInfo's advantages.

jhass (Owner) commented Dec 6, 2023

I see four broad client usage categories:

  • Public statistic aggregators and server lists - publicly listing servers and perhaps keeping a history of their stats
  • Inquiry services - Stuff where you can find out a specific thing about a server you already know (like https://version.diaspora.social/)
  • Private statistic aggregators - People scraping the network for fun and/or (scientific) profit
  • Other services - Hiding/showing certain features to a local user depending on the capabilities of a remote server; efficiency optimizations for pushing out content to other servers (don't bother sending this here, the server is too old / doesn't support the feature)

Can others think of more?

What are we targeting here with this property?

If we add this, we should be clear in the documentation about the intended use and about the fact that it requires cooperation from the client, so it can never be interpreted as a privacy feature. Deciding whether or not to expose certain statistics for privacy reasons always remains the responsibility of the implementing server software.

kakkokari-gtyih (Author) commented Dec 6, 2023

> What are we targeting here with this property?

I was primarily thinking of using it for public aggregators (misskey-dev/misskey#11213).

6543 (Contributor) commented Dec 10, 2023

As for the-federation.info, I could see a use case - so what about an optional robots entry:

...
"robots": {
  "disallow": [],
  "allow": ["*"]
},
...

which follows the robots.txt convention for agent definitions?

That way we stay generic and don't interfere with proposed usages ...
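
For the sake of discussion, a crawler identifying itself as, say, "the-federation.info" might evaluate such an entry roughly like this. A sketch only; the exact precedence rules between allow, disallow, and the "*" wildcard would need to be spelled out in the spec:

def may_index(robots: dict, client_id: str) -> bool:
    disallow = robots.get("disallow", [])
    allow = robots.get("allow", ["*"])
    # An explicit match on the client identifier wins over the wildcard.
    if client_id in disallow:
        return False
    if client_id in allow:
        return True
    if "*" in disallow:
        return False
    return "*" in allow

print(may_index({"disallow": [], "allow": ["*"]}, "the-federation.info"))  # True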

6543 (Contributor) commented Dec 17, 2023

-> #82

jhass (Owner) commented Dec 21, 2023

I'm actually not sure I'm a big fan of adopting the robots.txt language and referencing it here; it might get a little ambiguous whether it's meant to restrict NodeInfo clients only or crawlers of any part of the website in general. Also, "web robots" and "crawling" are not well-defined terms anywhere in the standard so far. See the first paragraph of https://github.com/jhass/nodeinfo/blob/main/PROTOCOL.md

Come to think of it, we don't really specify anywhere that a client should have a specific identifier and communicate it to the server in a particular way. So we should probably extend the protocol in this regard.

Perhaps more fitting terms that come to my mind would be things like client_policy, allowed_clients, allowed_usages or something along those lines.

If we wanted to avoid clients having to pick an identifier, we could also try to define some broad usage categories for the data and allow servers to pick which ones to permit. Of course that wouldn't allow excluding specific clients or including only specific clients (beyond the server blocking things via firewall rules).
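
As a sketch of that last idea, with entirely invented property and category names: the server lists the usage categories it permits, and a cooperating client checks its own category before using the data.

def usage_permitted(nodeinfo: dict, usage: str) -> bool:
    allowed = nodeinfo.get("metadata", {}).get("allowedUsages")
    if allowed is None:
        return True  # no policy stated: today's behaviour
    return usage in allowed

doc = {"metadata": {"allowedUsages": ["inquiry", "internal"]}}
print(usage_permitted(doc, "public_statistics"))  # False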

Minoru commented Dec 21, 2023

The approach I'm using in my crawler, which feeds into https://nodes.fediverse.party/ and https://the-federation.info/, is: check robots.txt, and check software-specific private/hide_in_statistics properties (for GNU Social, Friendica, Hubzilla, and Red). I wish NodeInfo had standardized a flag that said "don't include me in any data sets, don't count me toward any statistics".
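
Roughly, those crawler-side checks look like this (a sketch; the software-specific property names are simplified and differ per project):

import urllib.robotparser

def allowed_by_robots(base_url: str, crawler_agent: str) -> bool:
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(base_url.rstrip("/") + "/robots.txt")
    rp.read()
    return rp.can_fetch(crawler_agent, base_url.rstrip("/") + "/.well-known/nodeinfo")

def hidden_by_software_flag(site_config: dict) -> bool:
    # e.g. a "hide from statistics" / "private" toggle exposed by the software
    return bool(site_config.get("hide_in_statistics") or site_config.get("private"))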

robots.txt (and any fields based on the same idea) seems like an untenable solution to me, because it requires the administrator to keep their "disallow" lists up to date with all new crawlers.

6543 (Contributor) commented Dec 21, 2023

Known use cases: discovery; indexing; statistics; internal (the node uses it itself); ... ?

6543 (Contributor) commented Dec 21, 2023

I would go for opt-out
