MSC2291: Configuration to Control Crawling #2291
base: old_master
Changes from all commits
1cf962a
deb6cb5
c920c9f
24061a9
9b52343
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
# Configuration to Control Crawling | ||
|
||
Since Matrix is decentralised, there is no single directory where all rooms are | ||
listed. Some people are trying to solve this by creating bots that crawl | ||
public Matrix rooms to list in a directory, giving users a place where they can | ||
search for rooms. This is similar to how users rely on search engines to find | ||
web pages. | ||
|
||
However, although a room might be publicly available, room administrators might | ||
not want the room to be indexed, or may not want certain aspects of a room to | ||
be crawled. With web pages, the site owner can specify their preferences to | ||
crawlers using a [file placed in a well-known | ||
location](https://en.wikipedia.org/wiki/Robots_exclusion_standard). | ||
|
||
This proposal defines a way in which crawling and indexing preferences can be | ||
expressed for Matrix rooms. | ||
|
||
|
||
## Proposal | ||
|
||
For the purposes of this proposal, each bot should be given a name (or names) | ||
following the Java package naming convention. For example, the Voyager bot | ||
from t2bot.io could use the name `io.t2bot.voyager`. | ||
|
||
A new room state event, `m.room.robots`, defines which bots are allowed to | ||
index the room, and what data they are allowed to fetch and store from the | ||
room. The event content is an object whose keys are bot names (or `*`) and | ||
whose values are configuration objects, each of which is a map from parameter | ||
name to parameter value. A bot selects a configuration based on its name. | ||
When a bot wants to get a parameter from the configuration: | ||
|
||
- It checks whether the `m.room.robots` state has a key that matches its name, | ||
and whether the associated configuration object has a key for the parameter | ||
that it is looking for. If so, it uses that value. | ||
- If the state does not have a key that matches its name, or the configuration | ||
object does not contain the parameter in question, then the bot strips the | ||
last component from its name (for example, `io.t2bot.voyager` becomes | ||
`io.t2bot`), and looks for a configuration object using that name. | ||
- This is continued until the bot finds a parameter, or until it has stripped | ||
off all the components from its name. If no parameter value has been found, | ||
then the bot will check if the state has a key of `*` that has the parameter | ||
configured, and if so, will use that value. | ||
- Otherwise, it will use the default value for that parameter. | ||
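The lookup steps above can be sketched as follows. This is a minimal illustration, not part of the proposal: the function name `lookup_parameter` is hypothetical, and it assumes that "stripping a component" means dropping the last dot-separated component of the bot name.

```python
from collections.abc import Mapping
from typing import Any


def lookup_parameter(robots: Mapping, bot_name: str,
                     parameter: str, default: Any) -> Any:
    """Resolve one parameter for a bot that has a single dotted name.

    `robots` is the content of the `m.room.robots` state event.
    """
    name = bot_name
    while name:
        config = robots.get(name)
        if isinstance(config, Mapping) and parameter in config:
            return config[parameter]
        # Assumption: drop the last dot-separated component, e.g.
        # io.t2bot.voyager -> io.t2bot -> io -> "" (loop ends).
        name = name.rpartition(".")[0]

    # No name-specific value was found, so fall back to the `*` entry.
    wildcard = robots.get("*")
    if isinstance(wildcard, Mapping) and parameter in wildcard:
        return wildcard[parameter]

    # Finally, use the default value for this parameter.
    return default
```

The caller supplies `default`, since the default for each parameter depends on the room's join rules and history visibility.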
|
||
A bot may have multiple names that could be applicable to it. For example, if | ||
uhoreg.ca ran an instance of the Voyager bot, then the configuration for both | ||
How do these get defined? Do common ones get specced? Is it purely based on "others used this", or why not do this via an MXID? Basically, how do I know what to look for when I build a bot? Having this for all bots individually would probably make this hard for admins to use.
The names would be using the Java package naming convention, but bot authors/admins would declare the name(s) that they use. So, for example, Travis could decide that his Voyager bot uses the This is similar to how web crawlers define their own Perhaps when we get extensible profiles, we can add something in there so that bots can declare which names the bot uses.
OK, yeah, that makes sense. That just leaves me at how a) a room admin and b) new bot writers that aren't as connected as we are get to know these keys :) That's kind of one flaw I see here. While I see this MSC as necessary, it is only as effective as the number of known bot names/names that bots filter for. For robots.txt it is pretty much solved by having lists for this, and I am not sure how to solve this in the spec considering the amount of time spec changes take. So maybe this should be part of an appendix or, even better, a key in the bot's entry for the "Try matrix now" page? (to make those keys somewhat discoverable) |
||
`io.t2bot.voyager` and `ca.uhoreg.voyager` could be applicable. In this case, | ||
the bot should order the two names in some way, check the configuration using | ||
the first name, and if no value is found, check the configuration using the | ||
next name. The same approach works for more than two names. In general, the | ||
names should be ordered from more specific to more general, so in this case, | ||
`ca.uhoreg.voyager` would be checked first, then `io.t2bot.voyager`, and | ||
finally `*`. | ||
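A possible sketch of the multi-name lookup is shown below. The function name `lookup_with_names` is hypothetical. The proposal does not say whether the parent names of one name (such as `ca.uhoreg`) are tried before moving on to the next name; this sketch assumes they are, and checks `*` only after every name has been exhausted.

```python
from collections.abc import Mapping
from typing import Any, Sequence


def lookup_with_names(robots: Mapping, names: Sequence[str],
                      parameter: str, default: Any) -> Any:
    """Resolve a parameter for a bot with several names.

    `names` must already be ordered from most specific to most general.
    """
    for name in names:
        current = name
        while current:
            config = robots.get(current)
            if isinstance(config, Mapping) and parameter in config:
                return config[parameter]
            # Assumption: try the parent names of this name before the
            # next name in the list (ca.uhoreg.voyager -> ca.uhoreg -> ca).
            current = current.rpartition(".")[0]

    # `*` is consulted last, after all names have been tried.
    wildcard = robots.get("*")
    if isinstance(wildcard, Mapping) and parameter in wildcard:
        return wildcard[parameter]
    return default
```

For the uhoreg.ca instance of Voyager, the bot would pass `["ca.uhoreg.voyager", "io.t2bot.voyager"]` as `names`.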
|
||
Parameters defined in this proposal are: | ||
|
||
- `allow`: (boolean) whether the bot is allowed to crawl the room. If `false`, | ||
then the bot may not display any information about the room to users who are | ||
searching its directory, and may not store any information about the room | ||
other than its existence and its crawling preferences. The bot should also | ||
Comment on lines +58 to +60
In order to best avoid displaying a room that isn't allowed, it would be nice if the Otherwise, a stateless app navigating the room directory has to make a request for each room to determine whether it's allowed.
I'd be more in favour of adding it to the stripped state for the room, and exposing stripped state properly on
Sounds great to me 👍 Stripped state spec docs for reference: https://spec.matrix.org/v1.7/client-server-api/#stripped-state
Comment on lines +58 to +60
In late October 2023 this problem came up again. It has become desirable to be able to opt out of aggregated room directory searches, where results from multiple room directories are aggregated. As a way of using the robots event in this context, a query parameter could be used to ask only for rooms that allow themselves to be returned in aggregated searches. This would allow a distinction like the one between showing up in a Google search and being public on your website. In this case, a room would show up in direct searches like those that current-generation clients do, but be invisible to aggregated searches powered by spiders. |
||
avoid joining the room, or leave the room if it has already joined. If `true`, the bot | ||
may index the room, and may store and display the room's ID, name, avatar, | ||
aliases, canonical alias, topic, encryption status, join rules, and history | ||
visibility. Some other aspects of the room are controlled by specific | ||
parameters. Aspects that are neither listed above nor controlled by a | ||
different parameter are left to the discretion of the bot owner, who should | ||
in general err on the side of privacy. Default: `true` if | ||
`m.room.join_rules` is `public`, and `false` otherwise. | ||
- `members`: (boolean) whether the bot is allowed to index the room's members. | ||
This includes members' Matrix IDs, display names, and avatars. Default: | ||
`true` if `m.room.join_rules` is `public` and `false` otherwise. | ||
- `messages`: (boolean) whether the bot is allowed to index the room's | ||
messages. Default: `true` if `m.room.history_visibility` is | ||
`world_readable`, and `false` otherwise. | ||
- `log`: (boolean) whether the bot is allowed to display logs of the room to | ||
What is the difference between
The difference is that with just
(And in fact, the MSC does say that if |
||
users. This will be `false` if `messages` is `false`. Default: `true` if | ||
`m.room.history_visibility` is `world_readable`, and `false` otherwise. | ||
- `follow`: (boolean) whether the bot is allowed to follow links to other | ||
rooms. This will be `false` if `messages` is `false`. Default: `true` if | ||
`m.room.history_visibility` is `world_readable`, and `false` otherwise. | ||
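The defaults listed above depend only on the room's join rule and history visibility, so they can be computed up front. The sketch below is illustrative only; the function names `default_parameters` and `apply_dependencies` are hypothetical. The second function applies the rule that `log` and `follow` are `false` whenever `messages` is `false`.

```python
from typing import Dict


def default_parameters(join_rule: str, history_visibility: str) -> Dict[str, bool]:
    """Default values of the parameters defined in this proposal."""
    is_public = join_rule == "public"
    is_world_readable = history_visibility == "world_readable"
    return {
        "allow": is_public,
        "members": is_public,
        "messages": is_world_readable,
        "log": is_world_readable,
        "follow": is_world_readable,
    }


def apply_dependencies(params: Dict[str, bool]) -> Dict[str, bool]:
    """Force `log` and `follow` to false when `messages` is false."""
    result = dict(params)
    if not result.get("messages", False):
        result["log"] = False
        result["follow"] = False
    return result
```

A bot could resolve each parameter from `m.room.robots` using these defaults, then pass the resolved values through `apply_dependencies`.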
|
||
Bots may use other parameter names, but the names that are not listed in the | ||
Why would the custom ones need to be named in a different format? Aka if I remember the "Java package naming convention" the above proposed ones do not follow this scheme. Or am I missing something? It feels inconsistent.
Yeah, we could use |
||
Matrix spec must be namespaced following the Java package naming convention. | ||
|
||
Example: | ||
|
||
Suppose a room with `m.room.join_rules` set to `public`, and | ||
`m.room.history_visibility` set to `world_readable` has the following | ||
`m.room.robots`: | ||
|
||
```json
{
  "*": {
    "members": false
  },
  "io.t2bot": {
    "allow": false
  },
  "io.t2bot.voyager": {
    "allow": true,
    "io.t2bot.foo": "bar"
  }
}
```
Comment on the `"io.t2bot"` entry:
I wonder whether it'd be easier for bot authors to parse
I think that
Comment on the `"io.t2bot.voyager"` entry:
Combined with the other comment I made: Would this apply only to the Voyager bot running on t2bot, or would this apply to all Voyager-type bots? In other words, is this a type or a user? Most Voyager bots around crawl in slightly different ways, so a room admin might be fine with the one by Travis, which only looks at future messages after joining, while not wanting to allow mine because my bot also looks into the history.
I think I mostly answered this in #2291 (comment). Your bot could look up both
|
||
In this case, the Voyager bot would be allowed to index the room. No other bots | ||
from t2bot.io would be allowed to index it, but bots that are not from t2bot.io | ||
would be allowed to. No bots would be allowed to index the members, since that is | ||
specified in the configuration for `*`. All bots would be allowed to index | ||
messages and show logs to users, due to the history visibility settings (except | ||
for non-Voyager t2bot.io bots, since they are not allowed to index anything). | ||
Voyager additionally has a custom parameter of `io.t2bot.foo` defined. | ||
|
||
|
||
## Tradeoffs / potential issues / notes | ||
|
||
There are many aspects of a room whose crawling could potentially be controlled | ||
by individual parameters. This proposal attempts to strike a reasonable | ||
balance between allowing administrators control over crawling, and avoiding too | ||
many configuration options. Thus the parameters mainly target the parts | ||
of the room that are the most privacy-sensitive. | ||
|
||
As mentioned above, not all parts of the room are covered by configuration | ||
parameters. In this proposal, we trust bot owners to use their judgement in | ||
determining what is acceptable or not. Given that the preferences expressed in | ||
the room state are purely advisory, and the bot could just ignore the | ||
preferences, this is not seen as a security issue. However, bot owners are | ||
advised that if there is doubt about whether some information should be indexed, | ||
they should err on the side of privacy. Bots can also use the existing | ||
parameters to inform their decision on whether to index certain information. | ||
For example, a bot that tracks which web pages are linked to from various | ||
Matrix rooms might use the `log` and/or `follow` parameters to determine | ||
whether to process links in a certain room, depending on what it does with that | ||
information. Bots are also able to define their own parameters to control | ||
certain parts of their indexing, if the existing parameters are not sufficient. | ||
|
||
If allowed, bots may peek into the room to examine the `m.room.robots` state to | ||
determine whether they are allowed to index the room; a bot that is not allowed | ||
to index the room may not want to join it. However, a bot may not be able | ||
to peek into rooms that its server is not already a part of until | ||
[MSC1777](https://github.com/matrix-org/matrix-doc/pull/1777) is implemented. | ||
Comment on lines +139 to +141
The room summary APIs from MSC3266 could cover the federated peeking niche. Would just need to layer on the (shout out to @turt2live for pointing MSC3266 out) |
||
|
||
Clients can display the `m.room.robots` state to users to notify them of the | ||
crawling and indexing preferences of the room. This proposal does not attempt | ||
to define how this information is displayed to the user. | ||
|
||
Individual users may have preferences on whether bots index their messages or | ||
their membership in a room. This proposal does not address that issue, but it | ||
could perhaps be addressed by using a similar method in combination with | ||
[MSC1769](https://github.com/matrix-org/matrix-doc/pull/1769). | ||
|
||
|
||
uhoreg marked this conversation as resolved.
|
||
## Security considerations | ||
|
||
The configuration information is purely advisory, and should not be relied on | ||
for security since bots can simply ignore the configuration. | ||
|
||
|
||
## Unstable prefix | ||
|
||
Until this lands in the spec, the state event type | ||
`org.matrix.msc2291.room.robots` should be used in place of `m.room.robots`. |
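As a rough illustration, a bot could select the event type and build the client-server API path for reading the state event as follows. The helper name `robots_state_path` is hypothetical, and the sketch assumes the event uses an empty state key, which the proposal does not state explicitly.

```python
from urllib.parse import quote

STABLE_TYPE = "m.room.robots"
UNSTABLE_TYPE = "org.matrix.msc2291.room.robots"


def robots_state_path(room_id: str, use_unstable: bool = True) -> str:
    """Build the path for fetching the robots state event of a room.

    Assumption: the event is stored under the empty state key, so the
    path ends right after the event type.
    """
    event_type = UNSTABLE_TYPE if use_unstable else STABLE_TYPE
    return (
        "/_matrix/client/v3/rooms/"
        + quote(room_id, safe="")
        + "/state/"
        + quote(event_type, safe="")
        + "/"
    )
```

Once the proposal lands in the spec, a bot would call this with `use_unstable=False`.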
I wonder if this could be applied to homeservers as well. I.e, for a homeserver to join a room, they must agree to keep room data confidential to the best of their ability. Homeserver owners would have to confirm that they do not mine data from rooms to be able to join rooms with such an option.
Possibly, but that would probably be the subject for a different MSC. Though I think it would be hard to encode what policies a homeserver admin needs to agree with.