MSC2291: Configuration to Control Crawling #2291
I wonder if this could be applied to homeservers as well. I.e., for a homeserver to join a room, it must agree to keep room data confidential to the best of its ability. Homeserver owners would have to confirm that they do not mine data from rooms in order to join rooms with such an option.
Possibly, but that would probably be the subject for a different MSC. Though I think it would be hard to encode what policies a homeserver admin needs to agree with.
How do these get defined? Do common ones get specced? Is it purely based on "others used this", or why not do this via an MXID? Basically, how do I know what to look for when I build a bot? Having this for each bot individually would probably make it hard for admins to use.
The names would use the Java package naming convention, but bot authors/admins would declare the name(s) that they use. So, for example, Travis could decide that his Voyager bot uses the `io.t2bot.voyager` name. Your server stats bot could then use either both `io.t2bot.voyager` and `dev.nordeganken.serverstats`, if you think that its behaviour is close enough to the original behaviour, or just `dev.nordeganken.serverstats` if you think that its behaviour is sufficiently different that it should no longer be considered as a Voyager bot.
This is similar to how web crawlers define their own `User-agent` when checking `robots.txt`.
Perhaps when we get extensible profiles, we can add something in there so that bots can declare which names the bot uses.
OK, yeah, that makes sense. That just leaves the question of how a) room admins and b) new bot writers who aren't as connected as we are get to know these keys :) That's kind of the one flaw I see here. While I see this MSC as necessary, it is only as effective as the number of known bot names, i.e. the names that bots filter for. For robots.txt this is pretty much solved by having lists of them, and I am not sure how to solve it in the spec considering how long spec changes take. So maybe this should be part of an appendix, or even better, a key in the bots entry for the "Try matrix now" page (to make those keys somewhat discoverable)?
In order to best avoid displaying a room that isn't allowed, it would be nice if the `m.room.robots` content were also included in the `/publicRooms` room directory response, probably under the `robots` key for a `PublicRoomsChunk`.
Otherwise, a stateless app navigating the room directory has to make a request for each room to determine whether it's allowed.
I'd be more in favour of adding it to the stripped state for the room, and exposing stripped state properly on `/publicRooms`. The extensibility of `PublicRoomsChunk` doesn't scale.
Sounds great to me 👍
Stripped state spec docs for reference: https://spec.matrix.org/v1.7/client-server-api/#stripped-state
`m.room.history_visibility` would be another good one to add, but it seems like that would fall better under a separate MSC.
In late October 2023 this problem came up again. It has become desirable to be able to opt out of aggregated room directory searches, i.e. searches that aggregate results from multiple room directories.
To use the robots event in this context, a query param could be used to ask only for rooms that allow themselves to be returned in aggregated searches. This would allow a distinction like the one between showing up on Google search and being public on your website. In this case, a room would show up in direct searches like those current-gen clients do, but be invisible to aggregated searches powered by spiders.
What is the difference between `messages` and `log`? Do there need to be two, or is one permission for messages in general fine?
The difference is that with just `messages`, but not `log`, the bot can process the room's messages, but cannot display them to end users. For example, say that the bot is part of some room searching thing, and you ask it for rooms related to "cats". For rooms that just have `messages` enabled, it can say "Here are some rooms that I think are related to cats". For rooms that have both `messages` and `log` enabled, it can say "Here are some rooms that I think are related to cats, and here are some messages from the room that are about cats". (I don't think it makes sense to have `log` enabled, and `messages` disabled.)
(And in fact, the MSC does say that if `messages` is `false`, then `log` is `false`.)
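The `messages`/`log` relationship described above could be sketched roughly as follows. This is only an illustration: the function names `effective_permissions` and `search_reply` are hypothetical and do not come from the MSC, and the sketch only encodes the one rule stated in this thread (if `messages` is `false`, then `log` is `false`).

```python
def effective_permissions(messages: bool, log: bool) -> dict:
    """Apply the rule from the thread: `log` only counts if `messages` is set.

    Hypothetical helper; the name and return shape are assumptions.
    """
    return {"messages": messages, "log": messages and log}


def search_reply(room: str, perms: dict, matching_messages: list):
    """Build a reply for the "cats" style search example.

    Returns None when the bot may not process the room's messages at all.
    """
    if not perms["messages"]:
        return None
    reply = f"{room} looks related to your search"
    if perms["log"]:
        # Only rooms with `log` enabled may have message content shown.
        reply += ", for example: " + "; ".join(matching_messages)
    return reply


perms = effective_permissions(messages=True, log=False)
print(search_reply("#cats:example.org", perms, ["I love cats"]))
```

With `messages` enabled and `log` disabled, the reply names the room but quotes none of its messages.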
Why would the custom ones need to be named in a different format? If I remember the "Java package naming convention" correctly, the ones proposed above do not follow this scheme. Or am I missing something? It feels inconsistent.
Yeah, we could use `m.*` for the pre-defined keys. The format used here is consistent with what we do with room events, but aside from that, I'm fine with either way.
I wonder whether it'd be easier for bot authors to parse `io.t2bot` or `io.t2bot*` here.
I think that `io.t2bot*` would imply that `io.t2botfoobar` would use that key as well, and `io.t2bot.*` is unclear whether or not a bot named simply `io.t2bot` should use that key. But I'm largely indifferent to this issue, and anyone with strong opinions should give a good reason for one way or the other.
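To make the ambiguity concrete, here is a small sketch that treats the two spellings as shell-style globs using Python's standard `fnmatch.fnmatchcase`. Reading the keys as globs is an assumption for illustration only; the MSC may define matching differently. The `in_namespace` helper is hypothetical and just shows one way a bot could match a bare name such as `io.t2bot` without either surprise.

```python
from fnmatch import fnmatchcase


def in_namespace(bot_name: str, namespace: str) -> bool:
    """Hypothetical helper: match the namespace itself or any dotted child."""
    return bot_name == namespace or bot_name.startswith(namespace + ".")


names = ["io.t2bot", "io.t2bot.voyager", "io.t2botfoobar"]

for pattern in ("io.t2bot*", "io.t2bot.*"):
    # fnmatchcase is case-sensitive and treats "." as an ordinary character.
    matched = [name for name in names if fnmatchcase(name, pattern)]
    print(pattern, "->", matched)

print("namespace io.t2bot ->", [n for n in names if in_namespace(n, "io.t2bot")])
```

Under the glob reading, `io.t2bot*` also picks up `io.t2botfoobar`, while `io.t2bot.*` skips a bot named exactly `io.t2bot`, which is the trade-off described above.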
Combined with the other comment I made: would this apply only to the Voyager bot running on t2bot, or would it apply to all Voyager-type bots? In other words, is this a type or a user? I ask especially because most Voyager bots around crawl in slightly different ways. So a room admin might be fine with the one by Travis, which only looks at future messages after joining, while the admin doesn't want to allow mine because my bot also looks into the history.
I think I mostly answered this in #2291 (comment). Your bot could look up both `io.t2bot.voyager` and `dev.nordeganken.serverstats`, preferring `dev.nordeganken.serverstats`. So room admins could declare a config for just `io.t2bot.voyager`, in which case all Voyager-type bots would use the same config. Or they could declare a config for both `io.t2bot.voyager` and `dev.nordeganken.serverstats`, in which case your bot would use `dev.nordeganken.serverstats` for keys defined there and `io.t2bot.voyager` for other keys, and other Voyager-type bots would only use the `io.t2bot.voyager` config.
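The lookup order described above might be sketched like this. The event content shape used here (a mapping from bot name to a dict of keys) is an assumption made for the example and is not confirmed by this thread, and `resolve_config` is a hypothetical helper name.

```python
def resolve_config(robots_content: dict, names: list) -> dict:
    """Merge per-name configs; `names` is ordered most-preferred first.

    Assumed shape of `robots_content`: {bot_name: {key: value, ...}, ...}.
    """
    resolved = {}
    # Apply the least-preferred name first so preferred names override it.
    for name in reversed(names):
        resolved.update(robots_content.get(name, {}))
    return resolved


# Hypothetical room config declaring both names.
content = {
    "io.t2bot.voyager": {"messages": True, "log": True},
    "dev.nordeganken.serverstats": {"log": False},
}

# The server stats bot prefers its own name, then falls back to Voyager's.
print(resolve_config(content, ["dev.nordeganken.serverstats", "io.t2bot.voyager"]))

# Another Voyager-type bot only looks up the Voyager name.
print(resolve_config(content, ["io.t2bot.voyager"]))
```

In this sketch the stats bot takes `log` from its own entry and `messages` from the Voyager entry, while other Voyager-type bots see only the Voyager entry.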
The room summary APIs from MSC3266 could cover the federated peeking niche.
We would just need to layer on the `m.room.robots` info in a `robots` key (same as the room directory).
(Shout out to @turt2live for pointing MSC3266 out.)