Add reason why the archive bot is joining the room (#262)
Uses the join `reason` added in [MSC2367](matrix-org/matrix-spec-proposals#2367). Unfortunately, this PR doesn't have much visible effect yet because few clients seem to support it (Element, for example, does not).

Part of #257
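
For context, a join request carrying a `reason` might look roughly like the sketch below. It assumes the standard client-server `POST /_matrix/client/v3/join/{roomIdOrAlias}` endpoint; `buildJoinRequest`, the homeserver URL, the room alias, and the token are hypothetical placeholders and not code from this commit (the commit itself goes through `fetchEndpointAsJson`).

```javascript
// Sketch only: `buildJoinRequest` is a hypothetical helper, not part of this commit.
// It builds (but does not send) a join request that carries a `reason`.
function buildJoinRequest(homeserverUrl, roomIdOrAlias, accessToken, reason) {
  // Room IDs and aliases contain `#`, `!` and `:`, so they must be URL-encoded.
  const url = new URL(
    `/_matrix/client/v3/join/${encodeURIComponent(roomIdOrAlias)}`,
    homeserverUrl
  );
  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,
        'Content-Type': 'application/json',
      },
      // The `reason` ends up in the content of the resulting membership event.
      body: JSON.stringify({ reason }),
    },
  };
}

// Placeholder values for illustration only.
const { url, options } = buildJoinRequest(
  'https://matrix.example.org',
  '#some-room:example.org',
  'syt_example_token',
  'Joining room to check history visibility.'
);
console.log(url);
// To actually send it (Node 18+): await fetch(url, options);
```

Whether the reason is ever shown to room members depends on the client rendering the membership event.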
MadLittleMods authored Jun 9, 2023
1 parent 8da9b3d commit 1dd6321
Showing 3 changed files with 65 additions and 21 deletions.
56 changes: 40 additions & 16 deletions docs/faq.md
@@ -17,31 +17,55 @@ And with the introduction of the jump to date API via
[MSC3030](https://github.com/matrix-org/matrix-spec-proposals/pull/3030), we could show
messages from any given date and day-by-day navigation.

## How do I opt out and keep my room from being indexed by search engines?

All public Matrix rooms are accessible to view in the Matrix Public Archive. But only
rooms with history visibility set to `world_readable` are indexable by search engines.

Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
opt out controls.
## Why did the archive bot join my room?

Only public Matrix rooms with `shared` or `world_readable` [history
visibility](https://spec.matrix.org/latest/client-server-api/#room-history-visibility) are
accessible in the Matrix Public Archive. In some clients like Element, the `shared`
option equates to "Members only (since the point in time of selecting this option)" and
`world_readable` to "Anyone" under the **room settings** -> **Security & Privacy** ->
**Who can read history?**.

But the archive bot (`@archive:matrix.org`) will join any public room because it doesn't
know the history visibility without first joining. Any room without `world_readable` or
`shared` history visibility will lead to a `403 Forbidden`. If such a room is in the room
directory, it will still be listed in the archive, but opening it will lead to a `403
Forbidden`.

The Matrix Public Archive doesn't hold onto any data (it's
stateless) and requests the messages from the homeserver every time. The
[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
minutes for the current day, and 2 days for past content.
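
The caching rule above could be sketched as follows. This is an illustration only, not the instance's actual caching code: `cacheMaxAgeSeconds` is a hypothetical name, and treating "current day" as the current UTC day is an assumption.

```javascript
// Sketch only: hypothetical helper, assuming "current day" means the current UTC day.
const FIVE_MINUTES_IN_SECONDS = 5 * 60;
const TWO_DAYS_IN_SECONDS = 2 * 24 * 60 * 60;

function cacheMaxAgeSeconds(requestedDate, now = new Date()) {
  const isSameUtcDay =
    requestedDate.getUTCFullYear() === now.getUTCFullYear() &&
    requestedDate.getUTCMonth() === now.getUTCMonth() &&
    requestedDate.getUTCDate() === now.getUTCDate();
  // Today's page can still change, so cache it briefly; past days are stable.
  return isSameUtcDay ? FIVE_MINUTES_IN_SECONDS : TWO_DAYS_IN_SECONDS;
}
```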

For [archive.matrix.org](https://archive.matrix.org/), you can ban the
`@archive:matrix.org` user if you don't want your room content to be shown in the
archive at all.
The Matrix Public Archive only allows rooms with `world_readable` history visibility to
be indexed by search engines. See the [opt
out](#how-do-i-opt-out-and-keep-my-room-from-being-indexed-by-search-engines) topic
below for more details.

## Why does the archive user join rooms instead of browsing them as a guest?
### Why does the archive user join rooms instead of browsing them as a guest?

Guests require `m.room.guest_access` to access a room. Most public rooms do not allow
guests because even the `public_chat` preset when creating a room does not allow guest
access. Not being able to view most public rooms is the major blocker on being able to
use guest access. The idea is if I can view the messages from a Matrix client as a
random user, I should also be able to see the messages in the archive.
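
As a rough illustration of the guest restriction, a room only admits guests when its `m.room.guest_access` state event has `guest_access` set to `can_join`. The helper name and the sample state below are hypothetical, and the sketch is not code from the archive.

```javascript
// Sketch only: hypothetical helper that inspects a room's state events.
function guestsCanJoin(stateEvents) {
  const guestAccessEvent = stateEvents.find(
    (event) => event.type === 'm.room.guest_access' && event.state_key === ''
  );
  // A missing event is treated the same as `forbidden`.
  return guestAccessEvent?.content?.guest_access === 'can_join';
}

// Sample state resembling a room created with the `public_chat` preset.
const publicChatState = [
  { type: 'm.room.join_rules', state_key: '', content: { join_rule: 'public' } },
  { type: 'm.room.history_visibility', state_key: '', content: { history_visibility: 'shared' } },
  { type: 'm.room.guest_access', state_key: '', content: { guest_access: 'forbidden' } },
];
console.log(guestsCanJoin(publicChatState));
```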

Keep in mind that only rooms with history visibility set to `world_readable` are
indexable by search engines. The Matrix Public Archive doesn't hold onto any data (it's
stateless) and requests the messages from the homeserver every time. The
[archive.matrix.org](https://archive.matrix.org/) instance has some caching in place, 5
minutes for the current day, and 2 days for past content.
Guest access is also a much different ask than read-only access, since guests can also
send messages in the room, which isn't always desirable. The archive bot is read-only and
does not send messages.

## How do I opt out and keep my room from being indexed by search engines?

Only public Matrix rooms with `shared` or `world_readable` history visibility are
accessible to view in the Matrix Public Archive. But only rooms with history visibility
set to `world_readable` are indexable by search engines.
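
The two rules above can be summarized in a small sketch. `describeArchiveAccess` is a hypothetical helper written for illustration, assuming the room is public and the archive bot has not been banned; it is not how the archive implements the check.

```javascript
// Sketch only: hypothetical summary of the FAQ rules for a public room.
function describeArchiveAccess(historyVisibility) {
  // `shared` and `world_readable` rooms are viewable; `invited` and `joined` are not.
  const accessible = ['shared', 'world_readable'].includes(historyVisibility);
  // Only `world_readable` rooms may be indexed by search engines.
  const indexable = historyVisibility === 'world_readable';
  return { accessible, indexable, httpStatus: accessible ? 200 : 403 };
}

console.log(describeArchiveAccess('shared'));
```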

Also see https://github.com/matrix-org/matrix-public-archive/issues/47 to track better
opt out controls.

As a workaround for [archive.matrix.org](https://archive.matrix.org/) today, you can ban
the `@archive:matrix.org` user if you don't want your room content to be shown in the
archive at all.

## Technical details

20 changes: 19 additions & 1 deletion server/lib/matrix-utils/ensure-room-joined.js
@@ -3,14 +3,19 @@
const assert = require('assert');
const urlJoin = require('url-join');

const StatusError = require('../errors/status-error');
const { fetchEndpointAsJson } = require('../fetch-endpoint');
const getServerNameFromMatrixRoomIdOrAlias = require('./get-server-name-from-matrix-room-id-or-alias');
const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');

const config = require('../config');
const StatusError = require('../errors/status-error');
const basePath = config.get('basePath');
assert(basePath);
const matrixServerUrl = config.get('matrixServerUrl');
assert(matrixServerUrl);

const matrixPublicArchiveURLCreator = new MatrixPublicArchiveURLCreator(basePath);

async function ensureRoomJoined(
  accessToken,
  roomIdOrAlias,
@@ -43,6 +48,19 @@ async function ensureRoomJoined(
    method: 'POST',
    accessToken,
    abortSignal,
    body: {
      reason:
        `Joining room to check history visibility. ` +
        `If your room is public with shared or world readable history visibility, ` +
        `it will be accessible at ${matrixPublicArchiveURLCreator.archiveUrlForRoom(
          roomIdOrAlias
          // We don't need to include the `viaServers` option here because the archive
          // will already be joined to the room from this request itself and we don't
          // need to make the URL any longer/noisier than it needs to be.
        )}. ` +
        `See the FAQ for more details: ` +
        `https://github.com/matrix-org/matrix-public-archive/blob/main/docs/faq.md#why-did-the-archive-bot-join-my-room`,
    },
  });
  assert(
    joinData.room_id,
10 changes: 6 additions & 4 deletions test/e2e-tests.js
@@ -14,6 +14,7 @@ const chalk = require('chalk');
const RethrownError = require('../server/lib/errors/rethrown-error');
const MatrixPublicArchiveURLCreator = require('matrix-public-archive-shared/lib/url-creator');
const { fetchEndpointAsText, fetchEndpointAsJson } = require('../server/lib/fetch-endpoint');
const ensureRoomJoined = require('../server/lib/matrix-utils/ensure-room-joined');
const config = require('../server/lib/config');
const {
MS_LOOKUP,
@@ -999,10 +1000,11 @@ describe('matrix-public-archive', () => {
// avoid problems jumping to the latest activity since we can't control the
// timestamp of the membership event.
const archiveAppServiceUserClient = await getTestClientForAs();
await joinRoom({
  client: archiveAppServiceUserClient,
  roomId: roomId,
});
// We use `ensureRoomJoined` instead of `joinRoom` because we're joining
// the archive user here and want the same join `reason` to avoid a new
// state event being created (`joinRoom` -> `{ displayname, membership }`
// whereas `ensureRoomJoined` -> `{ reason, displayname, membership }`)
await ensureRoomJoined(archiveAppServiceUserClient.accessToken, roomId);

// Just spread things out a bit so the event times are more obvious
// and stand out from each other while debugging and so we just have