Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC3089: File tree structures #3089

Draft
wants to merge 3 commits into
base: old_master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
235 changes: 235 additions & 0 deletions proposals/3089-file-tree-structures.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
# MSC3089: File trees

Files are currently shared by uploading them to the media repo and putting a reference to that content
in a room message. This is fine for most use cases, such as sharing screenshots or short-lived documents,
however longer term, collaborative, structures are not quite possible.

This MSC defines an approach for defining data trees in Matrix, using a document hierarchy as an example
for how it could be applied.

Reading material:
* [MSC1772 - Spaces + Room types](https://github.com/matrix-org/matrix-doc/pull/1772)
* [MSC3088 - Room subtyping](https://github.com/matrix-org/matrix-doc/pull/3088)
* [MSC1767 - Extensible events](https://github.com/matrix-org/matrix-doc/pull/1767)

Optional but useful reading:
* [MSC1840 - Alternative room types](https://github.com/matrix-org/matrix-doc/pull/1840)
* [MSC2326 - Label-based filtering ("threading")](https://github.com/matrix-org/matrix-doc/pull/2326)
* [MSC2674 - Event relationships](https://github.com/matrix-org/matrix-doc/pull/2674)
* [MSC2676 - Message editing](https://github.com/matrix-org/matrix-doc/pull/2676)
* [MSC2946 - Spaces summary](https://github.com/matrix-org/matrix-doc/pull/2946)
* [MSC2962 - Space group access control](https://github.com/matrix-org/matrix-doc/pull/2962)
* [MSC2753 - Proper peeking](https://github.com/matrix-org/matrix-doc/pull/2753)
* [Spec - Withholding encryption keys](https://spec.matrix.org/unstable/client-server-api/#reporting-that-decryption-keys-are-withheld)

## Proposal

*Author's note: This proposal assumes the reader is familiar with the terminology of the reading
materials mentioned above.*

We introduce a new room subtype, `m.data_tree`, to be applied to spaces to denote that they are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say "room subtype" here, should I assume MSC3088 m.room.purpose state events with a state key of m.data_tree? (There's enough MSCs that mention types floating around that I'm having trouble distinguishing them all...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry. Will leave this open as a reminder to clarify.

data-driven trees. The subtype only needs to be applied to a parent space to affect all subspaces
of that space. For a file hierarchy, the room name for the spaces are the directory names. Note
that this subtype is *optional* and serves only to hide the tree from conversation-focused clients.

Spaces used in a tree-like way (with the `m.data_tree` subtype or not) are called "tree spaces" in
this proposal.

The context of the tree space denotes what it is representing. The 3 major expected types are:

1. A standalone data tree. This should be annotated with the `m.data_tree` subtype, and would represent
a shared directory of sorts, possibly shared publicly. This is similar to sending a share link
to a directory in a file syncing service (ie: Dropbox).
2. A data tree as part of a space, but not mirroring the structure of that space. This would also
have the `m.data_tree` subtype, and would best represent a shared drive within that space.
3. The space itself with no subtyping. This usually indicates that the space is structured such that
people can browse files uploaded anywhere for easier exploration. This is expected to be used
in conjunction with case 2. This case would end up potentially replacing the "Files" panel in
many conversational clients.

A limited example of what this would look like is (📂 denotes `m.data_tree` space, `📄` denotes a
file/leaf (described later), and `+` denotes a Space):

```
+ Acme Co.
+ Sales Team
+ 📂 Quarterly objectives
+ 📂 Q1 2021
- 📄 Targets
- 📄 End of quarter report
+ 📂 Q2 2021
- 📄 Targets
+ 📂 Q3 2021
+ 📂 Q4 2021
+ HR
- 📄 WIP: Time off requests v2
+ 📂 Personnel files
+ 📂 Policies
- 📄 Time off requests
+ 📂 Contract templates
```

In the example, the sales team has set up a subspace to hold all of their files and folders ("case 2"
from above). Access control would likely be a subset of Acme Co.'s members, limited to the sales team
space specifically. The HR space has a similar structure, though has decided to use a room which is *not*
subtyped to `m.data_tree` to upload some work-in-progress policies. The HR team also has a shared drive
which would almost certainly have space-defined access control.

In both team's cases, clients would not render the 📂 trees as browseable in a room list (typically). The
client would likely expose a "View files" button which then takes the user to a file browser of sorts
for the user to explore. The WIP policies would likely show up in the "Files" panel of the client, where
a link to explore the 📂 trees.

Tree spaces may contain non-space rooms under them to help perform access control. For example:

```
+ My Folder
- Regular room 1
- Regular room 2
+ 📂 Subfolder 1
```

When this happens, the rooms are treated effectively as more buckets under that parent node. In the above
example this would mean that anything posted to either "Regular room" would be listed under "My Folder"
instead. This is expected to be a rare choice of data structure, though can theoretically be used to
group files within a directory for simpler access control. Note that the "My Folder" space does not need
to be subtyped to have this happen.

Files are represented as room events either in the tree room (or in any non-space room under that tree
space). This is done by exposing a generic `m.leaf` type which is purely intended to be used to encourage
proper rendering within the extensible events scheme.

This intentionally does not use state events to represent each leaf as encrypted state events are not possible
currently. Other MSCs may wish to optimize the lookup of regular room events, though for now the intention
is that clients would parse events themselves.

A file would look something like this (when using extensible events):

```json5
{
"type": "m.file",
"content": {
"m.text": "targets.docx (12 KB)",
"m.file": {
"url": "mxc://example.org/abc123",
"name": "targets.docx",
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"size": 12000
},
turt2live marked this conversation as resolved.
Show resolved Hide resolved
"m.leaf": {}
}
}
```

Note that this would allow non-file types to be included in the tree. Clients would filter out anything
that doesn't make sense for their use case, such as ignoring text-only events. Events missing the `m.leaf`
description would be excluded as non-leaf events. No content for `m.leaf` is currently defined - clients
can interpret labels for files/other types through the extensible events format.

The `m.leaf` type is intentionally not used as the event type as fundamentally the user is uploading a file,
not a leaf. The leaf is essentially metadata on the event to describe how it could be rendered by clients
behaving in a suitable way. Otherwise, it's deliberate that the event shows up as a regular file upload in
the room.

Events can be edited to add the `m.leaf` metadata, adding them to the tree.

Since room events can be encrypted, it can mean that the `m.leaf` metadata gets encrypted too. This could
potentially make it harder on clients/servers to find *just* leaf events. As a workaround, clients can
include the `m.leaf` metadata to the encrypted `content` so it can be found by servers. Clients MUST still
include an encrypted copy in the event content, which clients MUST prefer over the plaintext version. As
an example, this could look like (keys will not be accurate):

```json5
// Encrypted event
{
"type": "m.file",
"content": {
"algorithm": "m.megolm.v1.aes-sha2",
"ciphertext": "Awga...oEkC",
"device_id": "UCCUUHBQQM",
"sender_key": "Vn+E+aPjvlbf14j1OWCIe5IlkTLZ4Zft628Mw8RysG4",
"session_id": "uXWJgrndwkutoKQVqsTsdamRDKqBAkgBawjeqaB+81s",
"m.leaf": {}
}
}
```

```json5
// Decrypted copy of event
{
"type": "m.file",
"content": {
"m.text": "targets.docx (12 KB)",
"m.file": {
"url": "mxc://example.org/abc123",
"name": "targets.docx",
"mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"size": 12000
},
"m.leaf": {
"com.example.custom_field": true
}
}
}
```

Note how the encrypted event excludes the custom field but the decrypted copy does not. This is to ensure
there is no unnecessary disclosure of information. Clients MUST NOT trust the `m.leaf` in the encrypted
event and must only consider the decrypted copy's `m.leaf`. This is to ensure that an `m.leaf` is *always*
present on an event that needs it, as some clients might optimize out the `m.leaf` without carrying it over.

**TODO: Decide on index versus the above (`m.leaf` accessible by server). Index is below.**

The client is expected to maintain a "branch" structure in the room state, denoting the active files and
where to find those files. This is done through a `m.branch` state event, where the state key is the event
ID of the file. An `m.branch` event looks like this:

```json5
{
"type": "m.branch",
"state_key": "$event",
"content": {
"active": true
Copy link
Member Author

@turt2live turt2live Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for mostly myself: we include a file name here to make simple edits easier.

edit: as name: string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also locked: true for file locking

}
}
```

When `active` is not exactly `true`, the file is considered invalid/inactive. Clients should ignore inactive
files. Clients should take reasonable efforts to resolve the latest version of a file: an edited file event
shouldn't need `m.branch` switching.

For some common operations:
* Deleting a file would mean redacting the event.
* Updating a file could mean editing it, or redacting and re-sending.
* Changing view file permissions could mean using an encrypted room and withholding keys.
* Changing upload permissions would mean altering power levels.
* Renaming a file would mean editing the label.
* Moving a file would mean redacting and re-sending in the right tree.
* Comments/notes on a file could be threads off the file.
* Anonymous browsing would be peeking into the various rooms.

Implementation-wise, the following may be useful:
* Using space summaries to render the directory structure.
* Peeking to get file listings.
* Group access controls to control who can (and can't) upload/view files.
* Encryption to protect files and get finer control over visibility.
* History visibility and join rules to manage publicity of the files.
* Room directory for discovering public file shares.

## Potential issues

***TODO***

## Alternatives

***TODO***

## Security considerations

***TODO***

## Unstable prefix

While this MSC is not in a stable version of the specification, implementations should
use `org.matrix.msc3089.` in place of `m.` - this means, for example, `org.matrix.msc3089.leaf`
as an identifier.