MSC2191: Markup for mathematical messages #2191
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
# MSC2191: Markup for mathematical messages | ||
|
||
Some people write using an odd language that has strange symbols. No, I'm not | ||
talking about computer programmers; I'm talking about mathematicians. In order | ||
to aid these people in communicating, Matrix should define a standard way of | ||
including mathematical notation in messages. | ||
|
||
This proposal presents a format using LaTeX, in contrast with a [previous | ||
proposal](https://github.com/matrix-org/matrix-doc/pull/1722/) that used | ||
MathML. | ||
|
||
|
||
See also: | ||
|
||
- https://github.com/vector-im/riot-web/issues/1945 | ||
|
||
|
||
## Proposal | ||
|
||
A new attribute `data-mx-maths` will be added for use in `<span>` or `<div>` | ||
elements. Its value will be mathematical notation in LaTeX format. `<span>` | ||
is used for inline math, and `<div>` for display math. The contents of the | ||
`<span>` or `<div>` will be a fallback representation of the desired notation | |
for clients that do not support mathematical display, or that are unable to | ||
render the entire `data-mx-maths` attribute. The fallback representation is | ||
left up to the sending client and could be, for example, an image, or an HTML | ||
approximation, or the raw LaTeX source. When using an image as a fallback, the | ||
sending client should be aware of issues that may arise from the receiving | ||
client using a different background colour. | ||
|
||
Example (with line breaks and indentation added to `formatted_body` for clarity): | ||
|
||
```json | ||
{ | ||
"content": { | ||
"body": "This is an equation: sin(x)=a/b", | ||
"format": "org.matrix.custom.html", | ||
"formatted_body": "This is an equation: | ||
<span data-mx-maths=\"\\sin(x)=\\frac{a}{b}\"> | ||
sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub> | ||
</span>", | ||
"msgtype": "m.text" | ||
}, | ||
"event_id": "$eventid:example.com", | ||
"origin_server_ts": 1234567890, | ||
"sender": "@alice:example.com", | ||
"type": "m.room.message", | ||
"room_id": "!someroom:example.com" | |
} | ||
``` | ||
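As a rough illustration of the sending side, the following Python sketch builds the `content` of such an event. The helper name `build_maths_content` and its parameters are hypothetical and not part of the proposal; the sketch only shows the two layers of escaping involved: HTML attribute escaping for the `data-mx-maths` value, then JSON string escaping, which doubles each backslash.

```python
import html
import json


def build_maths_content(prefix: str, latex: str, fallback_html: str, fallback_text: str) -> dict:
    """Build m.room.message content with one inline formula (hypothetical helper)."""
    # Escape &, <, > and quotes so the LaTeX survives as an HTML attribute value.
    attr = html.escape(latex, quote=True)
    formatted = (
        f'{html.escape(prefix)}<span data-mx-maths="{attr}">{fallback_html}</span>'
    )
    return {
        "msgtype": "m.text",
        "body": prefix + fallback_text,
        "format": "org.matrix.custom.html",
        "formatted_body": formatted,
    }


content = build_maths_content(
    "This is an equation: ",
    r"\sin(x)=\frac{a}{b}",
    "sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub>",  # fallback chosen by the sender
    "sin(x)=a/b",
)
# json.dumps doubles each backslash, giving \\sin and \\frac on the wire.
print(json.dumps(content, indent=2))
```

Backslashes need no special treatment at the HTML layer; only the JSON encoder has to escape them, which a standard JSON library does automatically.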
Could you show an example of the event JSON that a sending client would use if they were including a fallback for the receiver?

I'm not sure that I understand what you're asking. The example given already includes the fallback. The

Oh right! I had entirely missed that the body of the
||
|
||
|
||
## Other solutions | ||
|
||
[MSC1722](https://github.com/matrix-org/matrix-doc/pull/1722/) proposes using | |
MathML as the format for transporting mathematical notation. It also summarizes | |
some other solutions in its "Other Solutions" section. | |
|
||
In comparison with MathML, LaTeX has several advantages and disadvantages. | ||
|
||
The first advantage, which is quite obvious, is that LaTeX is much less verbose | ||
and more readable than MathML. In many cases, the LaTeX code is a suitable | ||
fallback for the rendered notation. | ||
|
||
LaTeX is a suitable input method for many people, and so converting from a | ||
user's input to the message format would be a no-op. | ||
|
||
However, balanced against these advantages, LaTeX has several disadvantages as | |
a message format. Some of these are covered in the "Potential issues" and | |
"Security considerations" sections. | |
|
||
|
||
## Potential issues | ||
|
||
### "LaTeX" as a format is poorly defined | ||
|
||
|
||
There are several extensions to LaTeX that are commonly used, such as | ||
AMS-LaTeX. It is unclear which extensions should be supported, and which | ||
should not be supported. Different LaTeX-rendering libraries support different | ||
sets of commands. | ||
|
||
This proposal suggests that the receiving client should render the LaTeX | ||
version if possible, but if it contains unsupported commands, then it should | ||
display the fallback. Thus, it is up to the receiving client to decide what | ||
commands it will support, rather than dictating what commands must be | ||
supported. This comes at a cost of possible inconsistency between clients, but | ||
is somewhat mitigated by the use of a fallback. Clients should, however, aim | ||
to support, at minimum, the basic LaTeX2e maths commands and the TeX maths | ||
commands, with the possible exception of commands that could be security risks | ||
(see below). | ||
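A minimal sketch of that receiving-side decision, in Python, might look as follows. The names `MathsExtractor` and `choose_rendering`, and the `can_render` callback, are assumptions made for illustration; how a client decides whether it supports a given formula is left to its rendering library. For brevity the sketch keeps only the text of the fallback, whereas a real client would keep the (sanitized) fallback HTML.

```python
from html.parser import HTMLParser


class MathsExtractor(HTMLParser):
    """Collect (latex, display, fallback_text) for each data-mx-maths element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = []      # finished formulas
        self._open = None    # formula currently being read, if any
        self._depth = 0      # nesting depth of same-named tags inside it

    def handle_starttag(self, tag, attrs):
        if self._open is not None:
            if tag == self._open["tag"]:
                self._depth += 1
            return
        latex = dict(attrs).get("data-mx-maths")
        if tag in ("span", "div") and latex is not None:
            self._open = {"tag": tag, "latex": latex, "text": []}
            self._depth = 0

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if self._open is None or tag != self._open["tag"]:
            return
        if self._depth:
            self._depth -= 1
            return
        text = "".join(self._open["text"]).strip()
        # <div> means display maths, <span> means inline maths.
        self.found.append((self._open["latex"], tag == "div", text))
        self._open = None


def choose_rendering(formatted_body, can_render):
    """Return ("latex", source, display) or ("fallback", text, display) per formula."""
    parser = MathsExtractor()
    parser.feed(formatted_body)
    parser.close()
    return [
        ("latex", latex, display) if can_render(latex) else ("fallback", text, display)
        for latex, display, text in parser.found
    ]


body = ('This is an equation: <span data-mx-maths="\\sin(x)=\\frac{a}{b}"> '
        'sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub> </span>')
print(choose_rendering(body, can_render=lambda latex: "\\frac" not in latex))
```

Passing the support check in as a callback keeps the choice of supported commands with the receiving client, as the paragraph above describes.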
|
||
To improve compatibility, the sender's client may warn the sender if they are | ||
using a command that comes from another package, such as AMS-LaTeX. | ||
|
||
### Lack of libraries for displaying mathematics | ||
|
||
See the corresponding section in [MSC1722](https://github.com/matrix-org/matrix-spec-proposals/pull/1722/files#diff-4a271297299040dbfa622bfc6d2aab02f9bc82be0b28b2a92ce30b14c5621f94R148-R164). | |
|
||
|
||
## Security considerations | ||
I've done little with LaTeX, but it does a lot more than just math symbols -- it is a whole typesetting system. This sounds confusing to be embedding into an HTML property, especially since you have to escape backslashes (which are used a lot in LaTeX). I was curious how Wikipedia handled formulas, since they have to render untrusted input as well. tl;dr is that you need to install and use a whole heap of software to do this correctly, including texvc (which uses OCaml), LaTeX itself, etc.

I assume that most implementations will use MathJax or similar which as far as I know just implements a subset of LaTeX specifically geared towards math. It might be worth explicitly recommending the use of a narrowly-scoped LaTeX rendering library.

Yes, this proposal is solely about the math part of TeX/LaTeX, and not about any of the other document processing bits. I can try to clarify it. I do recommend against running the

I think the proposal is clear, I'm just concerned that it would be an easy vector to add security vulnerabilities to applications.
Looking through the implementations it doesn't seem they attempt to sanitize input or anything -- maybe this is OK though since MathJax and KaTeX only handle math anyway? Looking at the MathJax docs it does seem to allow e.g. defining macros by default. This might just need a big warning in the spec PR that says to be careful, but it seems a bit weird that the spec is very explicit about what HTML tags/attributes to support but here we just shrug and don't give real advice.

Yes, it looks like MathJax and KaTeX both allow defining macros, but they limit recursion, which is the main issue that macros can cause.
Yeah. Part of the reason here is that there is a huge number of LaTeX commands -- mostly for specific symbols. There are also extensions that define their own commands that some renderers may want to support. Another reason for being lax with specifying what to support is the tooling, or lack thereof. While there are a lot of HTML sanitizers that allow you to specify exactly what's allowed, there is a general lack of LaTeX sanitizers. Instead, the rendering libraries generally provide options for what unsafe things to allow, if any. So if we tried to tell clients what commands to support and what commands not to support, clients authors might need to write their own LaTeX parsers, which would not be pleasant.
||
|
||
LaTeX is a [Turing complete programming | ||
language](https://web.archive.org/web/20160110102145/http://en.literateprograms.org/Turing_machine_simulator_%28LaTeX%29); | ||
it is possible to write a LaTeX document that contains an infinite loop, or | ||
that will require large amounts of memory. While it may be fun to write a | ||
[LaTeX file that can control a Mars | ||
Rover](https://wiki.haskell.org/wikiupload/8/85/TMR-Issue13.pdf#chapter.2), it | ||
is not desirable for a mathematical formula embedded in a Matrix message to | |
control a Mars Rover. Clients should take precautions when rendering LaTeX. | ||
Clients that use a rendering library should only use one that can process the | ||
LaTeX safely. | ||
|
||
Clients should not render mathematics by calling the `latex` executable without | |
proper sandboxing, as the `latex` executable was not written to handle | |
untrusted input (see, for example, <https://hovav.net/ucsd/dist/texhack.pdf>, | |
<https://0day.work/hacking-with-latex/>, and | |
<https://hovav.net/ucsd/dist/tex-login.pdf>). Some LaTeX rendering libraries | |
are better suited for processing untrusted input. | |
|
||
Certain commands, such as [those that can create | ||
macros](https://katex.org/docs/supported#macros), are potentially dangerous; | ||
clients should either decline to process those commands, or should take care to | ||
ensure that they are handled in safe ways (such as by limiting recursion). In | ||
general, LaTeX commands should be filtered by allowing known-good commands | ||
rather than forbidding known-bad commands. Some LaTeX libraries may have | ||
options for doing this. | ||
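For clients whose rendering library has no such option, an allow-list check could be sketched in Python as below. The contents of `ALLOWED_COMMANDS` and `ALLOWED_SYMBOLS`, the `max_length` limit, and the function names are all hypothetical; the regular expression is a heuristic scan for control sequences, not a full TeX tokenizer, so it is a pre-filter rather than a substitute for a safe renderer.

```python
import re

# Hypothetical allow-list; a real client would derive this from the
# commands its rendering library is known to handle safely.
ALLOWED_COMMANDS = frozenset(
    {"sin", "cos", "frac", "sqrt", "alpha", "beta", "sum", "int", "left", "right"}
)
# Single-character control symbols such as \{ \} \, and \\ (line break).
ALLOWED_SYMBOLS = frozenset({"{", "}", ",", ";", "!", " ", "\\"})
ALLOWED = ALLOWED_COMMANDS | ALLOWED_SYMBOLS

# A control sequence is a backslash followed by letters, or by one other character.
_COMMAND = re.compile(r"\\([A-Za-z]+|.)", re.DOTALL)


def disallowed_commands(latex: str) -> list:
    """Return the sorted names of commands that are not on the allow-list."""
    return sorted({m.group(1) for m in _COMMAND.finditer(latex)} - ALLOWED)


def is_allowed(latex: str, max_length: int = 2000) -> bool:
    """Accept only short inputs that use allow-listed commands exclusively."""
    return len(latex) <= max_length and not disallowed_commands(latex)


print(is_allowed(r"\sin(x)=\frac{a}{b}"))    # every command is on the list
print(disallowed_commands(r"\def\x{\x}\x"))  # macro definition is rejected
```

Because unknown commands are rejected by default, macro-defining commands such as `\def` never reach the renderer unless a client deliberately adds them to the list. Formulas that fail the check would be shown using their fallback.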
|
||
In general, LaTeX places a heavy burden on client authors to ensure that it is | ||
processed safely. Some LaTeX rendering libraries provide security advice, for | ||
example, <https://github.com/KaTeX/KaTeX/blob/main/docs/security.md>. | ||
|
||
|
||
## Conclusion | ||
|
||
Math(s) is hard, but LaTeX makes it easier to write mathematical notation. | ||
However, using LaTeX as a format for including mathematics in Matrix messages | ||
has some serious downsides. Nevertheless, if clients handle the LaTeX | ||
carefully, or rely on the fallback representation, the concerns can be | ||
addressed. | ||
|
I would argue that the vast majority of users will never send mathematical expressions to each other. Is the complexity really worth it? LaTeX is trivial neither to parse nor to render.
Also, if Matrix is going to accept mathematical notations, what about other domains, like chemistry, physics, … all the myriad of other notations?

Clients aren't required to render or parse the notation, which is why a fallback is present. Several clients do wish to represent mathematical expressions to users though, and having a consistent and standardized way to do so is important.
MSCs for other notations are equally accepted, provided they have similar fallback mechanics.