-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RTE inserts spurious backslashes into URLs (including room aliases, etc) in code blocks #4809
Comments
It's not just rooms. It's a bunch of situations with underscores. |
Only happens when something gets linkified because we escape URLs in MD, regardless of whether they're in a code block. So actual = expected behaviour outside of code blocks only. |
@lukebarnard1 came to discuss this IRL - we floated the idea of
|
it feels wrong to be making the user figure out whether the URL should be escaped or not. The user really shouldn't have to be making that sort of decision. The "right" solution seems to be that we shouldn't be escaping underscores in URLs in code-blocks. Is the reason for dismissing that that it's Hard? |
I think I need to talk to @lukebarnard1 about this one again when he gets back... |
Basically, yes. The solution that involves parsing the MD and only escaping URLs outside of code blocks seems tractable but fiddly. The alternative being we un-escape stuff in code blocks regardless of what's in them, but that sounds worse. |
so urls like |
okay, so we've broken one thing (URLs in codeblocks) with a somewhat questionable fix for another thing (escaping underscores in random inline URLs). I don't think we should be inventing our own MD dialect by escaping underscores in the first place. If users are in MD mode, we should accept normal commonmark rather than trying to pass it through a custom preprocessor. In commonmark, http://google.com/_thing_ is indeed italics, and users need to know to put it in backticks to avoid it. In practice Joe Public should be using the RTE rather than markdown anyway, leaving MD to the geeks who should be aware of this, imo. Another solution would be to use a better MD dialect than commonmark - github flavoured markdown gets the http://google.com/_thing_ right without needing backticks. But of course we moved away from GFM to commonmark in order to be more consistent cross-platform. So, TL;DR: i think we should remove the escaping across the board, as adding fixes which introduce new problems are probably the wrong fixes in the first place. |
ftr, the thing that we fixed was #3428. In particular it was painful that URLs like |
Does that count as a bug in common mark, given how unlikely it is that the italics would be the intended behaviour? Perhaps you could persuade them to fix? |
@maxgrenderjones |
Which gfm somehow handles sanely fwiw |
So after further discussion this morning, my proposal doesn't work given how common underscores are in our URLs, as Rich points out. I see three possible solutions:
Thoughts? |
I like option 1 regardless of this issue. It kinda sucks that the migration path would be manual and time consuming, if a migration path is desired. GFM sounds like a reasonable choice, as it is reasonably well known, reasonably documented, and mostly okay. The changing spec and lack of updates is a bit worrying though. If web goes with GFM, I'd also consider it reasonable to have the mobile apps not be GFM for the sake of resources. GFM gives more features, but users are (hopefully) unlikely to do any serious markdown in the mobile apps. Just my 2c. |
I'd be for 1. All the _s are an eyesore and matrix look unnecessarily weird & non-standard |
i'm also torn between whether errant escaped underscores in codeblocks is more or less offensive than errant italics in room aliases... |
I wouldn't be against 1, but (a) I'm not sure dots are very much better; (b) migrating sounds very painful. The absence of maintained parsers for GFM makes it something of a non-starter for me. It's certainly a good step in the right direction that there is a spec, but there aren't any parsers I really trust. Tables are nice but feel overspecced for an instant messenger. Strikethrough is I'd be interested to know how much work it would be to patch the CM parsers. I'd hope it wouldn't be too much. Speaking a slightly different variant doesn't sound awful to me. |
i'm seeing yet more errant escaping going on unrelated to backslashes:
ends up getting rendered as:
and I have no idea where that backslash is coming from >:( |
do you mean underscores? In general it's known that the problem isn't limited to underscores; though I must say your example above seems a particular mystery. |
i would really like to fix this by returning to GFM rather than inventing our own dialect of markdown, especially as GFM has a spec now: https://github.github.com/gfm/. does anyone object to actually doing this? |
The reason for this is:
Whereas the only problem I actually know we had with GFM is that it turned -_- into an hr - which empirically github and other GFM editors get right these days. Plus marked didn't have a proper html parser, but I'm not sure we actually need it in the end? @dbkr, wdyt? |
Given GFM spec is built on top of commonmark, I don't think your problems go away entirely. Granted autolinks means that |
untrue: #_freenode_matrix:matrix.org (no escapes) works fine. |
I stand corrected |
(i think it used to get this wrong a few years ago before we switched to commonmark, but happily GFM has been updated since then - presumably because of matrix room aliases ;P) |
Ftr, matrix-org/matrix-react-sdk#575 summaries the reasons we switched to commonmark - now it has a spec, one of those is invalid. I share @richvdh 's concern about lack of parsers (presumably the lack of an android impl is still a concern?) I would say that all of the things we've done to commonmark are all presentation-level things, ie. you can write the same markdown and it will still mean the same thing, so I'd say we are still using standard commonmark over the wire (apart from the whitelisted tags). Also bear in mind that I wrote a bunch of extra stuff on top of gfm, eg. to avoid it wrapping everything in an extra layer of html tags and to disable linkification (see stuff removed in matrix-org/matrix-react-sdk@4d29264) so moving to GFM is probably still not without its foibles. I think ideally I would lean towards fixing commonmark to follow the way GFM does formatting suppression, which seems like a well defined and achievable goal, with the hope of maybe nudging the commonmark spec in that direction? The main thing drawing me to this is that basically the dialect over the wire stays the same. |
riot-android is still using a JS implementation of GFM right now :L |
Isn't this whole issue about an entirely non-presentation-level thing? That if you say moo_foo_ then with commonmark you get moofoo, but with our mutant commonmark, if you say #moo_foo:matrix.org it's magically not mangled? (Unless it's inside a code block of some kind, where it then sprouts backslashes everywhere)? |
Well, I was classing that a a presentation-level thing on the assumption that the semantics are normally well defined for those things and we just have to display them correctly, in the sense that it's probably a safe assumption nobody's saying #moo_foo_:matrix.org and intending for the foo to be underlined. As opposed to something like strikethrough being |
It does look as though GFM has moved on somewhat, and the arguments against it are now weaker than they were before - there are now new JS parser impls, and it's not hard to imagine a java one appearing in less time than it has taken us to fail to move android over to CM. However, as Dave says, GFM is not without its foibles, and needed some customisation too. Also, I am not keen on vacillating between formats every year. If we're seriously considering switching back, I'd find it helpful if somebody could gather (somewhere that isn't github) a document setting out the problems with each format, and the customisations we have made so that we can better understand the background. |
Okay. I've had more of a dig to see if we can fix the actual problem here without shifting to GFM. Findings are:
I think that the breakage caused by introducing backslashes into URLs in code blocks is WAY worse than the fix caused by fairly obscure URLs like http://google.com/_thing_ breaking, so I'm going to roll back matrix-org/matrix-react-sdk#1270 (like I should have months ago). However, I still feel that it's very annoying that commonmark gets this wrong, and there's no easy way to do a partial-parse of the MD to fix it. Patching commonmark to do the right thing looks scary in the extreme, and definitely causes a major fork. So then we're back with me saying "gah, why can't we just use GFM, which gets this right, and has no known disadvantages?". Eitherway, we can close this up by reverting matrix-org/matrix-react-sdk#1270, albeit reopening #4674 |
going to close this as the bug in question is now fixed; suggest future discussion goes on #4674 |
(the |
It shows up in the timeline as
#freenode\_#foo:matrix.org
The text was updated successfully, but these errors were encountered: