-
-
Notifications
You must be signed in to change notification settings - Fork 561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The commonmark renderer's escaping strategy is very aggresive #131
Comments
Previously they actually ran cmark instead of the round-trip version, since there was a bug in setting the ROUNDTRIP variable. Now round trip tests fail! This was unnoticed before. See #131.
Good catch! Round-trip tests were failing even without any change to escaping. It was not noticed because of a bug in the CMakeLists.txt file. I have fixed this issue. |
I'm sympathetic to the general concern, as well. Certainly we should try not to add escapes when it isn't necessary. |
Good to know you're interested! I am currently in the (slow and boring process) of converting around 40 50 pages+ docx files that we used to communicate with our client to commonmark (+ hotdoc), which will let us use standard tools to review proposed changes among many other pretty obvious advantages, anyway once that is done I expect our need for this will become more immediate (and also for reference links to make the roundtrip, but there's already an issue open for that and that should be trivial). In the meantime, we should continue trying to figure out the problematics for each of these symbols, ideally with examples. You say |
I don't understand this comment:
I don't see the behavior you're describing.
Note that you're not getting a break after "getting" here. |
Gimme a sec I'll try to reproduce, I had a case that was blatantly wrong but I made that example on the spot and it would require a specific width (which I did not calculate) to make it obvious |
Would be very nice if you did hang out in some IRC channel btw :P |
Hm I'm sorry, I should learn to read the output of
I thought the new file was at the bottom, consider I said nothing :) |
By the way the bottom one is a quirk of pandoc, I often see that at the bottom of files it translates to markdown |
If you can find a reproducible example, submit it as a bug +++ Mathieu Duponchelle [Jun 01 16 15:14 ]:
|
Will do |
I have
so now all tests pass again. |
Back to the issue of aggressive escaping of
I can think of two general solutions here.
How does (2) sound to you? I think this would yield a fully adequate solution to the problem of overly aggressive escaping for |
I will need to do some context switching and look back at the code to give a more informed opinion, my turn to have busy weeks :) If we decide to tackle an overall refactoring of the commonmark rendering process, I think proving as formally as possible that "unnecessary" escaping of emphasis and link symbols indeed is necessary, and why. My use case for this is to have the renderer be as passthrough as possible, then offer some switches to enable enforcing style conventions. For example, bullet lists should use either '*' or '-' depending on the original text, except if the user specifies that all bullet lists should be styled in the same way. Ideally, I'd like to be able to use cmark as a git pre-commit hook :) |
+++ Mathieu Duponchelle [Jun 24 16 11:10 ]:
I think it's pretty clear that unnecessary escaping will be
On this, see #125. However, there are lots of other things |
The more I think about this, the more I reach the conclusion that the only possible fix would be to adopt a somehow different parsing strategy, with actual tokenization, and a second pass comparable to what clang does with @jgm , what made you give up on the strategy you used in https://github.com/jgm/peg-markdown/blob/master/markdown_parser.leg ? |
@MathieuDuponchelle - I'm not sure what part of peg-markdown you think addressed the particular problem in this issue. peg-markdown constructs an AST just like cmark, so in rendering it to markdown again we'd face the same issues about escaping. peg-markdown is a lot slower than cmark and falls down on some of the pathological tests (exponential behavior). It also has some parsing flaws, though off the top of my head I can't recall what they are. (Obviously, also, it doesn't parse to the CommonMark spec.) Perhaps instead of inserting string literals into the AST, we should insert pairs of offsets into the original source text (multiple pairs would be needed, in general, because the string contents of an AST node aren't always continguous in the source). Then we could peek into the source to determine things like which emphasis character was used, whether a character was escaped, etc. Was that the sort of thing you had in mind? |
Not necessariy, the reason I mentioned peg-markdown was that it seems to use a more "classical" approach to parsing, though now that I read the source a bit more it doesn't seem to have a strictly separated lexing phase. The sort of thing I have in mind would be a first phase where lexical tokens are extracted, from which the original text could still be exactly reproduced, and a grammar parsing phase, similar to what the clang_annotateTokens function does. It would then be possible to modify the AST and dump it again, and as far as I can see only the modified bits might have to be escaped. As I said, that's pretty high level :) |
+++ Mathieu Duponchelle [Nov 22 16 14:37 ]:
The problem is that with Markdown there's very little you Specific suggestions welcome, of course. |
That's for sure, but there shouldn't be a problem with having these tokens constitute a subsequent Text ast node right ? And in the case where they are interpreted as having semantic meaning and constituting another kind of node, then we'd still have the "source mapping" available. |
+++ Mathieu Duponchelle [Nov 22 16 16:14 ]:
Sorry, I'm a bit lost about what precisely you're proposing. |
Understandable, as I haven't been very specific, seems I'm gonna have to be, at the risk of looking dumb, damn :P So the way I would see this working is given this input:
the "tokenized representation" would be (pseudo enum):
The ast would obviously look like:
The process followed by the commonmark writer would be as followed:
If the API was used to update the link destination, there should be logic to update the token list, in that case STR, with a new set of tokens. That way, as I see it, formatting (including whitespace), could be preserved, thus making the commonmark writer appropriate for source to source transformation. In the above example, the link node would be aware of the extent of the tokens that constitute its destination, for example given:
The token list would be:
and the destination's extents would be:
Updating the destination would thus replace these three tokens with new ones. |
Actually the correct approach for rendering back to commonmark, now that I think of it a bit more, would simply be to dump back the entire, potentially updated list of tokens. Bonus points if there's a notion of "dirtiness" on this list of tokens, to allow updating the source positions once after a set of changes. Other example:
With the proposed approach, it wouldn't matter that the list of tokens making up the link destination are not contiguous with the rest of the link tokens, and reference links would be preserved. @jgm, am I making sense here? |
+++ Mathieu Duponchelle [Nov 24 16 16:07 ]:
Actually the correct approach for rendering back to commonmark, now
that I think of it a bit more, would simply to dump back the entire,
potentially updated list of tokens.
You say "rendering back." But the commonmark renderer
doesn't presuppose that the node tree it gets as input came
from commonmark source. It might have been constructed
programatically, or parsed from LaTeX, for example.
|
Hm, that's indeed the use case I've got in mind, I guess it shows :) However as I see it, there would be no obstacle to constructing the entire ast programatically with this approach, I'm not sure what you mean by "parsed from LaTeX" however, is there any intention to have cmark support multiple input formats ? |
+++ Mathieu Duponchelle [Nov 25 16 08:41 ]:
Hm, that's indeed the use case I've got in mind, I guess it shows :)
However as I see it, there would be no obstacle to constructing the
entire ast programatically with this approach, I'm not sure what you
mean by "parsed from LaTeX" however, is there any intention to have
cmark support multiple input formats ?
I have in fact already used cmark with HTML input:
https://github.com/jgm/html2cmark
This goes via lua bindings to cmark. We use an HTML parser
to parse HTML5, then go through the parse tree and construct
a cmark node tree, then render it using the commonmark
renderer.
|
Right, so that's also programmatic building of the AST, and you wouldn't a priori care about using libcmark to do html -> html passthrough I guess ;) |
To be clear, nothing prevents from having a commonmark output of a programatically built AST using the approach I tried to lay out:
|
btw
This could very well work :) As far as I can tell, we could not do a generic "offsets" field in the node structure, but instead would need node-specific code. I think the first step would be to not strip stuff and forget about it during the block parsing phase, like whitespaces, new lines or trailing hashes in heading lines, for example:
The heading node would hold two new fields: heading.begin_marker = (0, 2) -> '# ' heading.content could continue holding 'foo', at least for now. The clear advantage of this approach is that it's not intrusive, and can be implemented incrementally, I'll try that once I get the occasion :) |
I know that this issue is quite old at this point, but is there any chance we could renew attention on it? I recently ended up writing my own "Commonmark Re-renderer" because of the aggressive escaping that cmark performs when outputting a parsed markdown AST back into markdown proper. It would be great if the Better yet, it would be great if there were a built-in way to output back to commonmark with no escaping. Essentially what I'd like is to be able to parse Commonmark with certain options (e.g. smart punctuation and unsafe HTML) and then be able to re-output markdown (with transformations from those options still applied) that can be re-parsed to produce the same AST, regardless of whether or not there are |
This commit should help with
Unfortunately the current node-by-node rendering system doesn't make it easy to improve this further. |
Dang, that's some fast turnaround! I appreciate any improvement, even if there's further to go. Thank you for taking a quick chip at it! |
I am currently evaluating the usefulness of cmark as an indentation / formatting tool.
The width parameter makes cmark already quite suitable, though I would remove all soft breaks and normalize before rendering in case HARDBREAKS wasn't passed, this would avoid such cases:
rendering to
Anyway, my real issue here is that a lot of characters are preemptively escaped, even though they could actually pass through unescaped. I understand that the initial intent was (and rightly so) "better safe than sorry", but I think we should now look back at how to be more clever wrt these characters, starting with the most likely to be used in a common sentence.
Case in point:
!
Currently given that input:
we have this output when running
cat hello.md | ./build/src/cmark -t commonmark
This was introduced by efeb709. The specific reason for this is not stated in the commit message, but when resetting cmark to the previous commit, and running make roundtrip_test , it becomes clear:
Furthermore, some code was obsoleted in this commit (but still lives in the current HEAD):
This should be removed, but helps us understand the issue at hand, and the shortcomings of the current strategy:
The previous code tried to determine whether
!
should be escaped by looking at the next character in the literal stream it was parsed from, bounded by its containing block limits. However here, the next character is NULL, and the issue was triggered by the next character we output, which was '['.The fix that was introduced for this escaped all
!
, end of story. However I think a more appropriate fix would be to escape it "retroactively": the only case I'm aware of where '!' can have a semantic meaning is when it is followed by '[', as such we should, in the case where the renderer ends up outputting it on its own (case CMARK_NODE_LINK), be able to verify that the previous character that was output was a "!", and only then insert an escape character in the strbuf. We should obviously make sure to escape it only if it was contained in a TEXT node, but that shouldn't be too hard.I haven't studied the code further yet, because I thought this was the most glaring issue, however I also strongly dislike the underscore escaping, because it requires function_names to be wrapped in a code inline in order to escape escaping.
We should determine the conditions under which most if not all of these characters need to be escaped, and try to determine if we can come up with a better strategy, in order to make cmark more usable as a source-to-source tool, I hope you folks can help me study each case :)
The text was updated successfully, but these errors were encountered: