Replies: 9 comments
-
I see that the HTML specification on how to parse HTML code is very specific, but I think we're not its target users. Markdown's HTML syntax allows the author to pass some simple HTML code through to the output document as-is, mainly for the purpose of formatting or embedding content whose type is not supported by Markdown itself (e.g. videos). So I'm okay with fixing the rule to match only tags that start with an ASCII letter, but I see that allowing too broad a character set for the tag name might lead to unexpected results for users. For reference, the CommonMark spec states:
This is even too restrictive in my opinion, since it doesn't allow, for example, namespace-qualified tags, such as SVG or Open Graph tags (I guess).
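To make the restriction concrete, here is a minimal sketch. It assumes the CommonMark tag-name rule is "an ASCII letter followed by zero or more ASCII letters, digits, or hyphens"; the regex and the helper name `isCommonMarkTagName` are illustrative only and are not taken from marked.

```javascript
// Assumed CommonMark tag-name rule: ASCII letter, then letters, digits, or hyphens.
const commonMarkTagName = /^[A-Za-z][A-Za-z0-9-]*$/;

// Hypothetical helper, for illustration only.
function isCommonMarkTagName(name) {
  return commonMarkTagName.test(name);
}

console.log(isCommonMarkTagName('video'));      // true
console.log(isCommonMarkTagName('my-element')); // true
console.log(isCommonMarkTagName('svg:rect'));   // false: the colon is rejected
console.log(isCommonMarkTagName('123'));        // false: must start with a letter
```

Under that assumption, any name containing a colon or a dot falls outside the rule, which is the restriction being objected to here.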
-
I’m curious why that would be. There’s a definition for what constitutes a tag name that’s unambiguous and (unlike most aspects of HTML) can be described with a regular grammar. It doesn’t seem to present a challenge in HTML, which is the output of this lib, so I’d suggest it’s odd to depart from HTML’s definition of an HTML construct unless the usefulness of doing so has been proven. In any case, if using CommonMark’s definition, the current pattern is not a match: it considers
(I do understand the desire to conform to CommonMark, though I think it’s too bad that while the original definition of Markdown, though fuzzy, permits a reading which makes Markdown a superset of HTML, CommonMark doesn’t, both here and in a few other spots.)
-
I think I've just answered your valid points in my previous message:
(this is my opinion and it is not proven, of course)
-
I'd like to add something: I tend to prefer conforming to CommonMark above all else, except when I think there's a valid reason not to. For example, I explained in #1036 why I hate their specification for emphasis. The CommonMark attribute name rule and example 590 clash with the WHATWG specification for attribute name parsing.
-
Thanks, the context about goals/past decisions is helpful. (I had no idea about nested emphasis, wtf!?)
-
I know.
-
Hey @bathos, good conversation here. I'm not sure I follow re Markdown becoming a "superset" of HTML. I look at Markdown as being something separate...almost like XML for people who hate angle brackets. It allows for plain text document definition and typesetting. Sure, it was originally made to convert to HTML but it's grown way beyond that now. Having said that, there is the astute observation that Marked does output HTML and only HTML. However, I don't think Markdown was ever meant to be a full HTML template methodology - a replacement or superset - it's great for making it easy to write a page of content (this comment). So, I think right now we're really trying to get back to that essence of creating a "plain text rich document description that just so happens to convert to HTML quickly and easily" - see also #1043. As web components become more prevalent, we may want to revisit this conversation about how we parse HTML tags (not Markdown) but, for now, I don't know how in the weeds we should go, given the other issues we have. :)
-
@joshbruce that’s all reasonable to me, yep. What I meant by superset wasn’t regarding high level intentions, but rather properties of the different grammars. That is, unlike CommonMark, the original (but far less formal) description of Markdown happens to describe a language whose grammar can be modeled (internally) as a refinement of an HTML fragment, specifically of all productions that lead to the creation of text nodes.

The reason that property is interesting is that it means one can use a lib like parse5 first and then process unique markdown productions in a second pass with a high degree of confidence in the security* and validity of the output. This is also useful if your intention is to produce an AST rather than directly serialize to HTML. CommonMark’s grammar prevents this technique from being viable.

There are advantages also to not doing that, though. For one, applying markdown productions as a refinement grammar would be harder, not easier, and it is likely a lot faster not to. The CommonMark approach is pretty clever: signals of HTML markup are treated as delimiting escaped blocks. However one could argue that their approach is sorta heuristic in a way that, to me anyway, has unsettling implications.

This is a big tangent but since you asked what I meant, that’s it haha. Edit: actually you didn’t ask, you just said you weren’t sure what I meant. Sorry for taking this issue thread down the scenic route.

* In one sense only; obviously it doesn’t mean the output is secure, but with a true html ast one can e.g. remove script elements and attributes,
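The two-pass idea can be sketched roughly as follows. This is an illustration only: the tree is built by hand and merely assumes a node shape like the one parse5's default tree adapter produces (`nodeName`, `attrs`, `childNodes`, and `value` on `#text` nodes). `sanitize` and `mapText` are hypothetical helpers, dropping attributes whose names start with `on` is an assumption about what one might want to remove, and the `toUpperCase` call is just a placeholder for whatever Markdown-specific pass would run on text nodes.

```javascript
// Hand-built fragment; the node shape is assumed to resemble parse5's default tree.
const fragment = {
  nodeName: '#document-fragment',
  childNodes: [
    {
      nodeName: 'p',
      attrs: [{ name: 'onclick', value: 'steal()' }, { name: 'class', value: 'note' }],
      childNodes: [{ nodeName: '#text', value: 'some *markdown* text' }],
    },
    { nodeName: 'script', attrs: [], childNodes: [{ nodeName: '#text', value: 'steal()' }] },
  ],
};

// Hypothetical first pass: drop script elements and on* attributes (an assumption).
function sanitize(node) {
  if (!node.childNodes) return node;
  node.childNodes = node.childNodes.filter((child) => child.nodeName !== 'script');
  for (const child of node.childNodes) {
    if (child.attrs) child.attrs = child.attrs.filter((a) => !a.name.startsWith('on'));
    sanitize(child);
  }
  return node;
}

// Hypothetical second pass: apply a transform to text nodes only.
function mapText(node, transform) {
  if (node.nodeName === '#text') node.value = transform(node.value);
  for (const child of node.childNodes || []) mapText(child, transform);
  return node;
}

sanitize(fragment);
mapText(fragment, (text) => text.toUpperCase()); // placeholder for Markdown productions
console.log(JSON.stringify(fragment, null, 2));
```

Because the second pass only ever sees text nodes, element and attribute structure is decided entirely by the HTML parser, which is the property described above.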
-
Expectation
The regex for detecting element names is incorrect. It detects some chardata sequences as tags and some tags as chardata. It may be that only the latter is problematic, however — I’ll explain what I mean after the examples.
Example 1: this is chardata —
<123>
Example 2: these are element tags —
<foo.bar></foo.bar>
Result
Example 1:
<123>
Example 2:
<p>&lt;foo.bar&gt;&lt;/foo.bar&gt;</p>\n
What was attempted
Reproducible with these minimal examples: `marked('<123>')`, `marked('<foo.bar></foo.bar>')`.
Explanation
The current tag name matching portion of the `inline.tag` pattern looks like this: `[a-zA-Z0-9\-]+`
However, this isn’t the grammar of HTML tag names (and it is neither a superset nor a subset of that grammar). Taking into account that marked has already normalized newline sequences, the lexical production for a tag name can be represented with the regex pattern `[a-zA-Z][^\t\f\n \/>]*`. A tag name begins with an ASCII alphabetic (rather than alphanumeric) char and consumes all characters following it until horizontal tab, form feed, line feed, space, slash or greater than is the next character (see § 12.2.5.6 and § 12.2.5.8 for more detail).

Although the `<123>` sequence is detected as a tag, since it ends up being passed through as-is, it seems to be harmless; when the resulting HTML is parsed, the agent will understand that this is chardata rather than a tag. However, in the reverse case, marked swaps in character references, so the original markup is lost.
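The difference between the two patterns can be checked directly. This is a minimal sketch that uses only the two regexes quoted in this issue, anchored with `^` for the comparison; the names `currentName`, `htmlName`, and `tagNameOf` are hypothetical and not part of marked.

```javascript
// Tag-name portion currently used by inline.tag (as quoted in the issue).
const currentName = /^[a-zA-Z0-9\-]+/;
// Tag-name production proposed in the issue, following the HTML tokenizer.
const htmlName = /^[a-zA-Z][^\t\f\n \/>]*/;

// Hypothetical helper: returns the tag name a pattern reads after "<",
// or null if the pattern does not accept the text as a start tag.
function tagNameOf(pattern, source) {
  if (source[0] !== '<') return null;
  const match = pattern.exec(source.slice(1));
  if (!match) return null;
  const rest = source.slice(1 + match[0].length);
  // A start tag must continue with whitespace, "/" or ">" after the name.
  return /^[\t\f\n \/>]/.test(rest) ? match[0] : null;
}

console.log(tagNameOf(currentName, '<123>'));     // "123" (chardata seen as a tag)
console.log(tagNameOf(htmlName, '<123>'));        // null
console.log(tagNameOf(currentName, '<foo.bar>')); // null (tag seen as chardata)
console.log(tagNameOf(htmlName, '<foo.bar>'));    // "foo.bar"
```

The second and fourth results match how an HTML tokenizer would classify the two examples; the first and third show the mismatch described above.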