Element names #1071

bathos · 2018-02-22T11:56:32Z

Expectation

The regex for detecting element names is incorrect. It detects some chardata sequences as tags and some tags as chardata. It may be that only the latter is problematic, however — I’ll explain what I mean after the examples.

Example 1: this is chardata — <123>
Example 2: these are element tags — <foo.bar></foo.bar>

Result

Example 1: <123>
Example 2: <p>&lt;foo.bar&gt;&lt;/foo.bar&gt;</p>\n

What was attempted

Reproducible with these minimal examples — marked('<123>'), marked('<foo.bar></foo.bar>').

Explanation

The current tag name matching portion of the inline.tag pattern looks like this:

[a-zA-Z0-9\-]+

However, this isn’t the grammar of HTML tag names (it is neither a superset nor a subset of that grammar). Taking into account that marked has already normalized newline sequences, the lexical production for a tag name can be represented with the regex pattern [a-zA-Z][^\t\f\n \/>]*. A tag name begins with an ASCII alphabetic (rather than alphanumeric) character and consumes all following characters until the next character is a horizontal tab, form feed, line feed, space, slash, or greater-than sign (see § 12.2.5.6 and § 12.2.5.8 of the HTML specification for more detail).
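
To make the mismatch concrete, here is a minimal sketch that tests candidate names against both patterns quoted above. The `^`/`$` anchors and the `classify` helper are assumptions added for illustration; they are not part of marked.

```javascript
// Tag-name portion of the current inline.tag pattern, anchored so a
// candidate name can be tested on its own (the anchoring is an assumption).
const currentName = /^[a-zA-Z0-9\-]+$/;

// Tag-name production proposed above, anchored the same way.
const htmlName = /^[a-zA-Z][^\t\f\n \/>]*$/;

// Hypothetical helper: reports which pattern accepts a candidate name.
function classify(name) {
  return { current: currentName.test(name), html: htmlName.test(name) };
}

for (const name of ['123', 'foo.bar', 'div']) {
  console.log(name, classify(name));
}
```

Under these assumptions, 123 is accepted only by the current pattern, foo.bar only by the HTML production, and div by both.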

Although the <123> sequence is detected as a tag, since it ends up being passed through as-is, it seems to be harmless; when the resulting HTML is parsed, the agent will understand that this is chardata rather than a tag. However in the reverse case, marked swaps in character references, so the original markup is lost.

Feder1co5oave · 2018-02-22T16:21:49Z

I'm also working on this because of #1058.
#985

Feder1co5oave · 2018-02-23T19:02:40Z

I see that the HTML specification is very specific about how to parse HTML code, but I don't think we're its target users. Markdown's HTML syntax allows the author to pass some simple HTML code through to the output document as-is, mainly for formatting or for embedding content whose type Markdown itself does not support (e.g. videos). So I'm okay with fixing the rule to match only tags that start with an ASCII letter, but I think that allowing too broad a character set for the tag name might lead to unexpected results for users.

For reference, the commonmark spec states:

A tag name consists of an ASCII letter followed by zero or more ASCII letters, digits, or hyphens (-).

Even this is too restrictive in my opinion, since it doesn't allow, for example, namespace-qualified tags such as SVG or Open Graph tags (I guess).
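
As a rough illustration of the CommonMark rule quoted above, the sketch below encodes it as an anchored regex. The helper name `isCommonmarkTagName` and the sample names (including the colon-qualified `svg:rect`) are hypothetical, chosen only to show what the rule accepts and rejects.

```javascript
// CommonMark tag name as quoted above: an ASCII letter followed by
// zero or more ASCII letters, digits, or hyphens.
const commonmarkName = /^[a-zA-Z][a-zA-Z0-9-]*$/;

// Hypothetical helper for testing a candidate name in isolation.
function isCommonmarkTagName(name) {
  return commonmarkName.test(name);
}

// Sample names are illustrative only.
const candidates = ['div', 'h1', 'my-element', 'svg:rect', 'foo.bar'];
for (const name of candidates) {
  console.log(name, isCommonmarkTagName(name));
}
```

With this reading, hyphenated names pass, while names containing a colon or a period are rejected.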

bathos (Author) · 2018-02-23T19:51:07Z

I’m curious why that would be. There’s a definition of what constitutes a tag name that’s unambiguous and (unlike most aspects of HTML) can be described with a regular grammar. It doesn’t seem to present a challenge in HTML, which is the output of this lib, so I’d suggest it’s odd to depart from HTML’s definition of an HTML construct unless the usefulness of doing so has been proven.

In any case, if using CommonMark’s definition, the current pattern does not match it: it considers <-> and <9> to be tags, which is true under neither CommonMark’s production nor HTML’s.

(I do understand the desire to conform to Commonmark, though I think it’s too bad that while the original definition of markdown, though fuzzy, permits a reading which makes markdown a superset of HTML, Commonmark doesn’t, both here and in a few other spots.)

Feder1co5oave · 2018-02-23T20:15:40Z

I think I already addressed your valid points in my previous message:

> I’m curious why that would be

From my previous message: "allowing too broad a character set for the tag name might lead to unexpected results for the users." (This is my opinion and it is not proven, of course.)

> the current pattern is not a match

From my previous message: "So I'm okay with fixing the rule to match only tags that start with an ascii letter."

Feder1co5oave · 2018-02-23T20:32:12Z

I'd like to add something:

I tend to prefer conforming to CommonMark above all else, except when I think there's a valid reason not to. For example, I explained in #1036 why I hate their specification for emphasis.
I also don't think I will comply fully with their spec for HTML comments, since it is stricter than what actual browsers recognize as comments. For example, tests 597 and 598 show that comments with a double hyphen inside them are not allowed, while in fact they work in most browsers.
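
A minimal sketch of that stricter comment rule follows, assuming it reads as: the text between <!-- and --> must not start with > or ->, must not end with a hyphen, and must not contain a double hyphen. The function name `isStrictComment` is hypothetical, and the inputs are illustrative strings rather than verbatim copies of the spec's numbered tests.

```javascript
// Hypothetical predicate for the stricter comment rule described above.
// Assumption: the comment text may not start with ">" or "->", may not
// end with "-", and may not contain "--".
function isStrictComment(src) {
  // Require both delimiters without letting them overlap.
  if (src.length < 7 || !src.startsWith('<!--') || !src.endsWith('-->')) {
    return false;
  }
  const text = src.slice(4, -3);
  return !text.startsWith('>') &&
         !text.startsWith('->') &&
         !text.endsWith('-') &&
         !text.includes('--');
}

console.log(isStrictComment('<!-- a plain comment -->'));
console.log(isStrictComment('<!-- not a comment -- two hyphens -->'));
console.log(isStrictComment('<!-- foo--->'));
```

Only the first input passes under this rule; the other two are rejected because of the inner double hyphen and the trailing hyphen, which is the kind of strictness the comment above objects to.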

The CommonMark attribute name rule and example 590 also clash with the WHATWG specification for attribute name parsing.

bathos (Author) · 2018-02-23T20:59:44Z

Thanks, the context about goals/past decisions is helpful. (I had no idea about nested emphasis, wtf!?)

Feder1co5oave · 2018-02-23T21:20:28Z

> I had no idea about nested emphasis, wtf!?

I know.

joshbruce (Maintainer) · 2018-02-24T14:52:21Z

Hey @bathos, good conversation here. I'm not sure I follow re Markdown becoming a "superset" of HTML. I look at Markdown as being something separate...almost like XML for people who hate angle brackets. It allows for plain text document definition and typesetting.

Sure, it was originally made to convert to HTML but it's grown way beyond that now. Having said that, there is the astute observation that Marked does output HTML and only HTML. However, I don't think Markdown was ever meant to be a full HTML template methodology - a replacement or superset - it's great for making it easy to write a page of content (this comment). So, I think right now we're really trying to get back to that essence of creating a "plain text rich document description that just so happens to convert to HTML quickly and easily" - see also #1043.

As web components become more prevalent, we may want to revisit this conversation about how we parse HTML tags (not Markdown) but, for now, I don't know how in the weeds we should go, given the other issues we have. :)

bathos (Author) · 2018-02-24T18:38:59Z

@joshbruce that’s all reasonable to me, yep.

What I meant by superset wasn’t regarding high level intentions, but rather properties of the different grammars. That is, unlike CommonMark, the original (but far less formal) description of Markdown happens to describe a language whose grammar can be modeled (internally) as a refinement of an HTML fragment, specifically of all productions that lead to the creation of text nodes.

The reason that property is interesting is that it means one can use a lib like parse5 first and then process unique markdown productions in a second pass with a high degree of confidence in the security* and validity of the output. This is also useful if your intention is to produce an AST rather than directly serialize to HTML. CommonMark’s grammar prevents this technique from being viable.

There are also advantages to not doing that, though. For one, applying markdown productions as a refinement grammar would be harder, not easier, and it is likely a lot faster not to. The CommonMark approach is pretty clever: signals of HTML markup are treated as delimiting escaped blocks. However, one could argue that their approach is sorta heuristic in a way that, to me anyway, has unsettling implications.

This is a big tangent but since you asked what I meant, that’s it haha. Edit: actually you didn’t ask, you just said you weren’t sure what I meant — sorry for taking this issue thread down the scenic route.

* In one sense only; obviously it doesn’t mean the output is secure, but with a true html ast one can e.g. remove script elements and attributes, <plaintext>, etc.
