Support heading attribute block (especially ID and classes) #522

lo48576 · 2021-03-26T19:20:22Z

TODO

Support ID ({#id})
- At this stage, {.class} is simply ignored.
Enable attribute block support only when the specific parser option is enabled
- I'll use the name Options::ENABLE_HEADING_ATTRIBUTES.
Support classes ({.class1 .class2})

~~While this is WIP branch, feel free to give me advice.~~
Now this branch is ready to merge.

Summary

By this patch, section headings will be able to have ID and classes.
For example, # H1 {#id1 .heading} would be converted to <h1 id="id1" class="heading">H1</h1>.

This is a breaking change: Tag::Heading will have multiple fields (ID and classes) instead of single HeadingLevel.

This solves #424.

lo48576 · 2021-03-26T19:34:39Z

Should classes support be merged at once, or be separate pull request?
Personally I wanted to merge id-only support as MVP, however, adding a classes field to Tag::Heading(_, _) is a breaking change, so I'm going to support both ID and classes at once.

Another option is, split attributes to a dedicated HeadingAttributes type, and make Tag::Heading(HeadingLevel, HeadingAttributes).
I like this way more than the current implementation.
However, I see other variants (especially Link and Image) don't use this kind of indirect types, so I'm going to use Heading(HeadingLevel, Option<CowStr<'a>>, Vec<CowStr<'a>>) for consistency.

Any ideas?

lo48576 · 2021-03-31T20:16:02Z

Hmm... Maybe attribute block parsing should be common dedicated function?

lo48576 · 2021-03-31T21:02:56Z

Now attribute block content is parsed by FirstPass::parse_inside_attribute_block().
This method is reusable for other extensions with similar syntax such as [text](uri){.pdf} or ![alt](youtube_uri){.youtube}, although they seem neither planned nor suggested to be implemented.

lo48576 · 2021-04-02T21:01:48Z

Added a test for attribute blocks for ATX heading with trailing #s.

d4h0 · 2021-05-07T08:13:32Z

Being able to add attributes, would be great (especially, if all Markdown elements are supported).

Maybe other attributes, besides 'class' and 'id', can be supported as well?

Something like:

![alt text][image-that-isnt-essential.png]{#id .class1 aria-hidden="true" style="{color: red}"}

This would make Markdown more powerful, and could help to make content more accessible (e.g. via 'aria' attributes).

lo48576 · 2021-05-07T11:16:24Z

Yeah it is possible, but doing so requires public API to be changed breakingly (both from the current master, and from the code changed by this pull request), and deciding syntax would be harder since they might be less widely used.

For example, what should # text {style="color:red; color: green" style="color:blue"} generate? There is no standard for these syntax, and there are multiple options:

<h1 style="color:red; color: green" style="color:blue">text</h1>: as is.
<h1 style="color:red; color: green">text</h1>: the first one wins.
<h1 style="color:blue">text</h1>: the last one wins.
<h1 style="color:red; color:green; color: blue">text</h1>: merged syntactically.
<h1 style="color: blue">text</h1>: merged semantically.

Discussion will be necessary to decide what the processor should do, and I've decided to avoid these problems.

{#id} and {.class} are relatively widely used extensions, and it is easy to handle multiple ones.
(ID is unlikely to be specified more than once, and classes are safe to be merged simply with duplicates.)

So, I'm not going to support more attributes in this pull request.
They are absolutely possible, but should be another issue or pull request.

lo48576 · 2021-05-07T11:30:57Z

And additionally, this PR only supports heading as the first step (because I strongly want this).
Attributes for other elements such as images and links is possible, but they should be another issue or PR.
I named the extension Options::ENABLE_HEADING_ATTRIBUTES since I intentionally limit the target to heading.

There would be something more to think if we want to support the syntax for all (or many) elements.
(For example, can a list item or the whole list have attributes? What happens if attributes duplicates, such as ![alt1](uri1){alt="alt2" src="uri2"}?)
I don't know if there is any consensus around such "wider" use of attributes, so I won't implement them (until there is de jure / de facto standard or I strongly want them).

d4h0 · 2021-05-12T12:19:10Z

(Somehow the GitHub notification ended up in my SPAM folder...)

Thanks for your response, @lo48576

but doing so requires public API to be changed breakingly

Sure, but this PR has already a breaking change. Maybe this could be activated via a feature flag? Not sure, if that would be an ideal solution.

For example, what should # text {style="color:red; color: green" style="color:blue"} generate?

The best option seems, to just pass this through as is, and let the browser handle it (which is ignoring the first attribute, I believe). Another option would be, to forbid duplicate attributes (however, I use duplicate attributes sometimes during debugging, so I'd prefer the former).

For example, can a list item or the whole list have attributes?

I think both should be supported. Something like:

{#ul-id}
- foo {.ul-item}
- bar {.ul-item}
- baz {.ul-item}

What happens if attributes duplicates, such as {alt="alt2" src="uri2"}?

Personally, I'd like this to fail.

I don't know if there is any consensus around such "wider" use of attributes, so I won't implement them (until there is de jure / de facto standard or I strongly want them).

Alright, got it.

hoijui · 2021-06-05T04:56:13Z

Are there issues with this PR, or is it just about the breaking change in the AST?

Side-issue

Is there something like a version for the AST, or a checksum, or tags, or key-value pairs?
That would allow much better error checking, error messages and so on. It would mean, if a software dev using this library updates to a newer version of it, which also changes the AST, he could get a compile-time warning about the AST having changed, instead of - possibly months later - getting strange error reports from users that this and that is not working.
Or one would get a meaningful error message, when expecting the AST to be of a certain manner, but that requires enabling a default-off extension.

lo48576 · 2021-06-05T06:01:21Z

@hoijui Sorry I don't understand what you are worrying about, but if you are thinking about the possibility of silent breakage of downstream crates (i.e. crates depending on pulldown-cmark), it may not happen thanks to semantic versioning.

This pull request has breaking changes around the AST, but this repo already seemed to have breaking changes since the last release (0.8.0) before this pull request starts, so I believe this pull request does not break the downstreams silently (as the next release would be 0.9.0).

hoijui · 2021-06-06T13:18:48Z

Indeed, semantic versioning could be there for the rescue! :-)
thanks!

Non-the less, meta-data annotated AST still makes sense, but it is of course a question of trade of between how much benefit it brings, and the added complexity; which is not trivial.

I will try your branch of this PR (https://github.com/lo48576/pulldown-cmark/tree/feature/heading-attrs) as a dependency in our project, so it will get some real world testing from us too.

kaj · 2021-11-11T21:48:06Z

Great feature, I'd love to see this merged!

One thing: I think it would be nice to not require white-space separating the id and classes. E.g. ## Header {#foo.bar.baz} should result in <h2 id="foo" class="bar baz">Header</h2>. Currently, it seems to yield <h2 id="foo.bar.baz">Header</h2>, which is a bit surprising (and not valid html).

lo48576 · 2021-11-12T12:20:32Z

@kaj I don't find the information that id="foo.bar.baz" is invalid... It seems valid for HTML5, and even for HTML 4.

Note: Using characters except ASCII letters, digits, '_', '-' and '.' may cause compatibility problems, as they weren't allowed in HTML 4. Though this restriction has been lifted in HTML5, an ID should start with a letter for compatibility.

id - HTML: HyperText Markup Language | MDN

Such IDs won't be recommended, but they are certainly valid and the converter shouldn't forbid users use such IDs.

kaj · 2021-11-12T13:47:19Z

@kaj I don't find the information that id="foo.bar.baz" is invalid... It seems valid for HTML5, and even for HTML 4.

True, but it is not the result I would expect. And class="foo#bar.baz" really is invalid in html.

lo48576 · 2021-11-12T17:40:34Z

True, but it is not the result I would expect.

Ok, I understand.
However I don't think it should require special syntax, since there might be no consensus about escaping in attribute blocks.
If someone want to specify id="foo.bar" class="baz", how they can write it? { #foo\.bar .baz }, { #foo%2Ebar .baz }, or they cannot?
I don't want to implement the first two as they are not widely accepted syntax (at least for now).

And class="foo#bar.baz" really is invalid in html.

Yes this is invalid, but I'm not sure the markdown processor should work as an validator in this case.
For example, current pulldown-cmark converts <p class='foo#bar'></p> (in markdown) into <p class='foo#bar'></p> (in HTML).
This is invalid output (due to invalid input), but this would be better than (implicit and potentially non-obvious) auto-sanitizing, since users can intuitively find and recognize what was wrong in their input.
Simpler conversions introduces less advanced rules (which may not have consensus) and more intuitiveness of the conversion and its result.

I think the same applies to #foo.bar or .foo#bar. #foo.bar is valid and .foo#bar is invalid (for HTML), but invalidness of .foo#bar won't be a reason to forbid writing #foo.bar as id="foo.bar", and also .foo#bar shouldn't be automatically converted in non-obvious way.

d4h0 · 2021-11-12T19:41:27Z

I've never seen anything that would interpret #foo.bar.baz as something other than id="foo" class="bar baz", so I strongly agree with @kaj.

I'm not sure why it would be bad to not support the discouraged practice of using certain special characters in identifiers. #foo.bar.baz (or #foo .bar .baz) does already prevent some things that are possible with regular HTML (for example, you can't use an ID with a space in it), so this extension already differs from the spec.

Dots ('.') are extremely uncommon to use in ID or CSS class names because this character has a special meaning in CSS (it starts a CSS class name selector), which has to be escaped if used as part of an identifier.

If someone really has to use an ID or CSS class name that contains a dot, then they still can do this via raw HTML. I don't think, it would be a good idea to decrease the user experience – and the expected output – for everybody just because of these extremely unlikely corner cases.

I think a good compromise, if you really want to support dots in identifiers, would be to support escape sequences: #foo\.bar\.baz.abc.def (resulting in id="foo.bar.baz" class="abc def").

I don't want to implement the first two as they are not widely accepted syntax

foo\.bar is exactly how you would have to use this identifier in your CSS file, so I'm not sure why you believe this isn't a widely accepted syntax.

lo48576 · 2021-11-12T20:28:22Z

If someone really has to use an ID or CSS class name that contains a dot, then they still can do this via raw HTML.

You are right.

foo\.bar is exactly how you would have to use this identifier in your CSS file, so I'm not sure why you believe this isn't a widely accepted syntax.

I mean foo\.bar might be not widely accepted as an escape character for attribute block in markdown, not in other format.
I don't want to make this implementation a new dialect, as there are alreday too many dialects in markdown and many of us suffer from the differences...

I've never seen anything that would interpret #foo.bar.baz as something other than id="foo" class="bar baz"

Thank you for the info. I'm not aware of them since I've only seen people separating attributes with spaces.
Could you give me some examples of such implementations? I'll note them in a code comment as the rationale of the chosen spec.
I tried googling but cannot find markdown examples using {.foo.bar}-style attributes without whitespace separators... Users are well-behaved and they hide implementation details 😭

d4h0 · 2021-11-13T13:56:28Z

So the #id.class syntax come from the CSS selector syntax, #id.class simply means id="id" class="class". If someone wants an ID with id.class they have to write #id\.class in their CSS.

I feel that alone should be enough reason to do it this way.

I haven't curated a list of tools that use the #id.class syntax, but after a quick search I came up with the following:

PugJS (an HTML template engine) supports this syntax

Here is an online playground. If you type in a#foo.bar.baz you get <a class="bar baz" id="foo"></a>.

Emmet is a tool to expand string like a#foo.bar.baz to HTML. a#foo.bar.baz expands to <a href="" id="foo" class="bar baz"></a>.
remarkjs/remark-directive seems to follow the above (it's a markdown extension that adds block syntax that supports attributes with id and class shortcuts). See these tests (but whitespace is optionally allowed between attributes).

Unfortunately, Pandoc does what this PR currently does. And markdown-it-attrs does the same...

I still think it's a mistake to follow them and to not do what most other tools that support the #id.class syntax do (and there are many more than the tools I've mentioned above).

As I wrote above, the #id.class syntax come from the CSS selector syntax, and there #id.class means a element with a 'id' attribute of "id" and a 'class' attribute of "class", and not a element with an 'id' of "id.class".

I mean foo\.bar might be not widely accepted as an escape character for attribute block in markdown, not in other format.

Backslashes are the way to escape things in Markdown/CommonMark (and almost everywhere else), so I don't get what the problem is here.

lo48576 · 2021-11-13T17:16:31Z

I don't think markdown's extension syntax should be only borrowed from other language's syntax.
In other words, each of #foo and .bar would be obviously derived from CSS selector, but I don't think this is enough reason to accept compound #foo.bar as #foo .bar. CSS is stylesheet language, emmet is DSL for editors, and they are not markdown.
(As an example, struct S { ... } of Rust would have been derived from C and C++'s syntax and keyword, but it is not enough reason to require semicolon at the end (stuct S { ... };) or to accept [[attribute]] style (struct [[nonexhaustive]] S { ... })... This is an example for unrelated topic, but do you get what I mean?)

I said "I don't want to make this implementation a new dialect" in the previous post.
My intension is, the extension should do minimal tasks required, in order to avoid making decision that is possibly without consensus.
Implementations in the wild treats {.foo.bar} differently, and there seems no (even de jure) standard of
this extension syntax. This is the proof there are no consensus and people interprets this in non-uniform way.
In such case, I prefer to let the extension add less rules.

"Space-separated attributes are expanded. That's all."-rule is quite simple, and users can quite easily imagine how it will work.

However, "attributes are written in original syntax similar to CSS, but it can be also space-separated."-rule will harder to imagine correctly, and the gap between the users' imagination and the actual implementation may burden users.
For example, how {#id[class="bar"]} should be interpreted? Will it be id="id[foo="bar"]", id="id", id="id" class="bar", or some other?
If [class="bar"] is ignored, why? It is valid CSS selector and specifying a class, which should be expected to be supported.
Or, if id[class="bar"] is used as an ID, why? There is #id.bar case where the selectors are separeted, and #id[class="bar"] case where selectors are not separated!

More rules for "convenience" makes the specification more difficult to understand, and it may make "extremely unlikely corner cases" more likely.
Markdown processors cannot abort on an error, so every input (even when it seems obviously invalid) can be converted to some output. Processors should define rules to convert "invalid"-ish input to some other (typically HTML) form.
The more a processor adds complex "convenient" rule, the more the dialect and processor became incompatible with others.
For this reason, I think it is more important to keep additional conversion rules few and simple as possible, rather than to add convenient rules that typically works but adds more incompatibility and non-intiuitive corner cases.

lo48576 · 2021-11-13T17:38:08Z

I found useful site just now, and compared how {#id.class1.class2} are converted: babelmark3 | Compare Markdown Implementations.

It seems like conversion to id="id.class1.class2" is major enough (cebe (extra), markdig, maruku, and pandoc).

d4h0 · 2021-11-13T22:31:56Z

Alight. I'm not really interested in spending too much time on discussing this. I think both options have advantages and disadvantages. The current version will confuse some users because the output will be unexpected to them, but they will figure it out quickly on their own. If the documentation mentions the requirement for whitespace clearly, it will be even less of a problem.

lo48576 · 2021-11-14T08:21:33Z

Thank you, added a doc comments about that.

lo48576 · 2021-11-14T08:23:21Z

##[warning]An image label with the label ubuntu-16.04 does not exist.
,##[error]The remote provider was unable to process the request.

It seems like this is the problem on the CI settings side, not the code side.

lo48576 · 2021-11-18T21:41:45Z

Added test cases for usual usage, where inlines are used in headings.
No code changes except for tests.

marcusklaas · 2021-11-21T20:40:31Z

src/parse.rs

@@ -79,7 +79,7 @@ pub(crate) enum ItemBody {
    TaskListMarker(bool), // true for checked

    Rule,
-    Heading(HeadingLevel), // heading level
+    Heading(HeadingLevel, HeadingIndex), // heading level


Would it be faster for headings without id or classes if we'd wrap the HeadingIndex in an option? Then we may be able to skip allocations and indirections for that case. It does add another branch, but I suspect it would pay off.

Naively using Option<HeadingIndex> makes ItemBody bigger (24 bytes on x64).

https://github.com/raphlinus/pulldown-cmark/blob/d412d7cb652707f8c980c59dca836a6ad4e49f2f/src/parse.rs#L1495-L1500

It can be easily avoided by using NonZeroUsize as the underlying type of HeadingIndex, instead of usize.
I'm going to modify HeadingIndex to use NonZeroUsize as I think it is worth doing, but tell me if you think the overhead of using NonZeroUsize is unacceptable.

src/firstpass.rs

marcusklaas · 2021-11-21T21:19:38Z

specs/heading_attrs.txt

+}</h2>
+````````````````````````````````
+
+Note that escaping only opening braces does not prevent the braces from


I know this would be a pain to parse properly, but this is mildly surprising. Do other parsers have the same behavior?

babelmark3 | Compare Markdown Implementations

No parsers seems to be able to recognize multi-line attribute blocks.
I don't know how other implementations not supported by babelmark3 would handle such cases, but I think it would be safe to assume no users write attribute blocks in multiline form (as processors cannot handle such attributes).

Woops, I meant escaping opening braces!

babelmark3 | Compare Markdown Implementations

Opening braces can be escaped: php-markdown-extra 1.9.1, pandoc 2.16.2

Escape doesn't work (same as current impl.): markdig (advanced) 0.26.0.0, maruku 0.7.3.beta1

There seems no common behavior...

It would be natural if \{ ... } is not parsed as an attribute block.
And it'll not so hard to support such escapes, since { itself is prohibited in attribute blocks and we don't need to care multiple escapes in various contexts.

I have no strong preference for how this should be parsed. Both of "attribute blocks are trimmed first, before parsing" and "escaped character doesn't behave specially" sounds natural to me.

Ah, ok, I remembered why I avoided supporting escaping.

As I wrote at #522 (comment), it is hard to decide how # h1 {.foo\{.bar} should be converted.

Users' expectation with escaping support would be: content=h1, class=foo{.bar.

The second { is escaped so should be treated as a plain character, and then the first { is opening of the block!

However, with escaping support, this would be converted to: content=h1 {.foo{.bar}, class=None.

This is because { cannot appear in attribute blocks.

Current implementation: content=h1 {.foo\, class=bar.

This is because {.bar} is literally trimmed before processing escapes.

I don't like this ambiguity or unexpectedness when escaping opening braces is allowed, so I decided to trim the block before processing any escapes.
But I'm not confident this is better, so I'll change implementation if you have some preference.

marcusklaas · 2021-11-21T21:20:37Z

specs/heading_attrs.txt

+````````````````````````````````.example
+# H1__ {#my__id1}
+## H2** {#my**id2}
+### H3 ` {.code` }


This is an excellent test case. I do wonder if we are in line with other parsers though?

babelmark3 | Compare Markdown Implementations

Same:

markdig 0.26.0.0 and php-markdown-extra 1.9.0 converts to the same result as this PR.

Similar:

maruku 0.7.3.beta1 seems to try to do the same, but fails with handling of the unclosed inlines (unclosed __baz and __qux are converted to <strong>...).

Different:

cebe/extra 1.2.0, pandoc 2.16.2, and kramdown 1.2.0 gave up on # bar __baz {#my__id2} case.

cebe/extra converts __baz {#my__ into <strong>...</strong>.

pandoc seems to fail to convert inlines, so __baz {#my__ is written as is.

kramdown even fails to make it a heading.

There seems no "common" way to handle this, but there are roughly two patterns: "blocks are trimmed first" and "give up when ambiguous".
(I'm a bit surprised that there is no "inlines takes precedence" pattern.)

I think it is better to use simple rule "attribute blocks are trimmed first", since:

Class names such as .foo__bar would be widely used and I want them to be writable as { .foo__bar }.

I don't think it is good if programatically added {.foo__bar} makes the parsing "ambiguous".

"Give up" patterns have variations, and it is less clear what (should) happens when "ambiguous".

specs/heading_attrs.txt

src/firstpass.rs

marcusklaas · 2021-11-21T21:45:23Z

src/tree.rs

+impl Tree<Item> {
+    /// Truncates the preceding siblings to the given end position,
+    /// and returns the new current node.
+    pub(crate) fn truncate_siblings(&mut self, end_byte_ix: usize) -> Option<TreeIndex> {


This is the only part of the PR I don't fully understand. Why do we need this? I understand we need to truncate the header texts, but it's all one node right? It doesn't get split into multiple nodes until the second parsing phase. Why can't we just adjust the header end?

It is because:

Truncating prefix of the header before parsing seems easy, but truncating suffix would be non-trivial.

Parsing by FirstPass::parse_line seems to include some preparation for the second pass.

Consider the input is # **Foo {#bar** }.
This is ATX heading so FirstPass::parse_atx_heading is called.
At line 1012 on firstpass.rs (master branch), text[ix..] is **Foo {#bar** }, and the parser tries to parse and get a line.

https://github.com/raphlinus/pulldown-cmark/blob/d412d7cb652707f8c980c59dca836a6ad4e49f2f/src/firstpass.rs#L1012

This is, (1) truncating prefix is easy but truncating suffix won't.

Then, FirstPass::parse_line actually does more job than getting a line.
It creates a tree for the input **Foo {#bar** }\n... (half-open range) as below:

MaybeEmphasis (1st, can open)

MaybeEmphasis (2nd, can open)

Text (Foo {#bar)

MaybeEmphasis (1st, can close)

MaybeEmphasis (2nd, can close)

Text ( })

Second pass utilizes this MaybeEmphasis's to create real tree structure.
This is, (2) parsing by FirstPass::parse_line seems to include some preparation for the second pass.

In my current implementation, I use parse_line same way as before.
And after that parsing, modify the (relatively) flat tree into truncated form, such as below:

MaybeEmphasis (1st, can open)

MaybeEmphasis (2nd, can open)

Text (Foo)

This is what I wanted to say by #522 (comment):

For example, this new method is used when turning __/foo {.bar/__/} into __/foo, since {.bar__} suffix should be removed as an attribute block.

Ideally, the suffix truncation should be done before parse_line, but it seems parse_line only accepts half-open byte slice.
Making it accept closed range of the bytes might be large diff, and I tried to avoid making such big change that would affect every part of the parsing.

Should I modify parse_line to accept not only start: usize but also end: Option<usize>?

Thank you for the detailed explanation, I had not considered that parse_line already adds these tree elements! It makes total sense why you do these things now.

However, I do think that it would be cleaner to scan for the attribute block first (if the option is set!) and only after that calling the parse_line function on the prefix. For two reasons: we wouldn't waste effort parsing the header as if it were normal markdown, and truncating the header text would become trivial.

We would indeed have to adjust parse_line it seems, and adding an optional end the way you suggest looks like a good way of doing it! 👍🏽

I tried rewriting parse_atx_heading() without truncate_siblings() (by trimming the attribute block before parsing the line) and succeeded.

However, it seems hard to do the same for parse_setext_heading() since it transforms an already-parsed paragraph into a heading.
At the time lines are being parsed, the parser cannot know whether they are setext heading content.
So it will be necessary to modify already-parsed tree in that case and I'll continue to use truncate_siblings() for setext heading.

marcusklaas · 2021-11-21T21:47:37Z

Very good progress on this pull request, and a special kudos for the excellent test suite you have created!

I went through all the technical details this time, and left a few more comments. Most of them are small, with the only real significant thing to consider is wrapping the HeadingIndex in an Option.

This is mainly changes on types. Parsers for ID fragments and classes are not yet implemented.

lo48576 · 2021-11-22T12:03:11Z

Diffs since the previous push: https://github.com/raphlinus/pulldown-cmark/compare/4ce6d11dff48f4549c585400ac5a137269103a6b..ab1c47abc00acbe8ec9311e671aff316ab95c776

Rebased onto new master.
Add more tests and docs for whitespaces after the heading content, after the attribute blocks, and inside attribute blocks.
Fix inconsistency for "# H1\t{}" case.
- Previously it was converted to "<h1>H1</h1>", but now it is converted to "<h1>H1\t</h1>".
- "# H1\t" is converted to "<h1>H1\t</h1>", so the trimming became consistent now.
Use Option<HeadingIndex> rather than mandatory HeadingIndex.
- Thanks to NonZeroUsize backend, type size of ItemBody doesn't change.
Split changes to test generator into another PR Sort auto-generated tests modules by spec names #546.

lo48576 · 2021-11-23T18:48:00Z

Fix inconsistent close tags in expected HTML outputs in the spec file.
- Test suite parses HTML semantically and normalizes it, so inconsistent close tags (such as <h1>foo</h2>) doesn't cause test failures and I overlooked mistakes.
Fix wrong wording.
- If an ATX heading have closing #s, s/attribute blocks are placed before that/the attribute block should be placed after them/.
Add a test case with closing #s after {...}: #### non-attribute-block {#id4} ####.

src/firstpass.rs

marcusklaas · 2021-12-01T11:56:41Z

Sorry for the long delay on my next pass here, Yoshioka. It's been a very busy week at work.

The good news is that this that besides the one comment I have on the parse_line issue, this PR is in excellent shape! If you could touch up that final point, I will merge it :-)

Thank you so much for your hard work and your patience. This will be a much awaited contribution to pulldown!

lo48576 · 2021-12-02T14:43:02Z

Diff from the previous push: https://github.com/raphlinus/pulldown-cmark/compare/966aefce25ecfa55fd43e3dbaf0f90b478ff9073..a17ae72a0a758e7da237c9c05ee378ef21a0eeba

In ATX headings, trim attribute blocks before parsing the line.
Fix inconsistent behaviour for trailing backslash in headings.
Reduce if-else-if chain and nest level in Tree::<Item>::truncate_siblings() by early return.

About trailing backslash in headings

Comparisons: https://babelmark.github.io/?text=%23+Foo+%5C%0ABar+%5C%0A--

commonmark-compliant parsers seem to preserve the trailing stray backslash: for example # Foo \ into <h1>Foo \</h1>.
However, pulldown-cmark (at the master branch) currently trims them.
I updated parsing of ATX headings to avoid use of truncate_siblings() and the stray backslash problem is fixed for ATX headings with attribute blocks as a side effect, so I also fixed it also for ATX headings without attribute blocks and setext headings.

marcusklaas · 2021-12-06T08:58:09Z

Just ran some benchmarks as a final check, and it seems this PR does not change the performance of the library in a measurable way. Let's merge this thing! 🚀

Thanks again for your hard work, patience and substantial contribution to the library @lo48576!

lo48576 force-pushed the feature/heading-attrs branch 2 times, most recently from 3f61b99 to a13d1b1 Compare March 30, 2021 16:00

lo48576 marked this pull request as ready for review March 30, 2021 16:12

lo48576 marked this pull request as draft March 31, 2021 20:16

lo48576 marked this pull request as ready for review March 31, 2021 20:56

lo48576 force-pushed the feature/heading-attrs branch from dbd6836 to 9e0e45d Compare April 2, 2021 20:57

hoijui mentioned this pull request Jun 5, 2021

Support anchor link targets becheran/mlc#31

Open