Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support heading attribute block (especially ID and classes) #522

Merged
merged 3 commits into from
Dec 6, 2021

Conversation

lo48576
Copy link
Contributor

@lo48576 lo48576 commented Mar 26, 2021

TODO

  • Support ID ({#id})
    • At this stage, {.class} is simply ignored.
  • Enable attribute block support only when the specific parser option is enabled
    • I'll use the name Options::ENABLE_HEADING_ATTRIBUTES.
  • Support classes ({.class1 .class2})

While this is WIP branch, feel free to give me advice.
Now this branch is ready to merge.

Summary

By this patch, section headings will be able to have ID and classes.
For example, # H1 {#id1 .heading} would be converted to <h1 id="id1" class="heading">H1</h1>.

This is a breaking change: Tag::Heading will have multiple fields (ID and classes) instead of single HeadingLevel.

This solves #424.

@lo48576
Copy link
Contributor Author

lo48576 commented Mar 26, 2021

Should classes support be merged at once, or be separate pull request?
Personally I wanted to merge id-only support as MVP, however, adding a classes field to Tag::Heading(_, _) is a breaking change, so I'm going to support both ID and classes at once.

Another option is, split attributes to a dedicated HeadingAttributes type, and make Tag::Heading(HeadingLevel, HeadingAttributes).
I like this way more than the current implementation.
However, I see other variants (especially Link and Image) don't use this kind of indirect types, so I'm going to use Heading(HeadingLevel, Option<CowStr<'a>>, Vec<CowStr<'a>>) for consistency.

Any ideas?

@lo48576 lo48576 force-pushed the feature/heading-attrs branch 2 times, most recently from 3f61b99 to a13d1b1 Compare March 30, 2021 16:00
@lo48576 lo48576 marked this pull request as ready for review March 30, 2021 16:12
@lo48576
Copy link
Contributor Author

lo48576 commented Mar 31, 2021

Hmm... Maybe attribute block parsing should be common dedicated function?

@lo48576 lo48576 marked this pull request as draft March 31, 2021 20:16
@lo48576 lo48576 marked this pull request as ready for review March 31, 2021 20:56
@lo48576
Copy link
Contributor Author

lo48576 commented Mar 31, 2021

Now attribute block content is parsed by FirstPass::parse_inside_attribute_block().
This method is reusable for other extensions with similar syntax such as [text](uri){.pdf} or ![alt](youtube_uri){.youtube}, although they seem neither planned nor suggested to be implemented.

@lo48576
Copy link
Contributor Author

lo48576 commented Apr 2, 2021

Added a test for attribute blocks for ATX heading with trailing #s.

@d4h0
Copy link

d4h0 commented May 7, 2021

Being able to add attributes, would be great (especially, if all Markdown elements are supported).

Maybe other attributes, besides 'class' and 'id', can be supported as well?

Something like:

![alt text][image-that-isnt-essential.png]{#id .class1 aria-hidden="true" style="{color: red}"}

This would make Markdown more powerful, and could help to make content more accessible (e.g. via 'aria' attributes).

@lo48576
Copy link
Contributor Author

lo48576 commented May 7, 2021

Yeah it is possible, but doing so requires public API to be changed breakingly (both from the current master, and from the code changed by this pull request), and deciding syntax would be harder since they might be less widely used.

For example, what should # text {style="color:red; color: green" style="color:blue"} generate? There is no standard for these syntax, and there are multiple options:

  • <h1 style="color:red; color: green" style="color:blue">text</h1>: as is.
  • <h1 style="color:red; color: green">text</h1>: the first one wins.
  • <h1 style="color:blue">text</h1>: the last one wins.
  • <h1 style="color:red; color:green; color: blue">text</h1>: merged syntactically.
  • <h1 style="color: blue">text</h1>: merged semantically.

Discussion will be necessary to decide what the processor should do, and I've decided to avoid these problems.

{#id} and {.class} are relatively widely used extensions, and it is easy to handle multiple ones.
(ID is unlikely to be specified more than once, and classes are safe to be merged simply with duplicates.)

So, I'm not going to support more attributes in this pull request.
They are absolutely possible, but should be another issue or pull request.

@lo48576
Copy link
Contributor Author

lo48576 commented May 7, 2021

And additionally, this PR only supports heading as the first step (because I strongly want this).
Attributes for other elements such as images and links is possible, but they should be another issue or PR.
I named the extension Options::ENABLE_HEADING_ATTRIBUTES since I intentionally limit the target to heading.

There would be something more to think if we want to support the syntax for all (or many) elements.
(For example, can a list item or the whole list have attributes? What happens if attributes duplicates, such as ![alt1](uri1){alt="alt2" src="uri2"}?)
I don't know if there is any consensus around such "wider" use of attributes, so I won't implement them (until there is de jure / de facto standard or I strongly want them).

@d4h0
Copy link

d4h0 commented May 12, 2021

(Somehow the GitHub notification ended up in my SPAM folder...)

Thanks for your response, @lo48576

but doing so requires public API to be changed breakingly

Sure, but this PR has already a breaking change. Maybe this could be activated via a feature flag? Not sure, if that would be an ideal solution.

For example, what should # text {style="color:red; color: green" style="color:blue"} generate?

The best option seems, to just pass this through as is, and let the browser handle it (which is ignoring the first attribute, I believe). Another option would be, to forbid duplicate attributes (however, I use duplicate attributes sometimes during debugging, so I'd prefer the former).

For example, can a list item or the whole list have attributes?

I think both should be supported. Something like:

{#ul-id}
- foo {.ul-item}
- bar {.ul-item}
- baz {.ul-item}

What happens if attributes duplicates, such as alt1{alt="alt2" src="uri2"}?

Personally, I'd like this to fail.

I don't know if there is any consensus around such "wider" use of attributes, so I won't implement them (until there is de jure / de facto standard or I strongly want them).

Alright, got it.

@hoijui
Copy link

hoijui commented Jun 5, 2021

Are there issues with this PR, or is it just about the breaking change in the AST?

Side-issue

Is there something like a version for the AST, or a checksum, or tags, or key-value pairs?
That would allow much better error checking, error messages and so on. It would mean, if a software dev using this library updates to a newer version of it, which also changes the AST, he could get a compile-time warning about the AST having changed, instead of - possibly months later - getting strange error reports from users that this and that is not working.
Or one would get a meaningful error message, when expecting the AST to be of a certain manner, but that requires enabling a default-off extension.

@lo48576
Copy link
Contributor Author

lo48576 commented Jun 5, 2021

@hoijui Sorry I don't understand what you are worrying about, but if you are thinking about the possibility of silent breakage of downstream crates (i.e. crates depending on pulldown-cmark), it may not happen thanks to semantic versioning.

This pull request has breaking changes around the AST, but this repo already seemed to have breaking changes since the last release (0.8.0) before this pull request starts, so I believe this pull request does not break the downstreams silently (as the next release would be 0.9.0).

@hoijui
Copy link

hoijui commented Jun 6, 2021

Indeed, semantic versioning could be there for the rescue! :-)
thanks!

Non-the less, meta-data annotated AST still makes sense, but it is of course a question of trade of between how much benefit it brings, and the added complexity; which is not trivial.

I will try your branch of this PR (https://github.com/lo48576/pulldown-cmark/tree/feature/heading-attrs) as a dependency in our project, so it will get some real world testing from us too.

@kaj
Copy link

kaj commented Nov 11, 2021

Great feature, I'd love to see this merged!

One thing: I think it would be nice to not require white-space separating the id and classes. E.g. ## Header {#foo.bar.baz} should result in <h2 id="foo" class="bar baz">Header</h2>. Currently, it seems to yield <h2 id="foo.bar.baz">Header</h2>, which is a bit surprising (and not valid html).

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 12, 2021

@kaj I don't find the information that id="foo.bar.baz" is invalid... It seems valid for HTML5, and even for HTML 4.

Note: Using characters except ASCII letters, digits, '_', '-' and '.' may cause compatibility problems, as they weren't allowed in HTML 4. Though this restriction has been lifted in HTML5, an ID should start with a letter for compatibility.

id - HTML: HyperText Markup Language | MDN

Such IDs won't be recommended, but they are certainly valid and the converter shouldn't forbid users use such IDs.

@kaj
Copy link

kaj commented Nov 12, 2021

@kaj I don't find the information that id="foo.bar.baz" is invalid... It seems valid for HTML5, and even for HTML 4.

True, but it is not the result I would expect. And class="foo#bar.baz" really is invalid in html.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 12, 2021

True, but it is not the result I would expect.

Ok, I understand.
However I don't think it should require special syntax, since there might be no consensus about escaping in attribute blocks.
If someone want to specify id="foo.bar" class="baz", how they can write it? { #foo\.bar .baz }, { #foo%2Ebar .baz }, or they cannot?
I don't want to implement the first two as they are not widely accepted syntax (at least for now).

And class="foo#bar.baz" really is invalid in html.

Yes this is invalid, but I'm not sure the markdown processor should work as an validator in this case.
For example, current pulldown-cmark converts <p class='foo#bar'></p> (in markdown) into <p class='foo#bar'></p> (in HTML).
This is invalid output (due to invalid input), but this would be better than (implicit and potentially non-obvious) auto-sanitizing, since users can intuitively find and recognize what was wrong in their input.
Simpler conversions introduces less advanced rules (which may not have consensus) and more intuitiveness of the conversion and its result.

I think the same applies to #foo.bar or .foo#bar. #foo.bar is valid and .foo#bar is invalid (for HTML), but invalidness of .foo#bar won't be a reason to forbid writing #foo.bar as id="foo.bar", and also .foo#bar shouldn't be automatically converted in non-obvious way.

@d4h0
Copy link

d4h0 commented Nov 12, 2021

I've never seen anything that would interpret #foo.bar.baz as something other than id="foo" class="bar baz", so I strongly agree with @kaj.

I'm not sure why it would be bad to not support the discouraged practice of using certain special characters in identifiers. #foo.bar.baz (or #foo .bar .baz) does already prevent some things that are possible with regular HTML (for example, you can't use an ID with a space in it), so this extension already differs from the spec.

Dots ('.') are extremely uncommon to use in ID or CSS class names because this character has a special meaning in CSS (it starts a CSS class name selector), which has to be escaped if used as part of an identifier.

If someone really has to use an ID or CSS class name that contains a dot, then they still can do this via raw HTML. I don't think, it would be a good idea to decrease the user experience – and the expected output – for everybody just because of these extremely unlikely corner cases.

I think a good compromise, if you really want to support dots in identifiers, would be to support escape sequences: #foo\.bar\.baz.abc.def (resulting in id="foo.bar.baz" class="abc def").

I don't want to implement the first two as they are not widely accepted syntax

foo\.bar is exactly how you would have to use this identifier in your CSS file, so I'm not sure why you believe this isn't a widely accepted syntax.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 12, 2021

If someone really has to use an ID or CSS class name that contains a dot, then they still can do this via raw HTML.

You are right.

foo\.bar is exactly how you would have to use this identifier in your CSS file, so I'm not sure why you believe this isn't a widely accepted syntax.

I mean foo\.bar might be not widely accepted as an escape character for attribute block in markdown, not in other format.
I don't want to make this implementation a new dialect, as there are alreday too many dialects in markdown and many of us suffer from the differences...

I've never seen anything that would interpret #foo.bar.baz as something other than id="foo" class="bar baz"

Thank you for the info. I'm not aware of them since I've only seen people separating attributes with spaces.
Could you give me some examples of such implementations? I'll note them in a code comment as the rationale of the chosen spec.
I tried googling but cannot find markdown examples using {.foo.bar}-style attributes without whitespace separators... Users are well-behaved and they hide implementation details 😭

@d4h0
Copy link

d4h0 commented Nov 13, 2021

So the #id.class syntax come from the CSS selector syntax, #id.class simply means id="id" class="class". If someone wants an ID with id.class they have to write #id\.class in their CSS.

I feel that alone should be enough reason to do it this way.

I haven't curated a list of tools that use the #id.class syntax, but after a quick search I came up with the following:

  1. PugJS (an HTML template engine) supports this syntax

Here is an online playground. If you type in a#foo.bar.baz you get <a class="bar baz" id="foo"></a>.

  1. Emmet is a tool to expand string like a#foo.bar.baz to HTML. a#foo.bar.baz expands to <a href="" id="foo" class="bar baz"></a>.

  2. remarkjs/remark-directive seems to follow the above (it's a markdown extension that adds block syntax that supports attributes with id and class shortcuts). See these tests (but whitespace is optionally allowed between attributes).

Unfortunately, Pandoc does what this PR currently does. And markdown-it-attrs does the same...

I still think it's a mistake to follow them and to not do what most other tools that support the #id.class syntax do (and there are many more than the tools I've mentioned above).

As I wrote above, the #id.class syntax come from the CSS selector syntax, and there #id.class means a element with a 'id' attribute of "id" and a 'class' attribute of "class", and not a element with an 'id' of "id.class".

I mean foo\.bar might be not widely accepted as an escape character for attribute block in markdown, not in other format.

Backslashes are the way to escape things in Markdown/CommonMark (and almost everywhere else), so I don't get what the problem is here.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 13, 2021

I don't think markdown's extension syntax should be only borrowed from other language's syntax.
In other words, each of #foo and .bar would be obviously derived from CSS selector, but I don't think this is enough reason to accept compound #foo.bar as #foo .bar. CSS is stylesheet language, emmet is DSL for editors, and they are not markdown.
(As an example, struct S { ... } of Rust would have been derived from C and C++'s syntax and keyword, but it is not enough reason to require semicolon at the end (stuct S { ... };) or to accept [[attribute]] style (struct [[nonexhaustive]] S { ... })... This is an example for unrelated topic, but do you get what I mean?)

I said "I don't want to make this implementation a new dialect" in the previous post.
My intension is, the extension should do minimal tasks required, in order to avoid making decision that is possibly without consensus.
Implementations in the wild treats {.foo.bar} differently, and there seems no (even de jure) standard of
this extension syntax. This is the proof there are no consensus and people interprets this in non-uniform way.
In such case, I prefer to let the extension add less rules.

"Space-separated attributes are expanded. That's all."-rule is quite simple, and users can quite easily imagine how it will work.

However, "attributes are written in original syntax similar to CSS, but it can be also space-separated."-rule will harder to imagine correctly, and the gap between the users' imagination and the actual implementation may burden users.
For example, how {#id[class="bar"]} should be interpreted? Will it be id="id[foo=&quot;bar&quot;]", id="id", id="id" class="bar", or some other?
If [class="bar"] is ignored, why? It is valid CSS selector and specifying a class, which should be expected to be supported.
Or, if id[class="bar"] is used as an ID, why? There is #id.bar case where the selectors are separeted, and #id[class="bar"] case where selectors are not separated!

More rules for "convenience" makes the specification more difficult to understand, and it may make "extremely unlikely corner cases" more likely.
Markdown processors cannot abort on an error, so every input (even when it seems obviously invalid) can be converted to some output. Processors should define rules to convert "invalid"-ish input to some other (typically HTML) form.
The more a processor adds complex "convenient" rule, the more the dialect and processor became incompatible with others.
For this reason, I think it is more important to keep additional conversion rules few and simple as possible, rather than to add convenient rules that typically works but adds more incompatibility and non-intiuitive corner cases.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 13, 2021

I found useful site just now, and compared how {#id.class1.class2} are converted: babelmark3 | Compare Markdown Implementations.

It seems like conversion to id="id.class1.class2" is major enough (cebe (extra), markdig, maruku, and pandoc).

@d4h0
Copy link

d4h0 commented Nov 13, 2021

Alight. I'm not really interested in spending too much time on discussing this. I think both options have advantages and disadvantages. The current version will confuse some users because the output will be unexpected to them, but they will figure it out quickly on their own. If the documentation mentions the requirement for whitespace clearly, it will be even less of a problem.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 14, 2021

Thank you, added a doc comments about that.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 14, 2021

##[warning]An image label with the label ubuntu-16.04 does not exist.
,##[error]The remote provider was unable to process the request.

It seems like this is the problem on the CI settings side, not the code side.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 18, 2021

Added test cases for usual usage, where inlines are used in headings.
No code changes except for tests.

src/parse.rs Outdated
@@ -79,7 +79,7 @@ pub(crate) enum ItemBody {
TaskListMarker(bool), // true for checked

Rule,
Heading(HeadingLevel), // heading level
Heading(HeadingLevel, HeadingIndex), // heading level
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be faster for headings without id or classes if we'd wrap the HeadingIndex in an option? Then we may be able to skip allocations and indirections for that case. It does add another branch, but I suspect it would pay off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively using Option<HeadingIndex> makes ItemBody bigger (24 bytes on x64).

https://github.com/raphlinus/pulldown-cmark/blob/d412d7cb652707f8c980c59dca836a6ad4e49f2f/src/parse.rs#L1495-L1500

It can be easily avoided by using NonZeroUsize as the underlying type of HeadingIndex, instead of usize.
I'm going to modify HeadingIndex to use NonZeroUsize as I think it is worth doing, but tell me if you think the overhead of using NonZeroUsize is unacceptable.

src/firstpass.rs Outdated Show resolved Hide resolved
src/firstpass.rs Outdated Show resolved Hide resolved
src/firstpass.rs Outdated Show resolved Hide resolved
}</h2>
````````````````````````````````

Note that escaping only opening braces does not prevent the braces from
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this would be a pain to parse properly, but this is mildly surprising. Do other parsers have the same behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

babelmark3 | Compare Markdown Implementations

No parsers seems to be able to recognize multi-line attribute blocks.
I don't know how other implementations not supported by babelmark3 would handle such cases, but I think it would be safe to assume no users write attribute blocks in multiline form (as processors cannot handle such attributes).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, I meant escaping opening braces!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

babelmark3 | Compare Markdown Implementations

  • Opening braces can be escaped: php-markdown-extra 1.9.1, pandoc 2.16.2
  • Escape doesn't work (same as current impl.): markdig (advanced) 0.26.0.0, maruku 0.7.3.beta1

There seems no common behavior...

It would be natural if \{ ... } is not parsed as an attribute block.
And it'll not so hard to support such escapes, since { itself is prohibited in attribute blocks and we don't need to care multiple escapes in various contexts.

I have no strong preference for how this should be parsed. Both of "attribute blocks are trimmed first, before parsing" and "escaped character doesn't behave specially" sounds natural to me.

Copy link
Contributor Author

@lo48576 lo48576 Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, I remembered why I avoided supporting escaping.

As I wrote at #522 (comment), it is hard to decide how # h1 {.foo\{.bar} should be converted.

  • Users' expectation with escaping support would be: content=h1, class=foo{.bar.
    • The second { is escaped so should be treated as a plain character, and then the first { is opening of the block!
  • However, with escaping support, this would be converted to: content=h1 {.foo{.bar}, class=None.
    • This is because { cannot appear in attribute blocks.
  • Current implementation: content=h1 {.foo\, class=bar.
    • This is because {.bar} is literally trimmed before processing escapes.

I don't like this ambiguity or unexpectedness when escaping opening braces is allowed, so I decided to trim the block before processing any escapes.
But I'm not confident this is better, so I'll change implementation if you have some preference.

````````````````````````````````.example
# H1__ {#my__id1}
## H2** {#my**id2}
### H3 ` {.code` }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent test case. I do wonder if we are in line with other parsers though?

Copy link
Contributor Author

@lo48576 lo48576 Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

babelmark3 | Compare Markdown Implementations

  • Same:
    • markdig 0.26.0.0 and php-markdown-extra 1.9.0 converts to the same result as this PR.
  • Similar:
    • maruku 0.7.3.beta1 seems to try to do the same, but fails with handling of the unclosed inlines (unclosed __baz and __qux are converted to <strong>...).
  • Different:
    • cebe/extra 1.2.0, pandoc 2.16.2, and kramdown 1.2.0 gave up on # bar __baz {#my__id2} case.
    • cebe/extra converts __baz {#my__ into <strong>...</strong>.
    • pandoc seems to fail to convert inlines, so __baz {#my__ is written as is.
    • kramdown even fails to make it a heading.

There seems no "common" way to handle this, but there are roughly two patterns: "blocks are trimmed first" and "give up when ambiguous".
(I'm a bit surprised that there is no "inlines takes precedence" pattern.)

I think it is better to use simple rule "attribute blocks are trimmed first", since:

  • Class names such as .foo__bar would be widely used and I want them to be writable as { .foo__bar }.
    • I don't think it is good if programatically added {.foo__bar} makes the parsing "ambiguous".
  • "Give up" patterns have variations, and it is less clear what (should) happens when "ambiguous".

src/firstpass.rs Outdated Show resolved Hide resolved
src/tree.rs Outdated
impl Tree<Item> {
/// Truncates the preceding siblings to the given end position,
/// and returns the new current node.
pub(crate) fn truncate_siblings(&mut self, end_byte_ix: usize) -> Option<TreeIndex> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only part of the PR I don't fully understand. Why do we need this? I understand we need to truncate the header texts, but it's all one node right? It doesn't get split into multiple nodes until the second parsing phase. Why can't we just adjust the header end?

Copy link
Contributor Author

@lo48576 lo48576 Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because:

  1. Truncating prefix of the header before parsing seems easy, but truncating suffix would be non-trivial.
  2. Parsing by FirstPass::parse_line seems to include some preparation for the second pass.

Consider the input is # **Foo {#bar** }.
This is ATX heading so FirstPass::parse_atx_heading is called.
At line 1012 on firstpass.rs (master branch), text[ix..] is **Foo {#bar** }, and the parser tries to parse and get a line.

https://github.com/raphlinus/pulldown-cmark/blob/d412d7cb652707f8c980c59dca836a6ad4e49f2f/src/firstpass.rs#L1012

This is, (1) truncating prefix is easy but truncating suffix won't.

Then, FirstPass::parse_line actually does more job than getting a line.
It creates a tree for the input **Foo {#bar** }\n... (half-open range) as below:

  • MaybeEmphasis (1st, can open)
  • MaybeEmphasis (2nd, can open)
  • Text (Foo {#bar)
  • MaybeEmphasis (1st, can close)
  • MaybeEmphasis (2nd, can close)
  • Text ( })

Second pass utilizes this MaybeEmphasis's to create real tree structure.
This is, (2) parsing by FirstPass::parse_line seems to include some preparation for the second pass.


In my current implementation, I use parse_line same way as before.
And after that parsing, modify the (relatively) flat tree into truncated form, such as below:

  • MaybeEmphasis (1st, can open)
  • MaybeEmphasis (2nd, can open)
  • Text (Foo)

This is what I wanted to say by #522 (comment):

For example, this new method is used when turning __/foo {.bar/__/} into __/foo, since {.bar__} suffix should be removed as an attribute block.


Ideally, the suffix truncation should be done before parse_line, but it seems parse_line only accepts half-open byte slice.
Making it accept closed range of the bytes might be large diff, and I tried to avoid making such big change that would affect every part of the parsing.

Should I modify parse_line to accept not only start: usize but also end: Option<usize>?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed explanation, I had not considered that parse_line already adds these tree elements! It makes total sense why you do these things now.

However, I do think that it would be cleaner to scan for the attribute block first (if the option is set!) and only after that calling the parse_line function on the prefix. For two reasons: we wouldn't waste effort parsing the header as if it were normal markdown, and truncating the header text would become trivial.

We would indeed have to adjust parse_line it seems, and adding an optional end the way you suggest looks like a good way of doing it! 👍🏽

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried rewriting parse_atx_heading() without truncate_siblings() (by trimming the attribute block before parsing the line) and succeeded.

However, it seems hard to do the same for parse_setext_heading() since it transforms an already-parsed paragraph into a heading.
At the time lines are being parsed, the parser cannot know whether they are setext heading content.
So it will be necessary to modify already-parsed tree in that case and I'll continue to use truncate_siblings() for setext heading.

@marcusklaas
Copy link
Collaborator

Very good progress on this pull request, and a special kudos for the excellent test suite you have created!

I went through all the technical details this time, and left a few more comments. Most of them are small, with the only real significant thing to consider is wrapping the HeadingIndex in an Option.

This is mainly changes on types. Parsers for ID fragments and classes
are not yet implemented.
@lo48576
Copy link
Contributor Author

lo48576 commented Nov 22, 2021

Diffs since the previous push: https://github.com/raphlinus/pulldown-cmark/compare/4ce6d11dff48f4549c585400ac5a137269103a6b..ab1c47abc00acbe8ec9311e671aff316ab95c776

  • Rebased onto new master.
  • Add more tests and docs for whitespaces after the heading content, after the attribute blocks, and inside attribute blocks.
  • Fix inconsistency for "# H1\t{}" case.
    • Previously it was converted to "<h1>H1</h1>", but now it is converted to "<h1>H1\t</h1>".
    • "# H1\t" is converted to "<h1>H1\t</h1>", so the trimming became consistent now.
  • Use Option<HeadingIndex> rather than mandatory HeadingIndex.
    • Thanks to NonZeroUsize backend, type size of ItemBody doesn't change.
  • Split changes to test generator into another PR Sort auto-generated tests modules by spec names #546.

@lo48576
Copy link
Contributor Author

lo48576 commented Nov 23, 2021

  • Fix inconsistent close tags in expected HTML outputs in the spec file.
    • Test suite parses HTML semantically and normalizes it, so inconsistent close tags (such as <h1>foo</h2>) doesn't cause test failures and I overlooked mistakes.
  • Fix wrong wording.
    • If an ATX heading have closing #s, s/attribute blocks are placed before that/the attribute block should be placed after them/.
  • Add a test case with closing #s after {...}: #### non-attribute-block {#id4} ####.

src/firstpass.rs Outdated Show resolved Hide resolved
@marcusklaas
Copy link
Collaborator

marcusklaas commented Dec 1, 2021

Sorry for the long delay on my next pass here, Yoshioka. It's been a very busy week at work.

The good news is that this that besides the one comment I have on the parse_line issue, this PR is in excellent shape! If you could touch up that final point, I will merge it :-)

Thank you so much for your hard work and your patience. This will be a much awaited contribution to pulldown!

@lo48576 lo48576 force-pushed the feature/heading-attrs branch 2 times, most recently from 1816c7d to e1444b4 Compare December 2, 2021 14:40
@lo48576
Copy link
Contributor Author

lo48576 commented Dec 2, 2021

Diff from the previous push: https://github.com/raphlinus/pulldown-cmark/compare/966aefce25ecfa55fd43e3dbaf0f90b478ff9073..a17ae72a0a758e7da237c9c05ee378ef21a0eeba

  • In ATX headings, trim attribute blocks before parsing the line.
  • Fix inconsistent behaviour for trailing backslash in headings.
  • Reduce if-else-if chain and nest level in Tree::<Item>::truncate_siblings() by early return.

About trailing backslash in headings

Comparisons: https://babelmark.github.io/?text=%23+Foo+%5C%0ABar+%5C%0A--

commonmark-compliant parsers seem to preserve the trailing stray backslash: for example # Foo \ into <h1>Foo \</h1>.
However, pulldown-cmark (at the master branch) currently trims them.
I updated parsing of ATX headings to avoid use of truncate_siblings() and the stray backslash problem is fixed for ATX headings with attribute blocks as a side effect, so I also fixed it also for ATX headings without attribute blocks and setext headings.

@marcusklaas
Copy link
Collaborator

Just ran some benchmarks as a final check, and it seems this PR does not change the performance of the library in a measurable way. Let's merge this thing! 🚀

Thanks again for your hard work, patience and substantial contribution to the library @lo48576!

@marcusklaas marcusklaas merged commit e6c0577 into pulldown-cmark:master Dec 6, 2021
@lo48576 lo48576 deleted the feature/heading-attrs branch December 6, 2021 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants