-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support heading attribute block (especially ID and classes) #522
Support heading attribute block (especially ID and classes) #522
Conversation
Should classes support be merged at once, or be separate pull request? Another option is, split attributes to a dedicated Any ideas? |
3f61b99
to
a13d1b1
Compare
Hmm... Maybe attribute block parsing should be common dedicated function? |
Now attribute block content is parsed by |
dbd6836
to
9e0e45d
Compare
Added a test for attribute blocks for ATX heading with trailing |
Being able to add attributes, would be great (especially, if all Markdown elements are supported). Maybe other attributes, besides 'class' and 'id', can be supported as well? Something like:
This would make Markdown more powerful, and could help to make content more accessible (e.g. via 'aria' attributes). |
Yeah it is possible, but doing so requires public API to be changed breakingly (both from the current master, and from the code changed by this pull request), and deciding syntax would be harder since they might be less widely used. For example, what should
Discussion will be necessary to decide what the processor should do, and I've decided to avoid these problems.
So, I'm not going to support more attributes in this pull request. |
And additionally, this PR only supports heading as the first step (because I strongly want this). There would be something more to think if we want to support the syntax for all (or many) elements. |
(Somehow the GitHub notification ended up in my SPAM folder...) Thanks for your response, @lo48576
Sure, but this PR has already a breaking change. Maybe this could be activated via a feature flag? Not sure, if that would be an ideal solution.
The best option seems, to just pass this through as is, and let the browser handle it (which is ignoring the first attribute, I believe). Another option would be, to forbid duplicate attributes (however, I use duplicate attributes sometimes during debugging, so I'd prefer the former).
I think both should be supported. Something like: {#ul-id}
- foo {.ul-item}
- bar {.ul-item}
- baz {.ul-item}
Personally, I'd like this to fail.
Alright, got it. |
Are there issues with this PR, or is it just about the breaking change in the AST? Side-issueIs there something like a version for the AST, or a checksum, or tags, or key-value pairs? |
@hoijui Sorry I don't understand what you are worrying about, but if you are thinking about the possibility of silent breakage of downstream crates (i.e. crates depending on pulldown-cmark), it may not happen thanks to semantic versioning. This pull request has breaking changes around the AST, but this repo already seemed to have breaking changes since the last release (0.8.0) before this pull request starts, so I believe this pull request does not break the downstreams silently (as the next release would be 0.9.0). |
Indeed, semantic versioning could be there for the rescue! :-) Non-the less, meta-data annotated AST still makes sense, but it is of course a question of trade of between how much benefit it brings, and the added complexity; which is not trivial. I will try your branch of this PR (https://github.com/lo48576/pulldown-cmark/tree/feature/heading-attrs) as a dependency in our project, so it will get some real world testing from us too. |
Great feature, I'd love to see this merged! One thing: I think it would be nice to not require white-space separating the id and classes. E.g. |
@kaj I don't find the information that
Such IDs won't be recommended, but they are certainly valid and the converter shouldn't forbid users use such IDs. |
True, but it is not the result I would expect. And |
Ok, I understand.
Yes this is invalid, but I'm not sure the markdown processor should work as an validator in this case. I think the same applies to |
I've never seen anything that would interpret I'm not sure why it would be bad to not support the discouraged practice of using certain special characters in identifiers. Dots ('.') are extremely uncommon to use in ID or CSS class names because this character has a special meaning in CSS (it starts a CSS class name selector), which has to be escaped if used as part of an identifier. If someone really has to use an ID or CSS class name that contains a dot, then they still can do this via raw HTML. I don't think, it would be a good idea to decrease the user experience – and the expected output – for everybody just because of these extremely unlikely corner cases. I think a good compromise, if you really want to support dots in identifiers, would be to support escape sequences:
|
You are right.
I mean
Thank you for the info. I'm not aware of them since I've only seen people separating attributes with spaces. |
So the I feel that alone should be enough reason to do it this way. I haven't curated a list of tools that use the
Here is an online playground. If you type in
Unfortunately, Pandoc does what this PR currently does. And markdown-it-attrs does the same... I still think it's a mistake to follow them and to not do what most other tools that support the As I wrote above, the
Backslashes are the way to escape things in Markdown/CommonMark (and almost everywhere else), so I don't get what the problem is here. |
I don't think markdown's extension syntax should be only borrowed from other language's syntax. I said "I don't want to make this implementation a new dialect" in the previous post. "Space-separated attributes are expanded. That's all."-rule is quite simple, and users can quite easily imagine how it will work. However, "attributes are written in original syntax similar to CSS, but it can be also space-separated."-rule will harder to imagine correctly, and the gap between the users' imagination and the actual implementation may burden users. More rules for "convenience" makes the specification more difficult to understand, and it may make "extremely unlikely corner cases" more likely. |
I found useful site just now, and compared how It seems like conversion to |
Alight. I'm not really interested in spending too much time on discussing this. I think both options have advantages and disadvantages. The current version will confuse some users because the output will be unexpected to them, but they will figure it out quickly on their own. If the documentation mentions the requirement for whitespace clearly, it will be even less of a problem. |
Thank you, added a doc comments about that. |
It seems like this is the problem on the CI settings side, not the code side. |
Added test cases for usual usage, where inlines are used in headings. |
src/parse.rs
Outdated
@@ -79,7 +79,7 @@ pub(crate) enum ItemBody { | |||
TaskListMarker(bool), // true for checked | |||
|
|||
Rule, | |||
Heading(HeadingLevel), // heading level | |||
Heading(HeadingLevel, HeadingIndex), // heading level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be faster for headings without id or classes if we'd wrap the HeadingIndex
in an option? Then we may be able to skip allocations and indirections for that case. It does add another branch, but I suspect it would pay off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naively using Option<HeadingIndex>
makes ItemBody
bigger (24 bytes on x64).
It can be easily avoided by using NonZeroUsize
as the underlying type of HeadingIndex
, instead of usize
.
I'm going to modify HeadingIndex
to use NonZeroUsize
as I think it is worth doing, but tell me if you think the overhead of using NonZeroUsize
is unacceptable.
}</h2> | ||
```````````````````````````````` | ||
|
||
Note that escaping only opening braces does not prevent the braces from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this would be a pain to parse properly, but this is mildly surprising. Do other parsers have the same behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
babelmark3 | Compare Markdown Implementations
No parsers seems to be able to recognize multi-line attribute blocks.
I don't know how other implementations not supported by babelmark3 would handle such cases, but I think it would be safe to assume no users write attribute blocks in multiline form (as processors cannot handle such attributes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, I meant escaping opening braces!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
babelmark3 | Compare Markdown Implementations
- Opening braces can be escaped: php-markdown-extra 1.9.1, pandoc 2.16.2
- Escape doesn't work (same as current impl.): markdig (advanced) 0.26.0.0, maruku 0.7.3.beta1
There seems no common behavior...
It would be natural if \{ ... }
is not parsed as an attribute block.
And it'll not so hard to support such escapes, since {
itself is prohibited in attribute blocks and we don't need to care multiple escapes in various contexts.
I have no strong preference for how this should be parsed. Both of "attribute blocks are trimmed first, before parsing" and "escaped character doesn't behave specially" sounds natural to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, I remembered why I avoided supporting escaping.
As I wrote at #522 (comment), it is hard to decide how # h1 {.foo\{.bar}
should be converted.
- Users' expectation with escaping support would be: content=
h1
, class=foo{.bar
.- The second
{
is escaped so should be treated as a plain character, and then the first{
is opening of the block!
- The second
- However, with escaping support, this would be converted to: content=
h1 {.foo{.bar}
, class=None.- This is because
{
cannot appear in attribute blocks.
- This is because
- Current implementation: content=
h1 {.foo\
, class=bar
.- This is because
{.bar}
is literally trimmed before processing escapes.
- This is because
I don't like this ambiguity or unexpectedness when escaping opening braces is allowed, so I decided to trim the block before processing any escapes.
But I'm not confident this is better, so I'll change implementation if you have some preference.
specs/heading_attrs.txt
Outdated
````````````````````````````````.example | ||
# H1__ {#my__id1} | ||
## H2** {#my**id2} | ||
### H3 ` {.code` } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent test case. I do wonder if we are in line with other parsers though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
babelmark3 | Compare Markdown Implementations
- Same:
- markdig 0.26.0.0 and php-markdown-extra 1.9.0 converts to the same result as this PR.
- Similar:
- maruku 0.7.3.beta1 seems to try to do the same, but fails with handling of the unclosed inlines (unclosed
__baz
and__qux
are converted to<strong>...
).
- maruku 0.7.3.beta1 seems to try to do the same, but fails with handling of the unclosed inlines (unclosed
- Different:
- cebe/extra 1.2.0, pandoc 2.16.2, and kramdown 1.2.0 gave up on
# bar __baz {#my__id2}
case. - cebe/extra converts
__baz {#my__
into<strong>...</strong>
. - pandoc seems to fail to convert inlines, so
__baz {#my__
is written as is. - kramdown even fails to make it a heading.
- cebe/extra 1.2.0, pandoc 2.16.2, and kramdown 1.2.0 gave up on
There seems no "common" way to handle this, but there are roughly two patterns: "blocks are trimmed first" and "give up when ambiguous".
(I'm a bit surprised that there is no "inlines takes precedence" pattern.)
I think it is better to use simple rule "attribute blocks are trimmed first", since:
- Class names such as
.foo__bar
would be widely used and I want them to be writable as{ .foo__bar }
.- I don't think it is good if programatically added
{.foo__bar}
makes the parsing "ambiguous".
- I don't think it is good if programatically added
- "Give up" patterns have variations, and it is less clear what (should) happens when "ambiguous".
src/tree.rs
Outdated
impl Tree<Item> { | ||
/// Truncates the preceding siblings to the given end position, | ||
/// and returns the new current node. | ||
pub(crate) fn truncate_siblings(&mut self, end_byte_ix: usize) -> Option<TreeIndex> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only part of the PR I don't fully understand. Why do we need this? I understand we need to truncate the header texts, but it's all one node right? It doesn't get split into multiple nodes until the second parsing phase. Why can't we just adjust the header end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because:
- Truncating prefix of the header before parsing seems easy, but truncating suffix would be non-trivial.
- Parsing by
FirstPass::parse_line
seems to include some preparation for the second pass.
Consider the input is # **Foo {#bar** }
.
This is ATX heading so FirstPass::parse_atx_heading
is called.
At line 1012 on firstpass.rs (master branch), text[ix..]
is **Foo {#bar** }
, and the parser tries to parse and get a line.
This is, (1) truncating prefix is easy but truncating suffix won't.
Then, FirstPass::parse_line
actually does more job than getting a line.
It creates a tree for the input **Foo {#bar** }\n...
(half-open range) as below:
- MaybeEmphasis (1st, can open)
- MaybeEmphasis (2nd, can open)
- Text (
Foo {#bar
) - MaybeEmphasis (1st, can close)
- MaybeEmphasis (2nd, can close)
- Text (
}
)
Second pass utilizes this MaybeEmphasis
's to create real tree structure.
This is, (2) parsing by FirstPass::parse_line
seems to include some preparation for the second pass.
In my current implementation, I use parse_line
same way as before.
And after that parsing, modify the (relatively) flat tree into truncated form, such as below:
- MaybeEmphasis (1st, can open)
- MaybeEmphasis (2nd, can open)
- Text (
Foo
)
This is what I wanted to say by #522 (comment):
For example, this new method is used when turning
__
/foo {.bar
/__
/}
into__
/foo
, since{.bar__}
suffix should be removed as an attribute block.
Ideally, the suffix truncation should be done before parse_line
, but it seems parse_line
only accepts half-open byte slice.
Making it accept closed range of the bytes might be large diff, and I tried to avoid making such big change that would affect every part of the parsing.
Should I modify parse_line
to accept not only start: usize
but also end: Option<usize>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed explanation, I had not considered that parse_line
already adds these tree elements! It makes total sense why you do these things now.
However, I do think that it would be cleaner to scan for the attribute block first (if the option is set!) and only after that calling the parse_line
function on the prefix. For two reasons: we wouldn't waste effort parsing the header as if it were normal markdown, and truncating the header text would become trivial.
We would indeed have to adjust parse_line
it seems, and adding an optional end the way you suggest looks like a good way of doing it! 👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried rewriting parse_atx_heading()
without truncate_siblings()
(by trimming the attribute block before parsing the line) and succeeded.
However, it seems hard to do the same for parse_setext_heading()
since it transforms an already-parsed paragraph into a heading.
At the time lines are being parsed, the parser cannot know whether they are setext heading content.
So it will be necessary to modify already-parsed tree in that case and I'll continue to use truncate_siblings()
for setext heading.
Very good progress on this pull request, and a special kudos for the excellent test suite you have created! I went through all the technical details this time, and left a few more comments. Most of them are small, with the only real significant thing to consider is wrapping the |
This is mainly changes on types. Parsers for ID fragments and classes are not yet implemented.
4ce6d11
to
ab1c47a
Compare
Diffs since the previous push: https://github.com/raphlinus/pulldown-cmark/compare/4ce6d11dff48f4549c585400ac5a137269103a6b..ab1c47abc00acbe8ec9311e671aff316ab95c776
|
ab1c47a
to
966aefc
Compare
|
Sorry for the long delay on my next pass here, Yoshioka. It's been a very busy week at work. The good news is that this that besides the one comment I have on the Thank you so much for your hard work and your patience. This will be a much awaited contribution to pulldown! |
1816c7d
to
e1444b4
Compare
Diff from the previous push: https://github.com/raphlinus/pulldown-cmark/compare/966aefce25ecfa55fd43e3dbaf0f90b478ff9073..a17ae72a0a758e7da237c9c05ee378ef21a0eeba
About trailing backslash in headingsComparisons: https://babelmark.github.io/?text=%23+Foo+%5C%0ABar+%5C%0A-- commonmark-compliant parsers seem to preserve the trailing stray backslash: for example |
e1444b4
to
a17ae72
Compare
Just ran some benchmarks as a final check, and it seems this PR does not change the performance of the library in a measurable way. Let's merge this thing! 🚀 Thanks again for your hard work, patience and substantial contribution to the library @lo48576! |
TODO
{#id}
){.class}
is simply ignored.Options::ENABLE_HEADING_ATTRIBUTES
.{.class1 .class2}
)While this is WIP branch, feel free to give me advice.Now this branch is ready to merge.
Summary
By this patch, section headings will be able to have ID and classes.
For example,
# H1 {#id1 .heading}
would be converted to<h1 id="id1" class="heading">H1</h1>
.This is a breaking change:
Tag::Heading
will have multiple fields (ID and classes) instead of singleHeadingLevel
.This solves #424.