Strange and inconsistent parsing of lists with headers and multiple lines #1433

Closed
Andre601 opened this issue Jan 19, 2024 · 6 comments
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. core Related to the core parser code.

Comments

@Andre601

A strange parsing behaviour can be observed with the markdown library when used on Lists containing headers and multiple lines.

Problem

When a list entry contains a header followed by text indented with 4 spaces, the first entry renders fine.
However, in every subsequent entry the text after the header is rendered as a code block, because lines indented by 4 spaces are turned into code blocks.

Strangely enough, this behaviour can only be observed if there is a blank line between two list entries. If the entries directly follow each other, the result is as expected (the text renders as a normal paragraph).
Even stranger is the behaviour with additional paragraphs. If a list entry has more than one paragraph, meaning there is a blank line between the first and second text after the header, the second paragraph renders fine while the first still shows this odd behaviour.

See my tests below for possible results.

There are workarounds for this.

The first is to have no blank lines between the list entries. This is the easiest way to keep things consistent, but may worsen readability of the raw content.

The second is to indent the first paragraph with 2 spaces instead of 4. This may cause visual inconsistencies in the raw text if an entry has more than one paragraph, since the second paragraph still needs 4 spaces of indentation or it won't be included in the list entry.


Tests

All tests were run with python -m markdown test.txt, where test.txt contained the Markdown shown under "Input" for each test.

Test 1

Base-line test showing the issue.

Input:

- ### List 1
    Entry 1.1

- ### List 2
    Entry 2.1

- ### List 3
    Entry 3.1

Output:

<ul>
<li>
<h3>List 1</h3>
<p>Entry 1.1</p>
</li>
<li>
<h3>List 2</h3>
<pre><code>Entry 2.1
</code></pre>
</li>
<li>
<h3>List 3</h3>
<pre><code>Entry 3.1
</code></pre>
</li>
</ul>
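
For anyone who wants to check Test 1 without a test.txt file, here is a minimal sketch that feeds the same input to markdown.markdown() and counts the resulting code blocks. The names SOURCE and code_blocks are only illustrative. The count depends on the installed Python-Markdown version: the output above shows 2 on an affected version, and a version containing the fix should report 0.

```python
import markdown

# Test 1 input from this report: a heading plus 4-space-indented text,
# with a blank line between list entries.
SOURCE = "\n".join([
    "- ### List 1",
    "    Entry 1.1",
    "",
    "- ### List 2",
    "    Entry 2.1",
    "",
    "- ### List 3",
    "    Entry 3.1",
])

html = markdown.markdown(SOURCE)

# Number of entries that were rendered as code blocks instead of paragraphs.
code_blocks = html.count("<pre><code>")

print(html)
print("code blocks:", code_blocks)
```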

Test 2

Test with multiple paragraphs

Input:

- ### List 1
    Entry 1.1
    
    Entry 1.2

- ### List 2
    Entry 2.1
    
    Entry 2.2

- ### List 3
    Entry 3.1
    
    Entry 3.2

Output:

<ul>
<li>
<h3>List 1</h3>
<p>Entry 1.1</p>
<p>Entry 1.2</p>
</li>
<li>
<h3>List 2</h3>
<pre><code>Entry 2.1
</code></pre>
<p>Entry 2.2</p>
</li>
<li>
<h3>List 3</h3>
<pre><code>Entry 3.1
</code></pre>
<p>Entry 3.2</p>
</li>
</ul>

Test 3

Test 1, but with spaces between entries removed.

Input:

- ### List 1
    Entry 1.1
- ### List 2
    Entry 2.1
- ### List 3
    Entry 3.1

Output:

<ul>
<li>
<h3>List 1</h3>
<p>Entry 1.1</p>
</li>
<li>
<h3>List 2</h3>
<p>Entry 2.1</p>
</li>
<li>
<h3>List 3</h3>
<p>Entry 3.1</p>
</li>
</ul>

Test 4

Test 2, but with indents adjusted for first paragraph (Note the render issue with Entry 1).

Input:

- ### List 1
  Entry 1.1
    
    Entry 1.2

- ### List 2
  Entry 2.1
    
    Entry 2.2

- ### List 3
  Entry 3.1
    
    Entry 3.2

Output:

<ul>
<li>
<h3>List 1</h3>
  Entry 1.1<p>Entry 1.2</p>
</li>
<li>
<h3>List 2</h3>
<p>Entry 2.1</p>
<p>Entry 2.2</p>
</li>
<li>
<h3>List 3</h3>
<p>Entry 3.1</p>
<p>Entry 3.2</p>
</li>
</ul>

Test 5

Test 4, but first entry does not have its indents adjusted.

Input:

- ### List 1
    Entry 1.1
    
    Entry 1.2

- ### List 2
  Entry 2.1
    
    Entry 2.2

- ### List 3
  Entry 3.1
    
    Entry 3.2

Output:

<ul>
<li>
<h3>List 1</h3>
<p>Entry 1.1</p>
<p>Entry 1.2</p>
</li>
<li>
<h3>List 2</h3>
<p>Entry 2.1</p>
<p>Entry 2.2</p>
</li>
<li>
<h3>List 3</h3>
<p>Entry 3.1</p>
<p>Entry 3.2</p>
</li>
</ul>
@Andre601
Author

Forgot to add another solution/workaround.
Adding an empty line after the header also prevents the code block issue.

I would assume that this is some block-related rendering behaviour?
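
The blank-line workaround could be applied mechanically. The following is a rough sketch only: add_blank_line_after_headings and HEADING_RE are hypothetical names, the pattern only recognises "#"-style headings (optionally behind a list marker), and it does not skip fenced code blocks.

```python
import re

# A "#"-style heading, optionally preceded by a list marker such as "- " or "1. ".
HEADING_RE = re.compile(r"^\s*(?:[-*+]\s+|\d+\.\s+)?#{1,6}\s+\S")


def add_blank_line_after_headings(text):
    """Insert an empty line after every heading that is directly followed by text."""
    lines = text.splitlines()
    result = []
    for index, line in enumerate(lines):
        result.append(line)
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        if HEADING_RE.match(line) and next_line.strip():
            result.append("")
    return "\n".join(result)


fixed = add_blank_line_after_headings("- ### List 1\n    Entry 1.1")
print(fixed)
```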

@facelessuser
Collaborator

I do agree it is weird that there are some cases where the paragraph under the header is getting turned into code blocks. I'm not sure if this is a list issue or a header extension issue within lists. I do know that lists especially have a few quirky issues like this. I do think behavior should be more consistent in lists. The fact that headers handle this case outside of lists fine but have issues in lists should probably be looked into.

With that said, for the most consistent behavior, it is always best to keep blocks separate. Generally, Python Markdown expects blocks to have new lines between them.

import markdown

MD = """
-   ### List 1

    Entry 1.1

    Entry 1.2

-   ### List 2

    Entry 2.1

    Entry 2.2

-   ### List 3

    Entry 3.1

    Entry 3.2
"""

html = markdown.markdown(
    MD,
    extensions=[],
)

print(html)

Output:

<ul>
<li>
<h3>List 1</h3>
<p>Entry 1.1</p>
<p>Entry 1.2</p>
</li>
<li>
<h3>List 2</h3>
<p>Entry 2.1</p>
<p>Entry 2.2</p>
</li>
<li>
<h3>List 3</h3>
<p>Entry 3.1</p>
<p>Entry 3.2</p>
</li>
</ul>

@waylan
Member

waylan commented Jan 19, 2024

I haven't looked closely at each example given yet (I will when I have time), but the first thing I would check is the reference implementation. Is our behavior any different? For any example that our behavior matches the reference implementation, I would expect that to be the correct behavior (unless it is clearly a bug in the reference implementation, which does happen on occasion). If however, the behavior between implementations differs, then we probably have a bug here.

As a general observation, there are a lot of subtleties with list parsing, especially when you get into the differences between tight (no blank lines between items) and loose (blank lines between items) lists. As loose list items always contain block level children, I can see an argument that any list item which contains a heading (which is clearly a block level element) should get loose list behavior even without the blank lines, but that is not how the reference implementation works, so we don't either. I'm assuming that this is what is leading to the unexpected output.
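
To make the tight/loose distinction concrete, here is a small sketch using markdown.markdown(). TIGHT and LOOSE are just illustrative inputs: the tight list has no blank line between its items, the loose one does. Printing both should show the loose items wrapped in paragraphs and the tight items left as inline text.

```python
import markdown

# Tight list: items directly follow each other.
TIGHT = "- one\n- two\n"

# Loose list: items are separated by a blank line.
LOOSE = "- one\n\n- two\n"

tight_html = markdown.markdown(TIGHT)
loose_html = markdown.markdown(LOOSE)

print(tight_html)
print(loose_html)
```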

With that said, for the most consistent behavior, it is always best to keep blocks separate. Generally, Python Markdown expects blocks to have new lines between them.

This is generally good advice. Yes, it is true that Markdown can work with all sorts of weird edge cases. However, for consistent results across all implementations I always format all of my Markdown according to the strictest linting rules, such as always including a blank line between all block level elements, no matter what. That has become especially important with the popularity of Commonmark, which handles many edge cases differently than old-school Markdown. My Markdown always renders the same with both Commonmark (on GitHub) and Python-Markdown (on my own sites) because I follow those strict linting rules and I avoid the various weird behaviors raised here.

To be clear, I am not suggesting that we shouldn't bother to fix an edge case where the behavior is clearly wrong just because it can be avoided by using a stricter set of rules. What I am saying is that because the correct behavior (as defined by Markdown's rather vague syntax rules) is not always clear, it is easier to avoid surprises if you stick to those stricter rules. In fact, for the documentation on this project, we run all proposed changes through the linter tool to enforce those stricter rules.
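
As an illustration of the "blank line after every heading" rule only (this is not the linter tool mentioned above), a check could be sketched as follows. headings_missing_blank_line and HEADING_RE are hypothetical names; the pattern covers "#"-style headings, optionally behind a list marker, and ignores fenced code blocks entirely.

```python
import re

# A "#"-style heading, optionally preceded by a list marker such as "- " or "1. ".
HEADING_RE = re.compile(r"^\s*(?:[-*+]\s+|\d+\.\s+)?#{1,6}\s+\S")


def headings_missing_blank_line(text):
    """Return 1-based line numbers of headings directly followed by non-blank text."""
    lines = text.splitlines()
    flagged = []
    for index, line in enumerate(lines):
        if not HEADING_RE.match(line):
            continue
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        if next_line.strip():
            flagged.append(index + 1)
    return flagged


# Test 1 input from this issue: every heading is followed directly by text.
TEST_1 = (
    "- ### List 1\n"
    "    Entry 1.1\n"
    "\n"
    "- ### List 2\n"
    "    Entry 2.1\n"
    "\n"
    "- ### List 3\n"
    "    Entry 3.1\n"
)

print(headings_missing_blank_line(TEST_1))
```

Run against the blank-line-separated version of the list, the same function returns an empty list.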

@Andre601
Author

Andre601 commented Jan 21, 2024

Something I want to point out real quick.

The linting rule you linked shows 2 spaces as the proper indent, which is also the default, yet your Markdown parser requires 4 spaces for proper indents, no matter what.
Why?

@waylan
Member

waylan commented Jan 22, 2024

Had a chance to look at these.

Test 1 and Test 2 both demonstrate the same bug. There should be no code blocks (paragraphs instead). What is really strange is that the first item is correct, but the subsequent items are wrong.

Test 3 looks correct, but when you check against the reference implementation, it is also wrong. I think this one is interesting in that because there are no blank lines, the reference implementation sees it as a tight list. Presumably, the idea is that a tight list item does not contain any block level children. Therefore it is parsed as inline text only. markdown.pl returns the following result:

<ul>
<li>### List 1
Entry 1.1</li>
<li>### List 2
Entry 2.1</li>
<li>### List 3
Entry 3.1</li>
</ul>

According to Babelmark, there is a lot of variability across implementations with this one. Not sure what to think about it. Regardless, I am inclined to not treat this as the same bug. In fact, I may ignore it altogether.

I'm not sure what is going on with Test 4 as a heading should never be more than one line (a heading always ends at the first newline). However, what is even more curious, is that this specific edge case results in the bug in Tests 1 and 2 being avoided. Add the additional indentation, and we get those issues back. It looks like Test 5 is a workaround to avoid the issues. I suspect Tests 4 and 5 will help in working out what is causing the issues in Tests 1 and 2.

Thanks for posting this. This is clearly a bug. A bug I never would have found as I always follow a heading with a blank line in my own documents.

@waylan waylan added bug Bug report. core Related to the core parser code. confirmed Confirmed bug report or approved feature request. labels Jan 22, 2024
@waylan
Member

waylan commented Jan 22, 2024

The reason these edge cases are not so clear is because lists support hanging indents. For example, these two list items are parsed the same way:

-   line one of one paragraph
    line 2 of the same paragraph

-   line one of one paragraph
line 2 of the same paragraph 

However, because a heading can only ever be one line, the second line is forced to start a new paragraph, which is unintuitive. For example, the following two list items get parsed very differently:

- # A Heading
A paragraph in the list item.

- # A Heading

A paragraph outside the list.

Yet, when we take those out of a list, then they get parsed the same.

# A Heading
A paragraph

# A Heading

A paragraph
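
The non-list case can be checked with a short sketch like the one below, which renders both forms with markdown.markdown() and compares the results. The variable names are only illustrative.

```python
import markdown

# Heading immediately followed by text, and the same with a blank line between.
WITHOUT_BLANK = "# A Heading\nA paragraph"
WITH_BLANK = "# A Heading\n\nA paragraph"

html_without = markdown.markdown(WITHOUT_BLANK)
html_with = markdown.markdown(WITH_BLANK)

print(html_without)
print(html_with)
print("same output:", html_without == html_with)
```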

All of these differences make for a challenge when developing a parser that works consistently and unsurprisingly. An additional complication is that the rules are not comprehensive and some edge cases of the reference implementation don't seem to be consistent with what one might expect having read the rules. I suppose that is why Commonmark completely abandoned the original rules and reimplemented a completely different scheme for parsing list items. But we are not a Commonmark parser, so we are stuck with the weirdness that is old-school Markdown.

waylan added a commit to waylan/markdown that referenced this issue Jan 23, 2024
This is a weird edge case. Normally, detabbing would be handled by
the `ListIndentProcessor`. However, in this one case, that class's
`get_level` method would need to return a different result than in
any other case. As there is no way to easily determine this specific
case from that class, we make the adjustment directly in the
`HashHeaderProcessor` class. Fixes Python-Markdown#1433.
@waylan waylan closed this as completed Jan 29, 2024
@waylan waylan reopened this Jan 29, 2024
@waylan waylan closed this as completed in c334a3e Jan 29, 2024