md_in_html: broken `code span` #1068

dbader · 2020-11-16T18:01:16Z

markdown.extensions.md_in_html fails to escape HTML tags in monospace text placed inside a markdown="1" div wrapper:

>>> import markdown
>>> markdown.markdown('<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["markdown.extensions.md_in_html"])
'<p><h1><div markdown="block">\n`escaped&lt;/h1&gt;`\n</div></p>'

<p><h1><div markdown="block">
`escaped&lt;/h1&gt;`
</div></p>

The inner <code> element should look like this: <code>&lt;h1&gt;escaped&lt;/h1&gt;</code>, but instead the h1 inside the monospace text appears as an actual <h1> tag in the output.
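For reference, the expected escaping can be reproduced with the standard library alone. This is only a sketch of what a correctly escaped code span looks like, not how Python-Markdown builds it internally; the variable names are illustrative.

```python
import html

# The raw text that sits between the backticks in the report.
code_text = "<h1>escaped</h1>"

# A code span should carry its content as escaped text, never as live markup.
expected = "<code>" + html.escape(code_text) + "</code>"
print(expected)
```

The printed value matches the `<code>` element in the 3.2.2 output reported later in this thread.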

Versions

Markdown==3.3.3
CPython 3.8.6


dbader · 2020-11-16T18:06:57Z

Interestingly this seems to work correctly on Markdown==3.2.2:

>>> import markdown
>>> markdown.markdown('<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["markdown.extensions.md_in_html"])
'<div>\n<p><code>&lt;h1&gt;escaped&lt;/h1&gt;</code></p>\n</div>'

dbader · 2020-11-16T18:10:35Z

Btw I super appreciate the work you guys are doing here ❤️ I'm using Python-Markdown for realpython.com (also your amazing PyMdown extensions @facelessuser) and it's a pleasure using this library. Let me know if I can provide additional info here to help track this down 🙂

facelessuser · 2020-11-16T19:15:57Z

> Interestingly this seems to work correctly on Markdown==3.2.2

This is because a new HTML parser was introduced in 3.3.

Though the first wave of bugs was the kind I expected, I'm starting to get a little concerned about the new parser. The handling of block elements in inline code is a little troubling, coupled with some of the recent bugs.

As far as the code block part goes, that is one thing the old parser took into consideration. It understood that it didn't have all the context it needed, as Python Markdown actually takes multiple passes, while some other parsers tokenize everything in one pass.

Unfortunately, since we do not tokenize everything in one pass, I really do think block HTML logic should only come into play when the block tag is at the start of a line. We should only process inline tags once we've processed a Markdown block with the code step.
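To make the idea concrete, here is a minimal sketch that ignores any tag falling inside a backtick code span. It is a deliberately naive regex approach and not how Python-Markdown is implemented; `CODE_SPAN`, `TAG`, and `tags_outside_code_spans` are hypothetical names made up for this example.

```python
import re

# Naive patterns, assumed for illustration only.
CODE_SPAN = re.compile(r"(`+)(.+?)\1", re.DOTALL)
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tags_outside_code_spans(text):
    """Return the names of tags that do not start inside a code span."""
    spans = [m.span() for m in CODE_SPAN.finditer(text)]
    found = []
    for m in TAG.finditer(text):
        inside = any(start <= m.start() < end for start, end in spans)
        if not inside:
            found.append(m.group(1))
    return found


source = '<div markdown="1">\n`<h1>escaped</h1>`\n</div>'
result = tags_outside_code_spans(source)
print(result)
```

With the input from the report, only the opening and closing `div` survive; the `h1` pair is left alone because it sits between backticks.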

It may be that we pulled the trigger too soon on the HTML parser, but I understand why we did as at the time it was passing all the known tests. We are running into scenarios that we just didn't have tests for that we probably should have.

I'm curious about @waylan's opinion on the recent issues, and how we should move forward with the latest HTML parser.

waylan · 2020-11-16T21:15:23Z

@facelessuser, I agree with and share your concerns and assessment. I thought we had good test coverage. However, it is looking more and more like that is not the case.

> We should only process inline tags once we've processed a Markdown block with the code step.

While that would be an ideal approach, we can't tell the HTML parser to only parse this and ignore that. It parses everything we pass it. What we have done is then check each token and if it is not a block-level tag, handle it differently. Apparently, there are some cases we aren't covering.

If we really want to not parse non-block level HTML at all at this stage, then we need to abandon use of html.parser.
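The point about html.parser can be seen with a small standard-library sketch. `TagLogger` is a hypothetical helper written for this illustration; it simply records every event the parser reports for the input from the report.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every event html.parser reports, in order."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = TagLogger()
parser.feed('<div markdown="1">\n`<h1>escaped</h1>`\n</div>')
parser.close()
events = parser.events
for event in events:
    print(event)
```

The parser knows nothing about backticks, so the `h1` inside the code span is reported as an ordinary start tag. Any logic that wants to treat it as literal text has to be layered on top of these events.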

waylan · 2020-11-16T21:27:08Z

BTW, I'm not seeing the reported behavior. Instead I get this output:

<div>
<p>`</p>
<h1>escaped</h1>
<p>`</p>
</div>

And without the md_in_html extension, I get:

<div markdown="1">
`<h1>escaped</h1>`
</div>

which is correct. My guess is that the extension fails to replicate the logic in the core which accounts for the tag not being at the start of the line.

waylan · 2020-11-17T14:13:44Z

Okay, this is really weird. I'm getting different behavior from a single-line string literal than from a triple-quoted string.

>>> markdown.markdown('<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["md_in_html"])
'<p><h1><div markdown="block">\n`escaped&lt;/h1&gt;`\n</div></p>'
>>> src = """
... <div markdown="1">
... `<h1>escaped</h1>`
... </div>
... """
>>> markdown.markdown(src, extensions=["md_in_html"])
'<div>\n<p>`</p>\n<h1>escaped</h1>\n<p>`</p>\n</div>'

Turns out the newline before the opening <div> tag is causing different behavior.

>>> markdown.markdown('\n<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["md_in_html"])
'<div>\n<p>`</p>\n<h1>escaped</h1>\n<p>`</p>\n</div>'
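The difference between the two inputs is easy to overlook, so here is a short standard-library check. It only compares the strings themselves and does not call Python-Markdown; the variable names are illustrative.

```python
single_line = '<div markdown="1">\n`<h1>escaped</h1>`\n</div>'

triple_quoted = """
<div markdown="1">
`<h1>escaped</h1>`
</div>
"""

# The triple-quoted form begins (and ends) with a newline the other form lacks.
print(repr(triple_quoted[:6]))
print(triple_quoted.strip("\n") == single_line)
```

Apart from the surrounding newlines the two sources are identical, which is why the leading newline is the variable worth isolating.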

waylan · 2020-11-17T14:28:36Z

The bug which is mixing up the order of the elements was introduced in 2766698. Without that commit, we consistently get the output:

<div>
<p>`</p>
<h1>escaped</h1>
<p>`</p>
</div>

facelessuser · 2020-11-17T14:36:10Z

Ugh, did I break it? 😞

I'll have to take a look then and see where things went wrong.

waylan · 2020-11-17T15:23:02Z

So this is directly related to the issue:

markdown/markdown/extensions/md_in_html.py

Lines 89 to 95 in 2766698

    
    def at_line_start(self):
        """At line start."""
        value = super().at_line_start()
        if not value and self.cleandoc and self.cleandoc[-1].endswith('\n'):
            value = True
        return value

That if statement on lines 93 & 94 prevents the issue from happening in the case where a newline precedes the div. Removing those 2 lines causes the error to occur consistently everywhere. However, before those lines were added, the error didn't occur, so something else in 2766698 introduced the error. It was only hidden by lines 93 & 94. I suspect that if we can remove the error, we can completely remove lines 93 & 94.

waylan · 2020-11-17T16:00:28Z

> if not value and self.cleandoc and self.cleandoc[-1].endswith('\n'):

I never understood why you added that line. And thinking about it now, it still doesn't make sense.

For example, suppose a starttag is at the beginning of the document. Then self.cleandoc would be empty, causing the entire statement to be False, but it should still equate to True. In fact, this is exactly what is causing the immediate issue.

super().at_line_start uses self.rawdata (the original source), not the processed output to determine position. Shouldn't that be what we use here? In what scenario should we act as if we are at the start of a line when we are not in the original source?
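The two checks being contrasted can be sketched in isolation. This is a simplified model, not the library's code: `at_line_start_raw` and `cleandoc_check` are hypothetical stand-alone functions, the three-character indent limit is an assumption, and `cleandoc` is modeled as a plain list of already-emitted strings.

```python
def at_line_start_raw(rawdata, pos):
    """True if only whitespace (at most 3 chars, an assumption) precedes
    pos on its line in the original source."""
    line_start = rawdata.rfind("\n", 0, pos) + 1
    indent = rawdata[line_start:pos]
    return len(indent) <= 3 and indent.strip() == ""


def cleandoc_check(cleandoc):
    """The extra condition from the quoted `if` statement, taken alone."""
    return bool(cleandoc) and cleandoc[-1].endswith("\n")


src = '<div markdown="1">\n`<h1>escaped</h1>`\n</div>'

print(at_line_start_raw(src, src.index("<div")))  # tag opens the document
print(at_line_start_raw(src, src.index("<h1")))   # tag follows a backtick
print(cleandoc_check([]))                         # nothing emitted yet
print(cleandoc_check(["\n"]))                     # a newline was emitted
```

The source-based check depends only on what precedes the tag in the input. The output-based check flips with whatever happens to have been emitted so far, which fits the observation that a leading newline changes the result.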

facelessuser · 2020-11-17T16:04:59Z

I'll take a look and reevaluate. I'll have to refresh myself on the issue I was trying to avoid.

facelessuser · 2020-11-17T16:19:29Z

> For example, suppose a starttag is at the beginning of the document. Then self.cleandoc would be empty, causing the entire statement to be False, but it should still equate to True. In fact, this is exactly what is causing the immediate issue.
>
> super().at_line_start uses self.rawdata (the original source), not the processed output to determine position. Shouldn't that be what we use here? In what scenario should we act as if we are at the start of a line when we are not in the original source?

While I don't have data right now, I do know I was seeing an issue. I think at_line_start maybe wasn't always set when I expected it to be. I guess it is possible, though, that I was mistaken. I'll know better when I take a closer look.

waylan · 2020-11-17T20:28:21Z

I worked out the issue. You were trying to account for tails. Rather than following the method used in the core, you devised a different approach. I have addressed both that and the present issue in #1069. Although, at present, there are still a few failing tests.

facelessuser · 2020-11-17T20:50:31Z

Awesome. Yeah, there are still some things I wasn't sure about with the new parser.

waylan changed the title ~~markdown.extensions.md_in_html: broken monospace HTML escaping~~ md_in_html: broken code span Nov 17, 2020

waylan mentioned this issue Nov 17, 2020

Properly parse code spans in md_in_html #1069

Merged

waylan mentioned this issue Nov 18, 2020

unclosed tag in code span #1066

Closed

waylan closed this as completed in 81cc5b8 Nov 18, 2020

waylan mentioned this issue Oct 23, 2023

Speedup line_offset property #1392

Merged
