Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

md_in_html: broken code span #1068

Closed
dbader opened this issue Nov 16, 2020 · 14 comments
Closed

md_in_html: broken code span #1068

dbader opened this issue Nov 16, 2020 · 14 comments

Comments

@dbader
Copy link
Contributor

dbader commented Nov 16, 2020

markdown.extensions.md_in_html fails to escape HTML tags in monospace text placed inside a markdown="1" div wrapper:

>>> import markdown
>>> markdown.markdown('<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["markdown.extensions.md_in_html"])
'<p><h1><div markdown="block">\n`escaped&lt;/h1&gt;`\n</div></p>'
<p><h1><div markdown="block">
`escaped&lt;/h1&gt;`
</div></p>

The inner <code> element should look like this: <code>&lt;h1&gt;escaped&lt;/h1&gt;</code>, but instead the h1 inside the monospace text appears as an actual <h1> tag in the output.

Versions

  • Markdown==3.3.3
  • CPython 3.8.6
@dbader
Copy link
Contributor Author

dbader commented Nov 16, 2020

Interestingly this seem to work correctly on Markdown==3.2.2:

>>> import markdown
>>> markdown.markdown('<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["markdown.extensions.md_in_html"])
'<div>\n<p><code>&lt;h1&gt;escaped&lt;/h1&gt;</code></p>\n</div>

@dbader
Copy link
Contributor Author

dbader commented Nov 16, 2020

Btw I super appreciate the work you guys are doing here ❤️ I'm using Python-Markdown for realpython.com (also your amazing PyMdown extensions @facelessuser) and it's a pleasure using this library. Let me know if I can provide additional info here to help track this down 🙂

@facelessuser
Copy link
Collaborator

Interestingly this seems to work correctly on Markdown==3.2.2

This is because a new HTML parser was introduced in 3.3

Though the first wave of bugs was the kind I expected, I'm starting to get a little concerned about the new parser. The handling of block elements in inline code is a little troubling, coupled with some of the recent bugs.

As far as the code block part goes, that is one thing the old parser took into consideration. It understood that it didn't have all the context it needed as Python Markdown actually takes multiple passes while some other parsers tokenized everything in one pass.

Unfortunately, since we do not tokenize everything in one pass, I really do think block HTML logic should only come into play when the block tag is at the start of a line. We should only process inline tags once we've processed a Markdown block with the code step.

It may be that we pulled the trigger too soon on the HTML parser, but I understand why we did as at the time it was passing all the known tests. We are running into scenarios that we just didn't have tests for that we probably should have.

I'm curious about @waylan's opinion on the recent issues, and how we should move forward with the latest HTML parser.

@waylan
Copy link
Member

waylan commented Nov 16, 2020

@facelessuser, I agree with and share your concerns and assessment. I thought we had good test coverage. However, it is looking more and more like that is not the case.

We should only process inline tags once we've processed a Markdown block with the code step.

While that would be an ideal approach, we can't tell the HTML parser to only parse this and ignore that. It parses everything we pass it. What we have done is then check each token and if it is not a block-level tag, handle it differently. Apparently, there are some cases we aren't covering.

If we really want to not parse non-block level HTML at all at this stage, then we need to abandon use of html.parser.

@waylan
Copy link
Member

waylan commented Nov 16, 2020

BTW, I'm not seeing the reported behavior. Instead I get this output:

<div>
<p>`</p>
<h1>escaped</h1>
<p>`</p>
</div>

And without the md_in_html extension, I get:

<div markdown="1">
`<h1>escaped</h1>`
</div>

which is correct. My guess is that the extension fails to replicate the logic in the core which accounts for the tag not being at the start of the line.

@waylan waylan changed the title markdown.extensions.md_in_html: broken monospace HTML escaping md_in_html: broken code span Nov 17, 2020
@waylan
Copy link
Member

waylan commented Nov 17, 2020

Okay, this is really weird. I'm getting different behavior from a string literal than from a normal string.

>>> markdown.markdown('<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["md_in_html"])
'<p><h1><div markdown="block">\n`escaped&lt;/h1&gt;`\n</div></p>'
>>> src = """
... <div markdown="1">
... `<h1>escaped</h1>`
... </div>
... """
>>> markdown.markdown(src, extensions=["md_in_html"])
'<div>\n<p>`</p>\n<h1>escaped</h1>\n<p>`</p>\n</div>'

Turns out the newline before the opening <div> tag is causing different behavior.

>>> markdown.markdown('\n<div markdown="1">\n`<h1>escaped</h1>`\n</div>', extensions=["md_in_html"])
'<div>\n<p>`</p>\n<h1>escaped</h1>\n<p>`</p>\n</div>'

@waylan
Copy link
Member

waylan commented Nov 17, 2020

The bug which is mixing up the order of the elements was introduced in 2766698. Without that commit, we consistently get the output:

<div>
<p>`</p>
<h1>escaped</h1>
<p>`</p>
</div>

@facelessuser
Copy link
Collaborator

Ugh, did I break it? 😞

I'll have to take a look then and see where things went wrong.

@waylan
Copy link
Member

waylan commented Nov 17, 2020

So this is directly related to the issue:

def at_line_start(self):
"""At line start."""
value = super().at_line_start()
if not value and self.cleandoc and self.cleandoc[-1].endswith('\n'):
value = True
return value

That if statement on lines 93 & 94 prevents the issue from happening in the case where a newline precedes the div. Removing those 2 lines causes the error to occur consistently everywhere. However, before those lines were added, the error didn't occur, so something else in 2766698 introduced the error. It was only hidden by lines 93 & 94. I suspect that if we can remove the error, we can completely remove lines 93 & 94.

@waylan
Copy link
Member

waylan commented Nov 17, 2020

if not value and self.cleandoc and self.cleandoc[-1].endswith('\n'):

I never understood why you added that line. And thinking about it now, it still doesn't make sense.

For example, suppose a starttag is at the begging of the document. Then self.cleandoc would be empty, causing the entire statement to be False, but it should still equate to True. In fact, this is exactly what is causing the immediate issue.

super().at_line_start uses self.rawdata (the original source), not the processed output to determine position. Shouldn't that be what we use here? In what scenario should we act as if we are at the start of a line when we are not in the original source?

@facelessuser
Copy link
Collaborator

facelessuser commented Nov 17, 2020

I'll take a look and reevaluate. I'll have to refresh myself on the issue I was trying to avoid.

@facelessuser
Copy link
Collaborator

For example, suppose a starttag is at the begging of the document. Then self.cleandoc would be empty, causing the entire statement to be False, but it should still equate to True. In fact, this is exactly what is causing the immediate issue.

super().at_line_start uses self.rawdata (the original source), not the processed output to determine position. Shouldn't that be what we use here? In what scenario should we act as if we are at the start of a line when we are not in the original source?

While I don't have data right now, I do know I was seeing an issue. I think at_line_start maybe wasn't always set when expecting. I guess it is possible though that I was mistaken. I'll know better when I take a closer look.

@waylan
Copy link
Member

waylan commented Nov 17, 2020

I worked out the issue. You were trying to account for tails. Rather than following the method used in the core, you devised a different approach. I have addressed both that and the present issue in #1069. Although, at present, there are still a few failing tests.

@facelessuser
Copy link
Collaborator

Awesome. Yeah, there are still some things I wasn't sure about with the new parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants