Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unclosed tag in code span #1066

Closed
iamvdo opened this issue Nov 12, 2020 · 19 comments · Fixed by #1072
Closed

unclosed tag in code span #1066

iamvdo opened this issue Nov 12, 2020 · 19 comments · Fixed by #1072
Labels
bug Bug report.

Comments

@iamvdo
Copy link

iamvdo commented Nov 12, 2020

Hello,

I'm facing a problem, not sure where the problem really is (I'm not a Python developer), but I managed to create a little use case showing it.

Using fresh install mkdocs (1.1.2) and markdown 3.3.3 (problem is not here with markdown 3.2.2)

This .md file works

A text

Another text `<?php`.

<div>
hello
</div>

Adding a new line, and it no longer works with error AttributeError: 'NoneType' object has no attribute 'end'

A text

New line

Another text `<?php`.

<div>
hello
</div>

It also fails with other combinations of code tags containing HTML and/or PHP. But this one is the smaller use case I've found.

Thanks!

@facelessuser
Copy link
Collaborator

There seem to be two problems here.

  1. The line offset function expects to always match something, but if it doesn't, it fails. There are probably a couple of ways to fix this, but one way is simply to check for None:

    @property
    def line_offset(self):
        """Returns char index in self.rawdata for the start of the current line. """
        if self.lineno > 1:
            m = re.match(r'([^\n]*\n){{{}}}'.format(self.lineno-1), self.rawdata)
            return 0 if m is None else m.end()
        return 0
  2. After fixing this, another problem becomes apparent, at least with the second example. Usint this source:

    import markdown
    
    src1 = """\
    A text
    
    Another text `<?php`.
    
    <div>
    hello
    </div>
    """
    
    src2 = """\
    A text
    
    New line
    
    Another text `<?php`.
    
    <div>
    hello
    </div>
    """
    
    
    def test(md_text):
        html = markdown.markdown(
            md_text,
            # Same result with or without md_in_html
            extensions=["markdown.extensions.md_in_html"],
        )
        print(html)
    
    
    test(src1)
    print('------')
    test(src2)

    We can see that in the second output the <?php content is posted twice.

    <p>A text</p>
    <p>New line</p>
    <p>Another text <code>&lt;?php</code>.</p>
    <p><div>
    hello
    &lt;?php`.</p>
    <div>
    

I'm not sure what the issue is here as I don't have time to dig further right now. It looks like we may be attempting to handle <?php, but we probably shouldn't be as it isn't at the start of a line, etc.

I would definitely argue this is a bug.

@facelessuser facelessuser added the bug Bug report. label Nov 12, 2020
@facelessuser
Copy link
Collaborator

@waylan while I believe that preprocessing instructions need some tweaks, but looking at this inline code case, I am slightly curious about your thoughts on this case:

some text `some html code we want to wrap <body>
<div>text</div></body>`

Part of the idea in the rewrite was to fix generally buggy handling, but it also added the new functionality where if a tag will be treated as HTML if it is at the start of the line with no empty line before it. The old processor didn't do this, and I think I'm starting to understand why it didn't.

Looking at different parsers, handling seems to be all over the place:

https://johnmacfarlane.net/babelmark2/?normalize=1&text=some+text+%60some+html+code+we+want+to+wrap+%3Cbody%3E%0A%3Cdiv%3Etext%3C%2Fdiv%3E%3C%2Fbody%3E%60%0A

In the end, there may not be a "right" way, just curious if this case had been considered.

@facelessuser
Copy link
Collaborator

I don't necessarily think the above question demonstrates an additional problem, just curious if this is the intention or not.

@waylan
Copy link
Member

waylan commented Nov 12, 2020

but it also added the new functionality where if a tag will be treated as HTML if it is at the start of the line with no empty line before it. The old processor didn't do this, and I think I'm starting to understand why it didn't.

Actually, that is unrelated to the issue. It would have actually been more work to replicate the old behavior. The fact that no blank line is required is a red herring. The examples provided here behave the same whether there is a blank line or not.

The issue is that the html parser doesn't care about {insignificant in HTML) whitespace. In Markdown, a raw HTML block can only start at the beginning of a line, whereas in HTML the block can begin anywhere. The trick is getting the parser to ignore everything that Markdown does not see as a valid opening block tag. That means we need to anticipate every scenario in which the html parser should ignore an opening block tag and force the parser to ignore it. Apparently this is an edge case we missed. Although I was sure I had a test for this specific case. I can't find it now though, so perhaps not.

I am slightly curious about your thoughts on this case:

some text `some html code we want to wrap <body>
<div>text</div></body>`

I would have expected the only reasonable rendering to be:

<p>some text <code>some html code we want to wrap &lt;body&gt;
&lt;div&gt;text&lt;/div&gt;&lt;/body&gt;</code>
</p>

However, I see that markdown.pl 1.0.2b8 outputs the same as we do:

<p>some text `some html code we want to wrap
    
    <body>
</p>
<div>text</div>
<p>
    </body>`</p>

So was this a situation where I intentionally changed the behavior or was this accidental? I don't recall. I also see that Commonmark almost matches this behavior (they leave off the wrapping <p> tag). While we aren't a Commonmark parser, I'm curious how this behavior is justified in their spec. Interestingly, the behavior changes significantly when we eliminate the multi-line code span.

In the end, I think that your example is a completely different issue that the original report.

@facelessuser
Copy link
Collaborator

facelessuser commented Nov 12, 2020

Yes, I agree it was only related by the fact that I starting thinking about this when I saw this code related issue. Mainly that HTML blocks are handled before inline code, but yes a separate issue (if even an issue at all).

I suspect that this current issue (unrelated to what I brought up) is a problem with differences in how preprocessors are handled vs block tags, but I haven't looked that deep into this, only confirmed that this appears to be a bug.

@waylan
Copy link
Member

waylan commented Nov 12, 2020

The problem here is that <?php is not a complete tag. There is no closing bracket and the parser is not smart enough to recognize that. Consider this basic test using Python's EventCollector which outputs a list of "tokens" identified by the parser:

from test.test_htmlparser import EventCollector

src="""
A text

Another text `<?php`.

<div>
hello
</div>
"""

ec = EventCollector()
ec.feed(src)
ec.close()
print(ec.get_events())

The output is:

[
    ('data', '\nA text\n\nAnother text `'), 
    ('pi', 'php`.\n\n<div'), 
    ('data', '\nhello\n'), 
    ('endtag', 'div'), 
    ('data', '\n')
]

Notice the second token, which is a processing instruction (pi). The data for that token should be php, but it is php`.\n\n<div. It consumes everything until the first > is found. Of course, for block level HTML, that is what we want. However we are in a code span. There is no reason to expect valid HTML.

We have a similar issue with this input:

text `<div`.

<div>
hello
</div>

We get:

[
    ('data', '\ntext `'), 
    ('starttag', 'div`.', [('<div', None)]), 
    ('data', '\nhello\n'), 
    ('endtag', 'div'), 
    ('data', '\n')
]

The starttag is all of <div`.\n\n<div>. Of course, the tag name is div`. (including the backtick and period) and everything after the whitespace (two line breaks in this case) is parsed as attributes. Naturally, <div doesn't parse as any attributes, so they are None. However, in the process, \n\n<div is completely lost by the parser.

This is completely an issue with how the parser works and not the fault of our work. Unless we can find a way to change the way the underlying parser works, we would need to completely abandon using the HTML parser. If we had caught this before releasing 3.3, this would have been a blocker to the entire thing.

@waylan
Copy link
Member

waylan commented Nov 12, 2020

\n\n<div is completely lost by the parser.

And this is why line_offest fails. Two newlines are removed from the document, causing the line offset to be wrong. Presumably there should always be at least as many newlines in a document as the value of self.lineno. With that presumption, we can always assume that a match will occur. But when some newlines get lost, that is no longer the case. However, if we address the root cause, then this would never occur and would be a non-issue.

@facelessuser
Copy link
Collaborator

So they're is no way to abandon an incomplete tag? Are you implying this isn't solvable?

@waylan
Copy link
Member

waylan commented Nov 12, 2020

So they're is no way to abandon an incomplete tag? Are you implying this isn't solvable?

This would not be solvable with the public API of HtmlParser. It would require overriding the methods in the parent class, which I have been trying to avoid. I haven't actually dug into the code to see how extensive of a change it would be. A few similar issues were resolved by monkey-patching a regex. But this is dependent upon whether we are in a raw block or not, which means it depends on the status of inraw, which is unique to our implementation. We are now mixing the two layers in ways we have thus far avoided.

@waylan
Copy link
Member

waylan commented Nov 12, 2020

Is there any reason to support/allow a backtick to be in a tag at all (tag name or attributes)? Perhaps if we simply disallowed backticks within tags (between < and >), then we could abandon a tag when it contains a backtick. That might be doable by overriding a few regex. I haven't looked at the code yet, Just a thought.

@facelessuser
Copy link
Collaborator

You can't really be sure what will be in an attribute.

@facelessuser
Copy link
Collaborator

Additionally, a preprocessing statement could have anything in it.

@waylan
Copy link
Member

waylan commented Nov 13, 2020

Right. Both good points. Scratch that idea.

@iamvdo
Copy link
Author

iamvdo commented Nov 13, 2020

I'm adding a second use case, where parser fails, but silently (no error).

If code inside a div with markdown_extra contains a HTML block element, it fails (p, div, ul, etc.). It seems that block element is generated.

Fails

<div markdown="1">
A `<p>`
</div>

With an HTML inline element, no problem. Also fails with <script> (at least)

Fails in 3.3.3 and works in 3.2.1

Is it related, or should I open a new issue?

@facelessuser
Copy link
Collaborator

<div markdown="1">
A `<p>`
</div>

I would expect these cases to be treated just like any span element. It looks like it is still trying to treat these as block elements, but if they don't have a newline, I think it shouldn't get block logic.

While not entirely the same as the opening post, I do feel there is a bit of a relation. I don't think preprocessing statements should be treated as such if they are not at the start of the line either and should be evaluated similarly to spans.

My concern is that it is easy to parse the HTML tags when they are on their own line, but when they are in the middle of a paragraph, we have no context to know whether they are a tag or raw content until we are in the inline parser. This is due to the way Python Markdown is designed. Other parsers are designed a bit differently and may have an easier time with this scenario than we do.

Can we treat block elements differently if they are not at the start of a line?

@waylan
Copy link
Member

waylan commented Nov 18, 2020

The second case reported by @iamvdo is a separate issue which was reported in #1068. Please, let's not mix issues here. To be clear, this issue specifically deals with code spans in which a tag does not contain a closing bracket, for example <p rather than <p>. Issues parsing code spans which include the closing bracket are the subject of #1068 and will need a very different fix than this one.

waylan added a commit to waylan/markdown that referenced this issue Nov 18, 2020
waylan added a commit that referenced this issue Nov 18, 2020
This reverts part of 2766698 and re-implements handling 
of tails in the same manner as the core.

Also, ensure line_offset doesn't raise an error on bad input
(see #1066) and properly handle script tags in code
spans (same as in the core).

Fixes #1068.
@waylan
Copy link
Member

waylan commented Nov 18, 2020

Note that the fix to #1068 (in #1069) addresses the issue with line_offset raising an error. Therefore, we no longer get an error here, which is good. It also changed the output of the examples here, although they are still wrong.

text `<div`.

<div>
hello
</div>

now outputs:

<p>text <code>&lt;div</code>.</p>
<p><div>
hello
</div></p>

Note that the actual div is wrapped in a <p>. However, the incomplete tag in the code span appears to be parsed correctly. The same thing happens with the php example in the original report. We get this output:

<p>A text</p>
<p>Another text <code>&lt;?php</code>.</p>
<p><div>
hello
</div></p>

While that is clearly wrong, it is a lot less alarming that the previous behavior. It is pretty clear to me that the reason is as outlined in this comment. That said, the issue in #1068 was certainly adding additional complexity to the problem. So now that that is out of the way, we can focus on a fix for this issue.

To break down the issue, let's use the div example. The starttag consumes all of <div`.\n\n<div>. However, after the fix to #1068, the code span is correctly recognized as such and the text (obtained by calling get_starttag_text and avoids the data loss described above) is treated as data rather than an opening tag to a block. Of course, the problem is that an actual opening tag to a block is included in the end of that but it never gets parsed as such. Therefore, the state of the parser is not in a block when it encounters the closing </div> of that block. For that reason, the closing tag is also treated as data/inline HTML. which results in the preprocessor returning the same document that we started with. The tags then get handled as inline HTML by the inline processor.

As an aside, the behavior is exactly the same with or without the md_in_html extension as that extension now matches the core (after 1068 was fixed).

@waylan
Copy link
Member

waylan commented Nov 19, 2020

In addition to `<div`, `<?`, we also need to account for `<!`, which is either a declaration or 'bogus comment' (a 'bogus comment' is a comment which does not include 2 hyphens with the opening and closing brackets). This issue does not seem to exist with valid comments (which use the 2 hyphens ).

In my initial attempt to address the issue, I found a solution for the <? case:

+    def parse_pi(self, i):
+        if self.at_line_start() or self.intail:
+            return super().parse_pi(i)
+        # This is not the beginning of a raw block so treat as plain data
+        # and avoid consuming any tags which may follow (see #1066).
+        self.handle_data('<?')
+        return i + 2

And it works well in our simple test cases. However, this means that we can never have a processing instruction which doesn't start a line, even within a raw block. For example, consider this:

<div markdown="1"><?php print("foo"); ></div>

Yes, that works with the default behavior. handle_pi just calls handle_data anyway in that case. However, with md_in_html, handle_pi is building an etree object and handle_pi needs to create an etree processing instruction. This fix makes that impossible unless we also override the method in the extension and add a test for self.mdstack.

The same concerns exist for any similar fix for the other cases (<div and <!).

As an aside, I discovered another bug, which is detailed in #1070. Until that is resolved, the concerns raised in this comment are moot for PIs, although they are still valid for the other cases.

It occurs to me that it would be surprising to me for someone to expect to have valid PHP embedded within the HTML generated by Markdown. However, I have seen people try crazier things. And there should be no reason why we would intentionally prevent it from working. Using PIs was simply the easiest way to demonstrate the problem with the proposed fix.

In the end I suspect a better solution would be to only run the HTML parser on raw HTML blocks from within the block parser. I tried that in the early commits in #803. However, I reverted back to a preprocessor in 7a8a6b5 as I was encountering too many complications. I wonder if that would be less of an issue now that the parser covers more edge cases than it did at the time.

@waylan waylan changed the title htmlparser.py sometimes fail unclosed tag in code span Nov 19, 2020
@waylan
Copy link
Member

waylan commented Nov 19, 2020

Working on #1070 I was reminded that we have modified processing instructions to require them to end with ?>. A tag which starts with <? and ends only with > is not a tag at all. Instead, the parser falls back to passing it as text data to handle_data. Of course, the same issue exists in that <?php`.\n\n<div> is passes as data when we should only be passing <div`.\n\n as data. As that fallback only occurs after failing to find a closing ?>, we could override the fallback to end immediately before < or even at a backtick.

However, that doesn't address the other cases (<div and <!). Ideally a more general solution could address all three. Besides that fallback is buried right in the middle of the internal method goahead, which I don't really want to override.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants