Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closing tag is missing within <![CDATA #571

Closed
bytecode77 opened this issue Oct 11, 2024 · 17 comments
Closed

Closing tag is missing within <![CDATA #571

bytecode77 opened this issue Oct 11, 2024 · 17 comments
Assignees

Comments

@bytecode77
Copy link

1. Description

Closing tag is missing within <![CDATA object.

Background: This issue surfaced when parsing HTML <code> blocks. The HTML content is given by Confluence, where the HTML is parsed from.

3. Fiddle or Project

https://dotnetfiddle.net/XmcOB6

Expected Result

<![CDATA[<a href="foo">
bar
</a>]]>

Actual Result

<![CDATA[<a href="foo">
bar
]]>

4. Any further technical details

  • HAP version: 1.11.67
  • NET version 7.0
@JonathanMagnan JonathanMagnan self-assigned this Oct 11, 2024
@JonathanMagnan
Copy link
Member

JonathanMagnan commented Oct 11, 2024

Hello @bytecode77 ,

Looking at Firefox and Chrome, I believe the current actual result from HAP is the right one:

FireFox:

image

The "comment" section end at the first occurrence of the > found. This means the end tag </a> is considered part of the HTML, but an end tag alone cannot really exist, so it gets removed.

I'm not exactly sure of all the rules that is currently applied, but the current actual result is indeed what I'm expecting.

Let me know if that explains the reason correctly.

Best Regards,

Jon

@bytecode77
Copy link
Author

Thanks for your quick response Jonathan!

For a comment, I think the end tag is not useful to the DOM. However, the code does not represent a comment:

<code><![CDATA[<a href=\"foo\">\r\nbar\r\n</a>]]></code>

Specifically, this code was retrieved by the Confluence API when reading Confluence pages. I'm not sure what the purpose of <![CDATA[ here is, though.

@JonathanMagnan
Copy link
Member

However, the code does not represent a comment:

Indeed, you are right. I just assumed this, looking at the result, but that's not the case.

I never had to really use the <![CDATA[ as far as I remember, but looking at what I see on internet, it looks more related to be used within a script tag but not exclusively to this.

At this moment, it still looks like the current behavior looks more like the normal behavior unless I'm proving wrong. Again, I'm not familiar with this tag, so I could definitely be wrong.

Best Regards,

Jon

@bytecode77
Copy link
Author

This is the original HTML that is a Confluence page export that I'm parsing.

It does contain a <![CDATA[ within the text-body of a <ac:structured-macro ac:name="code" and it represents a Confluence code box.

Yes, it was used within <script> tags way before javascript was common to not offend non-supporting browsers. But it remains valid HTML that is, indeed, used:

<ac:structured-macro ac:name="code" ac:schema-version="1" ac:macro-id="70aacf91-111a-4b25-8c3c-543aa6fd0af9">
	<ac:plain-text-body>
		<![CDATA[<a href="foo" target="_blank">
  bar
</a>]]>
	</ac:plain-text-body>

@JonathanMagnan
Copy link
Member

Hello @bytecode77 ,

Thank you for the additional info. I have looked at the HAP code, and the <!CDATA[ tag is not supported. The current behavior is more a combination of "Comment" and "Text" nodes that show the same result as Firefox.

I would not like to change the current default behavior, but I'm open to looking more at it to support it the way you want through an option that you will need to enable.

I should be able to look more at it later this week

Best Regards,

Jon

@bytecode77
Copy link
Author

Thanks for looking into it, Jon!

I'll stay tuned for your updates :)

@JonathanMagnan
Copy link
Member

Hello @bytecode77 ,

A new version has been released today: https://github.com/zzzprojects/html-agility-pack/releases/tag/v1.11.68

Could you try the new option and let us know if everything is working as expected.

Best Regards,

Jon

@bytecode77
Copy link
Author

Hi Jonathan!

thank's for providing a new version! However, unfortunately the result is the same as before, see the dotnetfiddle that I posted initially.

To provide a more concrete example: When importing HTML from Confluence, this is what the HTML string looks like:

...
<ac:plain-text-body>
<![CDATA[<iframe src="xxxxxx" width="600" height="1000" frameborder="0">
  <a href="xxxxx" target="_blank">
    xxxxx
  </a>

</iframe>]]>
</ac:plain-text-body>
...

It's a code box showing some HTML.

However, the InnerHtml property is missing the closing tag:

<![CDATA[<iframe src="https://www.terminland.de/hno/?mode=frame" width="600" height="1000" frameborder="0">
  <a href="https://www.terminland.de/hno" target="_blank">
    Online-Terminbuchung
  </a>

]]>

@JonathanMagnan
Copy link
Member

Hello @bytecode77 ,

Fiddle currently still uses the previous version of HAP and not the latest one (don't ask me why!).

You need to turn on the following options to make it works:

HtmlDocument document = new HtmlDocument();
document.OptionThreatCDataBlockAsComment = true;

I tested both your examples (the first one and the one you just posted), and both seem to work very fine with this option.

Best Regards,

Jon

@bytecode77
Copy link
Author

Ah, I see. However, unfortunately the option OptionThreatCDataBlockAsComment is not included in your latest build. I confirmed this by installing 1.11.68 in a completely blank project on another computer to make sure that this isn't a nuget cache issue of some sort.

@JonathanMagnan
Copy link
Member

Oh god my bad, It looks like I did a bad release!

The v1.11.69 is now available. I double-checked to ensure I had not made the same mistake twice 😆

Best Regards,

Jon

@JonathanMagnan
Copy link
Member

Let me know if this version was working

@bytecode77
Copy link
Author

Works like a charm! Thanks for your support, Jonathan, and keep up the nice work. I really love HtmlAgilityPack and I've been using it for many years.

Bye!

@wo80
Copy link

wo80 commented Oct 25, 2024

@JonathanMagnan Regarding the spelling of OptionThreatCDataBlockAsComment: you probably meant treat and not threat ;-)

@bytecode77
Copy link
Author

You're right, wo80, didn't notice :D

@JonathanMagnan I think if you change the typo and break upward compatibility, there is only one person (me) who would be affected, but that's no problem. When I update some time in the future I will just change the wording on the calling site, too.

@JonathanMagnan
Copy link
Member

JonathanMagnan commented Oct 25, 2024

Do'h! You are 100% right @wo80

It will be changed the next time we release HAP, which will probably happen next month.

@JonathanMagnan
Copy link
Member

Hello,

The v1.11.70 is now available with the option renamed as proposed by @wo80 ;)

Best Regards,

Jon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants