-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Since 1.17.1 CDATA being added around other CDATA in XHTML documents #2078
Comments
I feel this is the crux of the issue. Logically there are 3 seperate node types in play - a TextNode, a DataNode, and a CDataNode. Each are leaf nodes -- logically a TextNode (or DataNode) could not contain a CDataNode. Either the parser parses each individually (so perhaps if this were in a In a HTML parse, the contents of a And if I run your original HTML in Chrome, I actually get a Javascript syntax error:
As it's also not tokenising the Back in the olden days I recall there was special handling in the script flow that would parse the CData section specifically (as there is for other markup types - CData section). Have you got an example URL with script data like this (with Cdata) that a browser executes? I would like to inspect to see if something else is supposed to kick it into XML parse mode. Otherwise, I'm not sure how to appropriately handle the situation. |
Hi @jhy, Here is an example of a full XHTML using CDATA properly working on browsers:
The rationale about this style of CDATA usage is that when using XHTML, for the document to be both compatible with HTML and XML parsers it is good practice to use a commented CDATA on the <script>, as it is common for scripts to have special characters that would require escaping in XML but would not be escaped in HTML. This way those special characters does not need to be escaped and become compatible in both XML (that will take CDATA into account) and HTML (that does not take CDATA into account but will just ignore it as it is commented). As a reference of this I present you the Appendix A of XHTML media type document that describes compatibility guidelines ("This appendix summarizes design guidelines for authors who wish their XHTML documents to render on both XHTML-aware and modern HTML user agents"). The item A.4 of this appendix describes this style of CDATA usage to obtain such compatibility with embedded script and style tags. Here is a link for this item: https://www.w3.org/TR/2009/NOTE-xhtml-media-types-20090116/#C_4 With the changes made in 1.17.1, the use of jsoup to process XHTML (using jsoup to make some modifications on the XHTML for example) and then using the resulting source-code read from jsoup now generates a XHTML that fails as both XML and HTML due to that added CDATA in the script element. Until jsoup 1.16.2 I could do this without any problem. |
Thanks, got it. So in context the browser is not parsing the CData as anything special and just a data literal, but the Javascript parser drops it as the comment. Not sure the best approach here. How are you presenting the content after your trip through jsoup? As HTML or XML? I'm thinking that if you are presenting it as HTML you are not going to want a CData directly after the Maybe the concept of #1720 needs to change to do a deeper introspection and add / preserve a comment node. But then it's moving away from the DataNode just changing its escape manner. And it won't work directly for e.g. Maybe there's another output mode required which is XHTML. But I don't really think that's a good path and think in most cases folks should move away from that. But perhaps that's the implicit goal here. Maybe to help me understand better and weave this path - why are you setting the output syntax to XML? It's not about parsing with an XML parser, right? |
In the end the content is presented on the browser as HTML, but as a XHTML due to the doctype. As its doctype specifies it as XHTML, it is required to follow XHTML specifications so it also passes some code validations, just being able to be rendered by modern browsers while not following the specifications would not be enough for some clients that are stricter in their validations.
I set the output syntax to XML so the generated code is following the XHTML specification. XHTML requires the code to be XML compatible. Although I am not parsing it with a XML parser myself, the specification requires it to be compatible with it. I also have scenario of both type of inputs, XHTML and HTML5. So I basically use jsoup to parse the input no matter which of both types it is, and then after processing, to generate the output, if the input was XHTML I use Syntax.xml, and if the input was HTML5 I use Syntax.html. |
OK so that kind of connects with my thought that a OutputSyntax type of XHTML would make sense in this context. It's neither HTML nor XML. Perhaps a different question would be why have a goal of producing XHTML? As it's been dead since 2005 ~ 2007 at least. (Per your edit, makes sense; not producing new XHTML but trying to be light touch on existing older styles, I assume.) Without introducing a new mode the best I can think of is to have the DataNode see if it's in a @maxfortun could you take a look at this thread? Wondering what context you're using the output and why it doesn't need the |
Yes, new code now uses HTML5 not XHTML. But to keep compatibility with some older code base that uses XHTML we still support it.
Would it be possible for a DataNode that was not modified to just keep its text as is? Like if the decision to introduce a CDATA in it would only be needed if it is modified. |
CDATA in CDATA... What an interesting oversight on my part. If I recall correctly at the time I used this library to convert Google web stories in AMP format, which is xhtml into an xml document for further XSL transformations, and html entities were not being converted properly into their xml equivalents. Instead of escaping the individual characters I just wrapped the whole section in CDATA, and now I see that character conversion is probably a more appropriate way to go... |
If this is any helpful, this is my use-case:
|
I suspect that @maxfortun 's use case, as it uses the output as XML, it would work if the input does not already contain CDATA. But if it did, it would break as it would add another CDATA around the existent one. |
Thanks @rgevaerd and @maxfortun for the discussion - I've fixed this now. It would be great if you could review and test (with a snapshot version) and let me know if any issues. |
@jhy reviewing the code seems good to me. I could not test it, as I tried to get the snapshot from https://jsoup.org/packages/jsoup-1.17.2-SNAPSHOT.jar but it seems to be an old snapshot. Don't know where to get it updated. |
Thanks -- there is no currently published snapshot but you can build one manually quite simply. See "Building from Source" on the download page - the |
@jhy I tested from master and it worked fine. 👍 |
This MR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [flow-bin](https://github.com/flowtype/flow-bin) ([changelog](https://github.com/facebook/flow/blob/master/Changelog.md)) | devDependencies | minor | [`^0.224.0` -> `^0.225.0`](https://renovatebot.com/diffs/npm/flow-bin/0.224.0/0.225.1) | | [org.liquibase.ext:liquibase-hibernate5](https://github.com/liquibase/liquibase-hibernate/wiki) ([source](https://github.com/liquibase/liquibase-hibernate)) | build | patch | `4.25.0` -> `4.25.1` | | [org.liquibase:liquibase-maven-plugin](http://www.liquibase.org/liquibase-maven-plugin) ([source](https://github.com/liquibase/liquibase)) | build | patch | `4.25.0` -> `4.25.1` | | [org.jsoup:jsoup](https://jsoup.org/) ([source](https://github.com/jhy/jsoup)) | compile | patch | `1.17.1` -> `1.17.2` | | [io.hypersistence:hypersistence-utils-hibernate-62](https://github.com/vladmihalcea/hypersistence-utils) | compile | minor | `3.6.1` -> `3.7.0` | | [io.quarkus:quarkus-maven-plugin](https://github.com/quarkusio/quarkus) | build | patch | `3.6.3` -> `3.6.4` | | [io.quarkus:quarkus-universe-bom](https://github.com/quarkusio/quarkus-platform) | import | patch | `3.6.3` -> `3.6.4` | | [org.apache.maven.plugins:maven-compiler-plugin](https://maven.apache.org/plugins/) | build | minor | `3.11.0` -> `3.12.1` | --- ### Release Notes <details> <summary>flowtype/flow-bin</summary> ### [`v0.225.1`](flow/flow-bin@62a43fb...23ec616) [Compare Source](flow/flow-bin@62a43fb...23ec616) ### [`v0.225.0`](flow/flow-bin@e6104a1...62a43fb) [Compare Source](flow/flow-bin@e6104a1...62a43fb) </details> <details> <summary>liquibase/liquibase</summary> ### [`v4.25.1`](https://github.com/liquibase/liquibase/blob/HEAD/changelog.txt#Liquibase-4251-is-a-patch-release) [Compare Source](liquibase/liquibase@v4.25.0...v4.25.1) </details> <details> <summary>jhy/jsoup</summary> ### [`v1.17.2`](https://github.com/jhy/jsoup/blob/HEAD/CHANGES.md#​1172-2023-Dec-29) ##### Improvements - **Attribute object accessors**: Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object. [2069](jhy/jsoup#2069) - **Attribute source tracking**: If source tracking is on, and an Attribute's key is changed ( via `Attribute.setKey(String)`), the source range is now still tracked in `Attribute.sourceRange()`. [2070](jhy/jsoup#2070) - **Wildcard attribute selector**: Added support for the `[*]` element with any attribute selector. And also restored support for selecting by an empty attribute name prefix (`[^]`). [2079](jhy/jsoup#2079) ##### Bug Fixes - **Mixed-cased source position**: When tracking the source position of attributes, if the source attribute name was mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. [2067](jhy/jsoup#2067) - **Source position NPE**: When tracking the source position of a body fragment parse, a null pointer exception was thrown. [2068](jhy/jsoup#2068) - **Multi-point emoji entity**: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. [2074](jhy/jsoup#2074) - **Selector sub-expressions**: (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of `parent [attr=va]`, causing incorrect selections. The fix includes a EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. [2073](jhy/jsoup#2073) - **XML CData output**: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData section. [2078](jhy/jsoup#2078) - **Thread safety**: The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. [2088](jhy/jsoup#2088) *** Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in [change-archive.txt](./change-archive.txt). </details> <details> <summary>vladmihalcea/hypersistence-utils</summary> ### [`v3.7.0`](https://github.com/vladmihalcea/hypersistence-utils/blob/HEAD/changelog.txt#Version-370---December-18-2023) \================================================================================ Oracle Interval Type does not support negative intervals [#​682](vladmihalcea/hypersistence-utils#682) Return original object if target and original are the same when merging [#​677](vladmihalcea/hypersistence-utils#677) Add a hypersistence-utils-hibernate-63 module for Hibernate 6.3 [#​657](vladmihalcea/hypersistence-utils#657) </details> <details> <summary>quarkusio/quarkus</summary> ### [`v3.6.4`](https://github.com/quarkusio/quarkus/releases/tag/3.6.4) [Compare Source](quarkusio/quarkus@3.6.3...3.6.4) ##### Complete changelog - [#​37808](quarkusio/quarkus#37808) - CLI - Rework how missing commands are detected - [#​37803](quarkusio/quarkus#37803) - Dev mode: add null checks to TimestampSet.isRestartNeeded() - [#​37798](quarkusio/quarkus#37798) - Only update ~/.docker/config.json if it exists - [#​37787](quarkusio/quarkus#37787) - Take priority into account in ConfigurationImpl - [#​37775](quarkusio/quarkus#37775) - Docs: fix typo in rabbitmq reference documentation - [#​37770](quarkusio/quarkus#37770) - Add SequencedCollection to BANNED_INTERFACE_TYPES - [#​37768](quarkusio/quarkus#37768) - Running application build with JDK21 and target Java 17 crash with NoClassDefFoundError: java/util/SequencedCollection - [#​37731](quarkusio/quarkus#37731) - Query logging is being done in io.quarkus.mongodb.panache.common.runtime.MongoOperations - [#​37723](quarkusio/quarkus#37723) - Do not use CSRF cookie as the next token value - [#​37717](quarkusio/quarkus#37717) - Docs: Fix incorrect link reference in Cross-Site Request Forgery Prevention guide - [#​37714](quarkusio/quarkus#37714) - Remove the driver property in the documentation for Cloud SQL - [#​37710](quarkusio/quarkus#37710) - Use NoStackTraceException in metrics - [#​37677](quarkusio/quarkus#37677) - Bump io.quarkus:quarkus-platform-bom-maven-plugin from 0.0.100 to 0.0.101 - [#​37654](quarkusio/quarkus#37654) - Make sure dev mode is properly written in doc - [#​36848](quarkusio/quarkus#36848) - CSRF Token with HMAC signature gets double signed </details> <details> <summary>quarkusio/quarkus-platform</summary> ### [`v3.6.4`](quarkusio/quarkus-platform@3.6.3...3.6.4) [Compare Source](quarkusio/quarkus-platform@3.6.3...3.6.4) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever MR is behind base branch, or you tick the rebase/retry checkbox. 👻 **Immortal**: This MR will be recreated if closed unmerged. Get [config help](https://github.com/renovatebot/renovate/discussions) if that's undesired. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNC4yNC4wIiwidXBkYXRlZEluVmVyIjoiMzQuMjQuMCJ9-->
This test below using XHTML passes on 1.16.2 and fails on 1.17.1 because it adds another CDATA, generating something like:
...<script><![CDATA[<![CDATA[...]]>]]></script>
and such code causes a javascript error on a browser due to the duplicated CDATA.I suspect this problem was introduced by #1720
Please note that adding CDATA only if the text node doesn't start with
<![CDATA[
will not be enough to fix this, because the<![CDATA[
expression can be in the middle of the text node. A text node can even have multiple CDATA nodes. But a CDATA node cannot have another CDATA node in it.The text was updated successfully, but these errors were encountered: