Dom: Proper handling for Unicode in HTML and XML documents; more consistent export #3887
@afbora Could you please also validate that the unsaved changes loop is fixed by this? I wasn't able to reproduce it.
@lukasbestle Unfortunately 🙁 I think this issue is also related to the writer field. (Attachment: stuck.mp4)
Have you thought about putting the API response into the DOM before comparing the values?
Replacing special characters in the writer field seems to fix it.
Modifications were needed in two different places because:

1. `libxml` needs to be told to parse the input as UTF-8. A `<meta>` tag would create an empty `<head>` and converting everything to entities is a hack. The XML declaration works great and is the easiest to remove right afterwards.
2. If the `<meta>` tag is used and kept in the document, additional processing (e.g. in `$dom->sanitize()`) will think that it's part of the input. So everything we add needs to be removed immediately after parsing.
3. On output we need the `<meta>` tag again as it's the only way to avoid an export as `ISO-8859-1` with entities.

Fixes #3798.
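The parsing side of points 1 and 2 can be sketched with PHP's plain `DOMDocument`. This is only an illustration under assumptions, not the code from this PR: the helper name `parseHtmlAsUtf8()` is hypothetical, and the exact behavior may vary between libxml versions.

```php
<?php

/**
 * Hypothetical helper (not part of Kirby's `Dom` class):
 * parses an HTML snippet as UTF-8 by prepending a temporary
 * XML declaration and removing it again right after parsing.
 */
function parseHtmlAsUtf8(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    // the declaration only serves as an encoding hint for libxml
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // remove the hint immediately so that later processing
    // does not mistake it for part of the input;
    // copy the live node list first because we modify it
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node instanceof DOMProcessingInstruction) {
            $doc->removeChild($node);
        }
    }

    return $doc;
}

$doc  = parseHtmlAsUtf8('<p>Grüße aus Köln</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

The declaration is used here instead of a `<meta>` tag because it ends up as a single top-level node that is easy to find and remove, which matches the reasoning in point 1.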
af1ee1f to c97cb3f
- HTML snippets are exported with the same structure as the input.
- XML files that were imported without XML declaration now get exported like this as well.
- Both behaviors can be overridden with a new `$normalize` argument for the `toString()` method.
c97cb3f to d1e112b
@afbora I'm done refactoring the
A doctype can only be used in a full document, so the export of the full document needs to be enforced.
d1e112b to 54c0d56
Thank you. While @bastianallgeier is reviewing this PR, I will then create a PR for the solution I have in mind regarding the writer field issue.
Describe the PR
In the end it was a combination of @nilshoerrmann's suggestion and what I had in mind. Thank you Nils for the inspiration!
Modifications were needed in two different places because:

1. `libxml` needs to be told to parse the input as UTF-8. A `<meta>` tag would create an empty `<head>` and converting everything to entities is a hack. The XML declaration works great and is the easiest to remove right afterwards.
2. If the `<meta>` tag is used and kept in the document, additional processing (e.g. in `$dom->sanitize()`) will think that it's part of the input. So everything we add needs to be removed immediately after parsing.
3. On output we need the `<meta>` tag again as it's the only way to avoid an export as `ISO-8859-1` with entities.

I also used the opportunity to integrate the output consistency improvements from the `Sane` classes into `Dom` to make sure that the `Dom::toString()` output matches the parsed input as closely as possible.

Release notes (only for RC, not for stable release)
Fixed regressions
Enhancements
- `Toolkit\Dom::toString` method now exports the document with the same structure as the input.

Release notes (stable release)
Fixes
- `image/svg` MIME type is now recognized by the `Sane` classes

Breaking changes
None
Related issues/ideas
Ready?
When merging
Add to website docs release checklist (if needed)