Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dom: Proper handling for Unicode in HTML and XML documents; more consistent export #3887

Merged
merged 8 commits into from
Nov 2, 2021

Conversation

lukasbestle
Copy link
Member

@lukasbestle lukasbestle commented Oct 31, 2021

Describe the PR

In the end it was a combination of @nilshoerrmann's suggestion and what I had in mind. Thank you Nils for the inspiration!

Modifications were needed in two different places because:

  1. libxml needs to be told to parse the input as UTF-8. A <meta> tag would create an empty <head> and converting everything to entities is a hack. The XML declaration works great and is the easiest to remove right afterwards.
  2. If the <meta> tag is used and kept in the document, additional processing (e.g. in $dom->sanitize()) will think that it's part of the input. So everything we add needs to be removed immediately after parsing.
  3. On output we need the <meta> tag again as it's the only way to avoid an export as ISO-8859-1 with entities.

I also used the opportunity to integrate the output consistency improvements from the Sane classes into Dom to make sure that the Dom::toString() output matches the parsed input as closely as possible.

Release notes (only for RC, not for stable release)

Fixed regressions

  • Unicode characters in the writer field are no longer converted to HTML entities.

Enhancements

  • The Toolkit\Dom::toString method now exports the document with the same structure as the input.

Release notes (stable release)

Fixes

  • The image/svg MIME type is now recognized by the Sane classes

Breaking changes

None

Related issues/ideas

Ready?

  • Unit tests for fixed bug/feature
  • In-code documentation (wherever needed)
  • CI checks pass

When merging

@lukasbestle lukasbestle added this to the 3.6.0-rc.3 milestone Oct 31, 2021
@lukasbestle lukasbestle requested a review from a team October 31, 2021 21:34
@lukasbestle lukasbestle self-assigned this Oct 31, 2021
@lukasbestle lukasbestle linked an issue Oct 31, 2021 that may be closed by this pull request
@lukasbestle
Copy link
Member Author

@afbora Could you please also validate that the unsaved changes loop is fixed by this? I wasn't able to reproduce it.

@afbora
Copy link
Member

afbora commented Oct 31, 2021

@lukasbestle Unfortunately 🙁 I think this issue is also related to the writer field.

stuck.mp4

@afbora
Copy link
Member

afbora commented Oct 31, 2021

It actually saves to content correctly. The following may give you an idea, when you refresh the page:

image (33)

@nilshoerrmann
Copy link
Contributor

Have you thought about putting the api response into the DOM before comparing the values?

@afbora
Copy link
Member

afbora commented Nov 1, 2021

Replacing special characters in the writer field seems to fix it.

Modifications were needed in two different places because:
1. `libxml` needs to be told to parse the input as UTF-8. A `<meta>` tag would create an empty `<head>` and converting everything to entities is a hack. The XML declaration works great and is the easiest to remove right afterwards.
2. If the `<meta>` tag is used and kept in the document, additional processing (e.g. in `$dom->sanitize()`) will think that it's part of the input. So everything we add needs to be removed immediately after parsing.
3. On output we need the `<meta>` tag again as it's the only way to avoid an export as `ISO-8859-1` with entities.

Fixes #3798.
@lukasbestle lukasbestle force-pushed the fix/3798-dom-entities branch from af1ee1f to c97cb3f Compare November 1, 2021 11:26
@lukasbestle lukasbestle changed the title Dom: Proper handling for Unicode in HTML documents Dom: Proper handling for Unicode in HTML and XML documents; more consistent export Nov 1, 2021
- HTML snippets are exported with the same structure as the input.
- XML files that were imported without XML declaration now get exported like this as well.
- Both behaviors can be overridden with a new `$normalize` argument for the `toString()` method.
@lukasbestle lukasbestle force-pushed the fix/3798-dom-entities branch from c97cb3f to d1e112b Compare November 1, 2021 15:21
@lukasbestle
Copy link
Member Author

@afbora I'm done refactoring the Dom class for output consistency. Now it shouldn't matter what you throw at it, you will always get an output with the same structure. I can now also reproduce the writer field issue and it's still there. So I think it's a separate issue not directly related to Sane.

@lukasbestle lukasbestle force-pushed the fix/3798-dom-entities branch from d1e112b to 54c0d56 Compare November 1, 2021 15:28
@afbora
Copy link
Member

afbora commented Nov 1, 2021

Thank you. While @bastianallgeier is reviewing this PR, I will then create a PR for the solution I have in mind regarding the writer field issue.

@bastianallgeier bastianallgeier merged commit 88cbd9b into develop Nov 2, 2021
@bastianallgeier bastianallgeier deleted the fix/3798-dom-entities branch November 2, 2021 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Stuck in an unsaved changes loop
4 participants