Dom: Proper handling for Unicode in HTML and XML documents; more consistent export #3887

lukasbestle · 2021-10-31T21:34:52Z

Describe the PR

In the end it was a combination of @nilshoerrmann's suggestion and what I had in mind. Thank you Nils for the inspiration!

Modifications were needed in two different places because:

1. `libxml` needs to be told to parse the input as UTF-8. A `<meta>` tag would create an empty `<head>` and converting everything to entities is a hack. The XML declaration works great and is the easiest to remove right afterwards.
2. If the `<meta>` tag is used and kept in the document, additional processing (e.g. in `$dom->sanitize()`) will think that it's part of the input. So everything we add needs to be removed immediately after parsing.
3. On output we need the `<meta>` tag again as it's the only way to avoid an export as `ISO-8859-1` with entities.
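
The parsing side of the points above can be sketched roughly as follows. This is a minimal illustration built on PHP's `DOMDocument`, not the actual `Dom` class code from this PR; the function name `parseHtmlAsUtf8` is hypothetical, and the exact behavior may vary between libxml versions.

```php
<?php

/**
 * Hypothetical helper (not part of Kirby): parses an HTML snippet as UTF-8
 * by prepending an XML declaration and removing it again after parsing.
 */
function parseHtmlAsUtf8(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // collect libxml warnings (e.g. for HTML5 tags) instead of emitting them
    $previous = libxml_use_internal_errors(true);

    // the prepended declaration makes libxml read the input as UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // the declaration ends up as a processing instruction node;
    // remove it right away so later processing never sees it
    foreach ($doc->childNodes as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
            break;
        }
    }

    return $doc;
}

$doc = parseHtmlAsUtf8('<p>Grüße, 日本語</p>');
echo $doc->getElementsByTagName('p')->item(0)->textContent . "\n";
```

With this approach the text content of the parsed paragraph should come back unchanged, and the document should no longer contain the helper declaration. The output side (temporarily adding a `<meta>` tag before the export) is not covered by this sketch.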

I also used the opportunity to integrate the output consistency improvements from the Sane classes into Dom to make sure that the Dom::toString() output matches the parsed input as closely as possible.

Release notes (only for RC, not for stable release)

Fixed regressions

Unicode characters in the writer field are no longer converted to HTML entities.

Enhancements

The Toolkit\Dom::toString method now exports the document with the same structure as the input.

Release notes (stable release)

Fixes

The `image/svg` MIME type is now recognized by the Sane classes.

Breaking changes

None

Related issues/ideas

Related to, but does not fix, "Stuck in an unsaved changes loop" #3798.

Ready?

Unit tests for fixed bug/feature
In-code documentation (wherever needed)
CI checks pass

When merging

~~Add to website docs release checklist (if needed)~~
Add changes to release notes draft in Notion

lukasbestle · 2021-10-31T21:35:43Z

@afbora Could you please also validate that the unsaved changes loop is fixed by this? I wasn't able to reproduce it.

afbora · 2021-10-31T21:55:42Z

@lukasbestle Unfortunately 🙁 I think this issue is also related to the writer field.

stuck.mp4

afbora · 2021-10-31T22:00:50Z

It actually saves to content correctly. The following may give you an idea, when you refresh the page:

nilshoerrmann · 2021-11-01T07:17:39Z

Have you thought about putting the API response into the DOM before comparing the values?

afbora · 2021-11-01T07:51:28Z

Replacing special characters in the writer field seems to fix it.

Modifications were needed in two different places because:

1. `libxml` needs to be told to parse the input as UTF-8. A `<meta>` tag would create an empty `<head>` and converting everything to entities is a hack. The XML declaration works great and is the easiest to remove right afterwards.
2. If the `<meta>` tag is used and kept in the document, additional processing (e.g. in `$dom->sanitize()`) will think that it's part of the input. So everything we add needs to be removed immediately after parsing.
3. On output we need the `<meta>` tag again as it's the only way to avoid an export as `ISO-8859-1` with entities.

Fixes #3798.

- HTML snippets are exported with the same structure as the input.
- XML files that were imported without XML declaration now get exported like this as well.
- Both behaviors can be overridden with a new `$normalize` argument for the `toString()` method.

lukasbestle · 2021-11-01T15:26:23Z

@afbora I'm done refactoring the Dom class for output consistency. Now it shouldn't matter what you throw at it, you will always get an output with the same structure. I can now also reproduce the writer field issue and it's still there. So I think it's a separate issue not directly related to Sane.

afbora · 2021-11-01T16:11:48Z

Thank you. While @bastianallgeier is reviewing this PR, I will then create a PR for the solution I have in mind regarding the writer field issue.

lukasbestle added the type: bug 🐛 label Oct 31, 2021

lukasbestle added this to the 3.6.0-rc.3 milestone Oct 31, 2021

lukasbestle requested a review from a team October 31, 2021 21:34

lukasbestle self-assigned this Oct 31, 2021

lukasbestle linked an issue Oct 31, 2021 that may be closed by this pull request

Stuck in an unsaved changes loop #3798

Closed

lukasbestle added 2 commits November 1, 2021 11:26

Dom: Support for lowercase doctypes

84e6491

lukasbestle force-pushed the fix/3798-dom-entities branch from af1ee1f to c97cb3f Compare November 1, 2021 11:26

lukasbestle changed the title ~~Dom: Proper handling for Unicode in HTML documents~~ Dom: Proper handling for Unicode in HTML and XML documents; more consistent export Nov 1, 2021

lukasbestle added 2 commits November 1, 2021 15:41

Sane: Add new alias for SVG MIME type

602a782

lukasbestle force-pushed the fix/3798-dom-entities branch from c97cb3f to d1e112b Compare November 1, 2021 15:21

lukasbestle added 4 commits November 1, 2021 16:28

Dom: Consistent export of trailing newlines

0351498

Dom: Fix export of HTML snippets with doctype

c8c8869

A doctype can only be used in a full document, so the export of the full document needs to be enforced.

Adapt Sane tests to new Dom behavior

50a5855

Adapt Parsley tests to new Dom behavior

54c0d56

lukasbestle force-pushed the fix/3798-dom-entities branch from d1e112b to 54c0d56 Compare November 1, 2021 15:28

bastianallgeier approved these changes Nov 2, 2021

bastianallgeier merged commit 88cbd9b into develop Nov 2, 2021

bastianallgeier deleted the fix/3798-dom-entities branch November 2, 2021 10:31
