Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some maintenance work #77

Merged
merged 7 commits into from
Nov 20, 2022
Merged

Some maintenance work #77

merged 7 commits into from
Nov 20, 2022

Conversation

lyrixx
Copy link
Member

@lyrixx lyrixx commented May 30, 2022

  • Allow org_heigl/hyphenator in ^3.0
  • Simplify composer.json 'scripts' section
  • Drop support for Symfony < 4.4 + Add support for ^6.0
  • Run tests on PHP 8.1

@lyrixx
Copy link
Member Author

lyrixx commented May 30, 2022

there is a failure on PHP 8.1, but I don't know how to fix it:

1) JoliTypo\Tests\EnglishTest::testFixFullText
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
 <h3>Pronun&shy;ci&shy;ation</h3>\n
 \n
 <p>A humor&shy;ous image announ&shy;cing the launch of a White House Tumblr suggests pronoun&shy;cing GIF with a hard &ldquo;G&rdquo;.</p>\n
-<p>The creat&shy;ors of the format pronounced GIF as &ldquo;Jif&rdquo; with a soft &ldquo;G&rdquo; /&#712;d&#658;&#618;f/ as in &ldquo;gin&rdquo;.</p>\n
-<p>An altern&shy;at&shy;ive pronun&shy;ci&shy;ation with a hard &ldquo;G&rdquo; /&#712;&#609;&#618;f/ as in &ldquo;graph&shy;ics&rdquo;, reflect&shy;ing the expan&shy;ded acronym, is in wide&shy;spread usage.</p>\n
+<p>The creat&shy;ors of the format pronounced GIF as &ldquo;Jif&rdquo; with a soft &ldquo;G&rdquo; /&Euml;&circ;d&Ecirc;&rsquo;&Eacute;&ordf;f/ as in &ldquo;gin&rdquo;.</p>\n
+<p>An altern&shy;at&shy;ive pronun&shy;ci&shy;ation with a hard &ldquo;G&rdquo; /&Euml;&circ;&Eacute;&iexcl;&Eacute;&ordf;f/ as in &ldquo;graph&shy;ics&rdquo;, reflect&shy;ing the expan&shy;ded acronym, is in wide&shy;spread usage.</p>\n
 <p>Both pronun&shy;ci&shy;ations are acknow&shy;ledged by the [&hellip;] Merriam-Webster&rsquo;s Collegi&shy;ate Diction&shy;ary.</p>\n
 \n
 <p>We also have &ldquo;<span>HTML in quote</span>&rdquo; to fix&hellip;</p>'

@HedicGuibert
Copy link
Member

The line causing the bug is this one :
https://github.com/jolicode/JoliTypo/blob/master/src/JoliTypo/Fixer.php#L311

For an uknown reason (from me at least), the behaviour of this function changed in PHP 8.1.
I tried running this code in both PHP 8.0 and PHP 8.1

$tofix = "/ˈdʒɪf/";

echo(mb_detect_encoding($tofix) . PHP_EOL);

mb_detect_order('ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15');
echo(mb_detect_encoding($tofix));

In PHP 8.0 I get

UFT-8
UTF-8

In PHP 8.1 I get

UTF-8
Windows-1252

Sandbox showing the issue : https://onlinephp.io/c/5f836

@HedicGuibert
Copy link
Member

So after doing some more research I found out that the function that has changed is not mb_encoding_order but mb_detect_encoding. In this bug report we can see the following answer from Alex Dowad :

Question :

I'm guessing that in earlier versions there was no such thing as demerits (or the algo was even more different), so the provided priority list had more impact on the result. Is that right?

Answer :

Yep. The earlier versions of mb_detect_encoding only checked which encodings
the input string was valid in, and picked the first one on the list. If the string was valid in more than one encoding, it did not do anything at all to try to figure out which one was most likely.

So now mb_detect_encoding just returns the most likely encoding and doesn't care about the order they were given. This is confirmed by this comment : php/php-src#8279 (comment)

Now it also applies heuristics to detect which of the valid text encodings in the specified list (if there are more than one) is most likely to be correct.

This is not what we want to do in this library. We do not want to get the most likely encoding, we want to validate that the text we are given is valid in some encodings, with some preferences.
Therefore I suggest we test the encodings one by one, probably in a loop like suggested in the comment above. This would restore our previous behaviour.

@HedicGuibert
Copy link
Member

I tried fixing it :

  • fix a test that was not working anymore (now it matches the other similar tests).
  • remove all the now useless mb_detect_order.
  • found a workaround for mb_detect_encoding to detect UTF-8 in priority but it is now returning UTF-8 all the time. This is what we had before and it seems to be ok.
  • removed encoding to HTML entites. It is deprecated to do it with mb_convert_encoding in PHP 8.2 (php/php-src@9308974). We actually don't want to decode thinks like &lt;3 to <3 because then libxml sees an open HTML tag. It was working before because we took advantage of the weird behaviour of mb_detect_encoding with some HTML entites (more info on this in the commit message above).

@lyrixx
Copy link
Member Author

lyrixx commented Jun 4, 2022

Awesome work!

I'll take care of the deprecation if you want.
Or if you want, you must add: https://github.com/symfony/recipes/blob/main/symfony/framework-bundle/5.3/config/packages/framework.yaml#L5

@@ -46,7 +46,7 @@ public function testFullPageMarkup()
HTML;

$fixed = <<<'STRING'
&#8220;Who Let the Dogs Out?&#8221; is a song written and originally recorded by Anslem Douglas (titled &#8220;Doggie&#8221;).
Who Let the Dogs Out? is a song written and originally recorded by Anslem Douglas (titled Doggie).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the behaviour changed here?! I don't get it :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version look good to me 👍🏼

@@ -356,7 +361,9 @@ private function fixContentEncoding($content)
mb_substr($content, $headPos);
}

$content = mb_convert_encoding($content, 'HTML-ENTITIES', $encoding);
if ('UTF-8' !== $encoding) {
$content = mb_convert_encoding($content, 'UTF-8', $encoding);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually i'm not sure about this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be HTML-ENTITIES

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there is no need to test if it's UTF-8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using mb_convert_encoding to convert to HTML entities is deprecated in PHP 8.2 and did not function well previously : php/php-src@9308974

And we don't want to ue html_entity_decode because it will break the fixer if the user pass something like 1 > 3 or <3.

My concern was more about the fact that we set the charset to $encoding but then we encode the content to UTF-8. This is weird. Either we don't set the charset or we don't convert IMO.

@damienalexandre damienalexandre merged commit 9d0b3cc into master Nov 20, 2022
@damienalexandre damienalexandre deleted the php8.1 branch November 20, 2022 21:15
@damienalexandre
Copy link
Member

Thank you both 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants