Perl 5.30.2 replaces some invalid UTF-8 byte sequences inconsistent with current best practices. #166

flenniken · 2022-02-13T20:34:30Z

Perl 5.30.2 replaces some invalid UTF-8 byte sequences in a way that is inconsistent with current best practices.

The Unicode specification says:

An increasing number of implementations are adopting the handling of
ill-formed subsequences as specified in the W3C standard for encoding
to achieve consistent U+FFFD replacements.

See:

Unicode 14.0 -- Unicode 14.0 Specification -- Conformance page 126, section 3.9.
w3.org Encoding -- w3.org encoding
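
For readers who want to see the replacement rule concretely, here is a minimal Python sketch of a decoder in the style the w3.org Encoding standard describes: one U+FFFD for each lead byte plus whatever valid continuation bytes were consumed, with the offending byte then reprocessed on its own. The function name `decode_utf8_replace` is made up for this illustration, and the code is a simplified reading of the standard, not a copy of any implementation.

```python
def decode_utf8_replace(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per ill-formed subsequence."""
    out = []
    code_point = 0
    needed = 0                   # continuation bytes still expected
    lower, upper = 0x80, 0xBF    # allowed range for the next continuation byte
    i = 0
    while i < len(data):
        byte = data[i]
        if needed == 0:
            if byte <= 0x7F:
                out.append(chr(byte))
            elif 0xC2 <= byte <= 0xDF:
                needed, code_point = 1, byte & 0x1F
            elif 0xE0 <= byte <= 0xEF:
                if byte == 0xE0:
                    lower = 0xA0     # excludes overlong 3-byte forms
                elif byte == 0xED:
                    upper = 0x9F     # excludes surrogates
                needed, code_point = 2, byte & 0x0F
            elif 0xF0 <= byte <= 0xF4:
                if byte == 0xF0:
                    lower = 0x90     # excludes overlong 4-byte forms
                elif byte == 0xF4:
                    upper = 0x8F     # excludes values above U+10FFFF
                needed, code_point = 3, byte & 0x07
            else:
                out.append("\ufffd")  # stray continuation or invalid lead byte
            i += 1
        elif lower <= byte <= upper:
            lower, upper = 0x80, 0xBF
            code_point = (code_point << 6) | (byte & 0x3F)
            needed -= 1
            if needed == 0:
                out.append(chr(code_point))
            i += 1
        else:
            # Bad continuation: replace the bytes consumed so far and
            # reprocess the current byte as the start of a new sequence.
            out.append("\ufffd")
            needed = 0
            lower, upper = 0x80, 0xBF
    if needed:
        out.append("\ufffd")          # input ended in the middle of a sequence
    return "".join(out)
```

With this sketch, `decode_utf8_replace(bytes.fromhex("e08080"))` yields three replacement characters, because `e0` may only be followed by a byte in the range `a0` to `bf`, so each of the three bytes is rejected on its own.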

For example, the hex byte sequence:

gets encoded as:

instead of:

Here are a few more examples:

Perl decode: e0 80 80
expected: ef bf bd ef bf bd ef bf bd
got: ef bf bd

Perl decode: f0 80 80 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd
got: ef bf bd

Perl decode: ed ae 80 ed b0 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
got: ef bf bd ef bf bd
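
As a cross-check of the "expected" values above, here is a short Python sketch that runs the same three inputs through CPython's built-in decoder with `errors="replace"` and prints the UTF-8 bytes of the result. It assumes a recent Python 3 release, whose decoder I believe follows the same replacement practice; the helper name `replaced_hex` is hypothetical.

```python
CASES = ["e0 80 80", "f0 80 80 80", "ed ae 80 ed b0 80"]


def replaced_hex(hex_bytes: str) -> str:
    """Decode with replacement, then show the result as UTF-8 hex bytes."""
    text = bytes.fromhex(hex_bytes).decode("utf-8", errors="replace")
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


for case in CASES:
    print(f"{case} -> {replaced_hex(case)}")
```

Each input byte should come back as its own `ef bf bd`, matching the "expected" lines rather than the "got" lines.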

See https://github.com/flenniken/utf8tests for more information.
