Map each code point of an ill-formed UTF-8 subsequence to a replacement character individually #301

fzhinkin · 2024-04-25T09:12:46Z

As was pointed out in #290 (comment), kotlinx-io converts different ill-formed UTF-8 subsequences differently: either the whole multi-code-point subsequence is replaced with a single replacement character, or each code point is converted separately:

0xf0 0x89 0x89 <EOF> -> �
0xf0 0x89 0x89 0x89 <EOF> -> �
0xf0 0xf0 0xf0 <EOF> -> ��

The UTF-8 spec allows handling these ill-formed sequences whatever way we want as long as errors are somehow reported. However, such behavior looks a bit inconsistent, and it's hard to reason about how an arbitrary byte sequence will be converted.

We should improve the way ill-formed sequences are handled and stick to an approach adopted by other languages/libraries: convert only ill-formed subsequences consisting of a single byte.

That's how it's done in:

Java:

jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"

Python 3:

>>> b'\xf0\x89\x89\x89'.decode("utf-8", errors='replace')
'����'

Go:

fmt.Println(string([]byte{0xf0, 0x89, 0x89, 0x89}))
...

����


ilya-g · 2024-05-01T12:06:16Z

See also the recommendation "U+FFFD Substitution of Maximal Subparts" in https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf
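For illustration, here is a minimal Python sketch of the maximal-subparts practice, assuming the well-formed byte ranges from Table 3-7 of the Unicode Standard. The names `decode_maximal_subparts` and `_second_byte_range` are hypothetical and are not part of kotlinx-io or any other library; this is a sketch of the recommendation, not the kotlinx-io implementation.

```python
REPLACEMENT = "\ufffd"


def _second_byte_range(lead):
    """Return (trailing_count, low, high) for the byte that follows `lead`,
    or None if `lead` cannot start a well-formed sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
    if lead == 0xED:
        return 2, 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
    if 0xE1 <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xF0:
        return 3, 0x90, 0xBF  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    if lead == 0xF4:
        return 3, 0x80, 0x8F  # excludes code points above U+10FFFF
    return None  # 0x80..0xC1 and 0xF5..0xFF never start a sequence


def decode_maximal_subparts(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        spec = _second_byte_range(lead)
        if spec is None:  # stray continuation byte or invalid lead byte
            out.append(REPLACEMENT)
            i += 1
            continue
        count, low, high = spec
        code_point = lead & (0x3F >> count)  # payload bits of the lead byte
        j = i + 1
        for _ in range(count):
            if j >= n or not (low <= data[j] <= high):
                break  # data[i:j] is the maximal subpart
            code_point = (code_point << 6) | (data[j] & 0x3F)
            j += 1
            low, high = 0x80, 0xBF  # later bytes use the generic range
        else:
            out.append(chr(code_point))  # complete well-formed sequence
            i = j
            continue
        out.append(REPLACEMENT)  # one U+FFFD per maximal subpart
        i = j  # resume at the byte that broke the sequence
    return "".join(out)


for sample in (b"\xf0\x89\x89", b"\xf0\x89\x89\x89", b"\xf0\xf0\xf0"):
    print(sample.hex(" "), "->", len(decode_maximal_subparts(sample)), "x U+FFFD")
```

With this approach, a lead byte followed by a byte outside its allowed range is replaced on its own, and each stray continuation byte gets its own replacement character, while a truncated but otherwise valid prefix such as `0xf0 0x90 0x80 <EOF>` collapses into a single one.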

fzhinkin · 2024-10-01T20:48:43Z

It seems like kotlinx-io behavior could be aligned w/ Kotlin Stdlib (ByteArray.decodeToString in particular) in all scenarios except the handling of surrogate code points. On JVM, byte sequences encoding surrogate code points are replaced with a single �, on all other platforms with ��:

ubyteArrayOf(0xedu, 0xbfu, 0xbfu).asByteArray().decodeToString()

https://pl.kotl.in/LMjgmMVGX
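As another point of comparison, the same three bytes can be fed to Python 3's built-in decoder. This is only a sketch for cross-checking other implementations; it says nothing about what kotlinx-io or the Kotlin Stdlib should do. The variable names are illustrative.

```python
# 0xED 0xBF 0xBF is the would-be UTF-8 encoding of the surrogate U+DFFF,
# which is ill-formed in UTF-8.
data = bytes([0xED, 0xBF, 0xBF])

# Lenient decoding: ill-formed input is replaced with U+FFFD.
replaced = data.decode("utf-8", errors="replace")
print(len(replaced), [hex(ord(ch)) for ch in replaced])

# Strict decoding (the default) rejects the sequence outright.
try:
    data.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
print("strict decoding rejected the input:", strict_rejected)
```

After 0xED only 0x80..0x9F may follow in well-formed UTF-8, so under the maximal-subparts practice each of the three bytes is its own ill-formed subsequence.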

fzhinkin added enhancement encodings labels Apr 25, 2024

fzhinkin mentioned this issue Apr 25, 2024

Improve test coverage #290

Merged

fzhinkin added this to the kotlinx-io stabilization milestone May 6, 2024