As was pointed out in #290 (comment), kotlinx-io converts different ill-formed UTF-8 subsequences differently: either the whole multi-code-point subsequence is replaced with a single replacement character, or each code point is converted separately:
0xf0 0x89 0x89 <EOF> -> �
0xf0 0x89 0x89 0x89 <EOF> -> �
0xf0 0xf0 0xf0 <EOF> -> ���
The UTF-8 spec allows handling these ill-formed sequences in whatever way we want, as long as errors are somehow reported. However, the current behavior looks inconsistent, and it's hard to reason about how an arbitrary byte sequence will be converted.
We should improve the way ill-formed sequences are handled and stick to the approach adopted by other languages/libraries: convert only ill-formed subsequences consisting of a single byte, so that each ill-formed byte gets its own replacement character.
That's how it's done in:
Java:
jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"
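To make the comparison concrete, the three inputs from the top of this issue can be run through the JDK's own UTF-8 decoder. This is only a sketch of how one might check it: the class and helper names (`Utf8ReplacementDemo`, `decode`, `replacements`) are made up for illustration, and the only real APIs used are `java.lang.String` and `StandardCharsets.UTF_8`. The charset is passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative, not part of kotlinx-io.
final class Utf8ReplacementDemo {
    // Decode the given byte values as UTF-8. String's constructor replaces
    // ill-formed input with U+FFFD instead of throwing.
    static String decode(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Count how many U+FFFD replacement characters the decoder produced.
    static long replacements(int... bytes) {
        return decode(bytes).chars().filter(c -> c == 0xFFFD).count();
    }

    public static void main(String[] args) {
        System.out.println(replacements(0xf0, 0x89, 0x89));       // 3
        System.out.println(replacements(0xf0, 0x89, 0x89, 0x89)); // 4
        System.out.println(replacements(0xf0, 0xf0, 0xf0));       // 3
    }
}
```

In each of these inputs the number of replacement characters equals the number of input bytes, which is the consistency the current kotlinx-io behavior lacks.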
It seems like kotlinx-io behavior could be aligned with the Kotlin stdlib (ByteArray.decodeToString in particular) in all scenarios except the handling of surrogate code points. On JVM, byte sequences encoding surrogate code points are replaced with a single �; on all other platforms, with ���.
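The JVM side of the surrogate case can be reproduced with plain Java. The snippet below is a minimal sketch, assuming the JVM observation reflects what `java.lang.String` does with such input. `SurrogateDecodingDemo` and `decodeUtf8` are hypothetical names, and `0xED 0xA0 0x80` is used as the three-byte form of U+D800 (a high surrogate), which is ill-formed in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
final class SurrogateDecodingDemo {
    // Decode the given byte values as UTF-8 with the JDK's built-in decoder.
    static String decodeUtf8(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ED A0 80 would encode U+D800 if surrogates were allowed in UTF-8.
        String decoded = decodeUtf8(0xed, 0xa0, 0x80);
        System.out.println(decoded.length());         // 1
        System.out.println(decoded.equals("\uFFFD")); // true
    }
}
```

Here three input bytes collapse into a single replacement character, unlike the one-per-byte results for the other ill-formed inputs, which is why surrogates need a separate decision.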