Unicode characters are not correctly parsed from byte[] if default charset is not UTF-8 #73

dpeger · 2021-04-23T20:16:03Z

The TestUtf8.supportI18nBytes test is currently broken on Windows systems. The test is green in my IDE (IntelliJ), but the Maven test target fails both in the IDE and on the command line:

[ERROR] Failures:
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Amharic text ==> expected: <አማርኛ> but was: <áŠ áˆ›áˆáŠ›>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Malayalam text ==> expected: <മലയാളം> but was: <à´®à´²à´¯à´¾à´³à´‚>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Assyrian Neo-Aramaic text ==> expected: <ܐܬܘܪܝܐ> but was: <Ü?Ü¬Ü˜ÜªÜ?Ü?>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Georgian text ==> expected: <მარგალური> but was: <áƒ›áƒ?áƒ áƒ’áƒ?áƒšáƒ£áƒ áƒ˜>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Sinhala text ==> expected: <සිංහල ජාතිය> but was: <à·ƒà·’à¶‚à·„à¶½ à¶¢à·?à¶à·’à¶º>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Japanese text ==> expected: <日本語> but was: <æ—¥æœ¬èªž>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Russian text ==> expected: <Русский> but was: <Ð ÑƒÑ?Ñ?ÐºÐ¸Ð¹>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Farsi text ==> expected: <فارسی> but was: <Ù?Ø§Ø±Ø³ÛŒ>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Korean text ==> expected: <한국어> but was: <í•œêµì–´>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Armenian text ==> expected: <Հայերեն> but was: <Õ€Õ¡ÕµÕ¥Ö€Õ¥Õ¶>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Hindi text ==> expected: <हिन्दी> but was: <à¤¹à¤¿à¤¨à¥?à¤¦à¥€>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Hebrew text ==> expected: <עברית> but was: <×¢×‘×¨×™×ª>
[ERROR] TestUtf8.supportI18nBytes:62 Parsing bytes[] Chinese text ==> expected: <中文> but was: <ä¸æ–‡>

The problem seems to be the creation of String using the system's default charset in JSONParserByteArray.extractString

json-smart-v2/json-smart/src/main/java/net/minidev/json/parser/JSONParserByteArray.java

Line 62 in 604281d

xs = new String(in, beginIndex, endIndex - beginIndex);

and JSONParserByteArray.extractStringTrim

json-smart-v2/json-smart/src/main/java/net/minidev/json/parser/JSONParserByteArray.java

Line 74 in 604281d

xs = new String(in, start, stop - start);
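
A minimal sketch of the difference, assuming the input bytes are UTF-8 encoded JSON. The class `ExplicitCharsetSketch` and its `extract` helper are hypothetical names used only for illustration and are not part of json-smart. ISO-8859-1 stands in for a non-UTF-8 platform default here, because the two-argument-less `new String(byte[], int, int)` constructor decodes with whatever `Charset.defaultCharset()` returns on the machine running the test.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetSketch {

    // Hypothetical helper mirroring the shape of the reported calls:
    // decode the slice [beginIndex, endIndex) with an explicitly chosen charset
    // instead of relying on the platform default.
    static String extract(byte[] in, int beginIndex, int endIndex, Charset charset) {
        return new String(in, beginIndex, endIndex - beginIndex, charset);
    }

    public static void main(String[] args) {
        // "Japanese" from the failing test, written with Unicode escapes so the
        // source file itself does not depend on the compiler's encoding.
        String expected = "\u65e5\u672c\u8a9e";
        byte[] in = expected.getBytes(StandardCharsets.UTF_8);

        // Assumption: ISO-8859-1 simulates a single-byte, non-UTF-8 default charset.
        String wrong = extract(in, 0, in.length, StandardCharsets.ISO_8859_1);
        // Decoding with the charset the bytes were actually written in.
        String right = extract(in, 0, in.length, StandardCharsets.UTF_8);

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("single-byte decode equals expected: " + expected.equals(wrong));
        System.out.println("UTF-8 decode equals expected: " + expected.equals(right));
    }
}
```

Passing the charset explicitly makes the result independent of the platform default, which would also explain why the same test can pass in one launcher and fail in another if the two start the JVM with different `file.encoding` settings.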



This was referenced Apr 23, 2021

[#73] Avoid String creation using system default charset #74

Merged

[#60][#62] Unchecked Exception in Parser #72

Merged

UrielCh closed this as completed in #74 Apr 24, 2021

UrielCh added a commit that referenced this issue Apr 24, 2021

Merge pull request #74 from dpeger/fixes/master/utf8-character-deseri…

9f6bb80

…alization-from-byte-array [#73] Avoid `String` creation using system default charset
