Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

Closed
TakWolf opened this issue Feb 13, 2023 · 14 comments
Closed

euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863

TakWolf opened this issue Feb 13, 2023 · 14 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@TakWolf
Copy link

TakWolf commented Feb 13, 2023

Bug report

char '0x3164' can be encode('ksx1001'), but can not decode('ksx1001')

def main():
    code_point = 0x3164
    c = chr(code_point)
    raw = c.encode('ksx1001')
    c2 = raw.decode('ksx1001')  # <--- this cause error 
    print(f'{c} {c2}')

if __name__ == '__main__':
    main()
Traceback (most recent call last):
  File "/Users/takwolf/Develop/FontDev/fusion-pixel-font/build.py", line 11, in <module>
    main()
  File "/Users/takwolf/Develop/FontDev/fusion-pixel-font/build.py", line 6, in main
    c2 = raw.decode('ksx1001')
         ^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'euc_kr' codec can't decode bytes in position 0-1: incomplete multibyte sequence

The char is Hangul Compatibility Jamo -> Hangul Filler

https://unicode-table.com/en/3164/

image

The following code is get the zone in ks-x-1001:

def main():
    code_point = 0x3164
    c = chr(code_point)
    raw = c.encode('ksx1001')
    block_offset = 0xA0
    zone_1 = raw[0] - block_offset
    zone_2 = raw[1] - block_offset
    print(f'{zone_1} {zone_2}')


if __name__ == '__main__':
    main()
zone_1 = 4 
zone_2 = 52

https://en.wikipedia.org/wiki/KS_X_1001#Hangul_Filler
image

image

other chars in ksx1001 encode an decode is ok, but only this.

Your environment

  • CPython versions tested on: Python 3.11.1
  • Operating system and architecture: macOS 13.0

Linked PRs

@TakWolf TakWolf added the type-bug An unexpected behavior, bug, or error label Feb 13, 2023
@arhadthedev arhadthedev added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Feb 13, 2023
@arhadthedev
Copy link
Member

@ezio-melotti (as an active Unicode expert)

@corona10
Copy link
Member

I will take a look at this issue as the internal OSS sprint issue.
I will delegate this issue to my mentees.

@b8choi
Copy link
Contributor

b8choi commented Feb 20, 2023

I will try to resolve this issue.

@TakWolf
Copy link
Author

TakWolf commented Feb 23, 2023

I'm not a python expert.

str.encode() and str.decode() is a builtin fun?

I found the coding logic maybe here:

https://github.com/python/cpython/tree/main/Lib/encodings

https://github.com/python/cpython/blob/main/Lib/encodings/euc_kr.py#L7-L10

that register a codec named 'euc_kr' to provide encode('euc_kr')

and import _codecs_kr is a c source?

https://github.com/python/cpython/blob/main/Modules/cjkcodecs/_codecs_kr.c#L7-L9

and mappings_kr.h

https://github.com/python/cpython/blob/main/Modules/cjkcodecs/mappings_kr.h

It seems that some arrays are provided here to encode/decode query.

static const ucs2_t __ksx1001_decmap[8264]
static const struct dbcs_index ksx1001_decmap[256]
static const ucs2_t __cp949ext_decmap[9650]
static const struct dbcs_index cp949ext_decmap[256] 
static const DBCHAR __cp949_encmap[33133]
static const struct unim_index cp949_encmap[256]

and mappings_kr.h is Generate by https://github.com/python/cpython/blob/main/Tools/unicode/genmap_korean.py

https://github.com/python/cpython/blob/main/Tools/unicode/genmap_korean.py#L12-L58

I don't understand why the cp949encmap size not equals the ksx1001decmap size?

Is euc_kr bi-directional in rules? That if can be encoded, it must can be decoded?

Is CP949 equals to ksx1001?

@corona10
Copy link
Member

Is CP949 equals to ksx1001?

Not equals

@corona10
Copy link
Member

I don't understand why the cp949encmap size not equals the ksx1001decmap size?

Because CP949 is a kind of extended version of KSX1001

@TakWolf
Copy link
Author

TakWolf commented Feb 23, 2023

iterate unicode :

def main():
    encode_count = 0
    decode_count = 0
    for code_point in range(0, 0x10FFFF + 1):
        c = chr(code_point)
        try:
            raw = c.encode('ksx1001')
            encode_count += 1
            try:
                raw.decode('ksx1001')
                decode_count += 1
            except:
                pass
        except:
            pass
    print(f'encode: {encode_count}')
    print(f'decode: {decode_count}')
encode: 17176
decode: 17175

iterate ksx1001 :

def main():
    decode_count = 0
    for zone_1 in range(1, 94 + 1):
        for zone_2 in range(1, 94 + 1):
            try:
                bytes([zone_1 + 0xA0, zone_2 + 0xA0]).decode('ksx1001')
                decode_count += 1
            except UnicodeDecodeError:
                pass
    print(f'decode: {decode_count}')
decode: 8225

@TakWolf
Copy link
Author

TakWolf commented Feb 23, 2023

def main():
    not_2_bytes_alphabet = list()
    not_2_bytes_count = 0
    ksx1001_other_count = 0
    ksx1001_syllable_count = 0
    ksx1001_word_count = 0
    ksx1001_but_not_a_char_count = 0
    for code_point in range(0, 0x10FFFF + 1):
        c = chr(code_point)
        try:
            raw = c.encode('ksx1001')
        except UnicodeEncodeError:
            continue

        if len(raw) != 2:
            not_2_bytes_count += 1
            not_2_bytes_alphabet.append(c)
            continue

        zone_1 = raw[0] - 0xA0
        if 1 <= zone_1 <= 12:
            ksx1001_other_count += 1
        elif 16 <= zone_1 <= 40:
            ksx1001_syllable_count += 1
        elif 42 <= zone_1 <= 93:
            ksx1001_word_count += 1
        else:  # zone_1 is 13~15, 41, 94
            ksx1001_but_not_a_char_count += 1

    print(f'ksx other + syllable + word :{ksx1001_other_count + ksx1001_syllable_count + ksx1001_word_count}')
    print(f'other: {ksx1001_other_count}')
    print(f'syllable: {ksx1001_syllable_count}')
    print(f'word: {ksx1001_word_count}')
    print(f'in ksx but not a char: {ksx1001_but_not_a_char_count}')

    print(f'not 2 len bytes: {not_2_bytes_count}')
    print(f'not 2 len bytes + syllable: {not_2_bytes_count + ksx1001_syllable_count}')
    print('unicode all syllable: 11172')
    print(''.join(not_2_bytes_alphabet))
ksx other + syllable + word :8226     # 8225 + 1(Hangul Filler)
other: 988
syllable: 2350
word: 4888
in ksx but not a char: 0

not 2 len bytes: 8950     # <---- what's this ? Include other syllables not in ksx1001 2350?
not 2 len bytes + syllable: 11300
unicode all syllable: 11172

not_2_bytes_alphabet:

�������	
������������������ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~�갂갃갅갆갋 ......  # Behind them are all syllables

@b8choi
Copy link
Contributor

b8choi commented Mar 2, 2023

From my investigation so far, there are some unclear aspects of the decoding process related to Korean characters encoded in EUC-KR.

FYI, the KS X 1001 standard defines the Korean character set, while EUC-KR is an encoding method for this set. In Python, "ksx1001" is an alias of "euc_kr".

Korean alphabet, also known as Hangul, expresses one syllable consisting of consonants and vowels as one letter. The KS X 1001 standard defines a precomposed set of characters that correspond to one syllable rather than individual consonants and vowels. However, to express a character not included in this set, a method of composing consonants and vowels is defined. The "Hangul Filler" character is used for this purpose, and it, along with the following three consonants and vowels, are arranged to express one character in a total of 8 bytes.

The KS X 1001 standard does not provide any specific guidelines other than the method for expressing one character using the "Hangul Filler".

Korean characters not included in this standard are decomposed into consonants and vowels, and the Korean alphabet code value of the special character area corresponding to each part is used for information exchange. To differentiate it from information exchange for individual consonant and vowel characters, one "Hangul Filler" is added in front of each syllable.

The main issue is whether to allow an independent "Hangul Filler" that is not followed by valid consonants and vowels to compose a single character. Currently, the EUC-KR encoder does not validate strings containing "Hangul Filler" and just convert "Hangul Filler" of KS X 1001 to the corresponding Unicode "Hangul Filler".

I think there are three possible solutions to this issue, which need to be discussed:

  1. Not fix anything: allow encoding to contain independent "Hangul Filler" characters, while decoding will treat an sequence starting with "Hangul Filler" as an error if it does not represent a valid combination of Korean characters.
  2. Fix the encoder: strings that contain independent "Hangul Filler" will be treated as an error (this may cause new errors in the existing codes).
  3. Fix the decoder: a valid sequence will be converted into one composed character, and in other cases, the code of each character will be simply converted into the corresponding code in Unicode.

I have tried the third method and confirmed that no errors have occurred.

For more information, please refer to the KS X 1001 specification and a detailed description of "Hangul Filler" character.

@corona10
Copy link
Member

corona10 commented Mar 2, 2023

I prefer supporting the round-trip approach since we already support encoding.
But I would need to listen to the advice from the other core devs :)

cc @ezio-melotti @vstinner

@corona10
Copy link
Member

corona10 commented Mar 2, 2023

@b8choi

I have tried the third method and confirmed that no errors have occurred.

Would you like to submit the patch anyway? We may need to analyze the side effect when we support the decoding for the character.
But please submit the patch with a nice test code.

carljm added a commit to carljm/cpython that referenced this issue Mar 6, 2023
* main: (21 commits)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472)
  pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455)
  pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467)
  pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667)
  pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664)
  pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665)
  pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445)
  pythonGH-102341: Improve the test function for pow (python#102342)
  Fix unused classes in a typing test (pythonGH-102437)
  pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318)
  pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426)
  Move around example in to_bytes() to avoid confusion (python#101595)
  pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421)
  pythongh-96821: Add config option `--with-strict-overflow` (python#96823)
  pythongh-101992: update pstlib module documentation (python#102133)
  pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699)
  pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417)
  pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303)
  pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180)
  pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896)
  ...
carljm added a commit to carljm/cpython that referenced this issue Mar 7, 2023
* main: (37 commits)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472)
  pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455)
  pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467)
  pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667)
  pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664)
  pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665)
  pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445)
  pythonGH-102341: Improve the test function for pow (python#102342)
  Fix unused classes in a typing test (pythonGH-102437)
  pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318)
  pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426)
  Move around example in to_bytes() to avoid confusion (python#101595)
  pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421)
  pythongh-96821: Add config option `--with-strict-overflow` (python#96823)
  pythongh-101992: update pstlib module documentation (python#102133)
  pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699)
  pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417)
  pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303)
  pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180)
  pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896)
  ...
@corona10
Copy link
Member

Since the decoding implementation for this case is undefined and difficult to support decoding for this edge case due to streaming issues.
I close this issue as not a bug.

If someone think that this is a bug please let me know.

@TakWolf
Copy link
Author

TakWolf commented Mar 14, 2023

So far, I still haven't got an explanation that chr(0x3164).encode('ksx1001').decode('ksx1001') will get a exception is or not a expected behavior (Why python do like this? What is the document or implements reference?).

At least in my understanding, the encoder should be idempotent (That if some char can be encoded, it should be decoded).
For example there are some control characters. They have special purposes in the display layer, but this does not affect that they can be encoded and decoded.

I also deal with other encoding standards ( gb2312, shitf-jis, big5), that have not the similar behaviors (They are all idempotent).

And this problem does not exist in Java. (I will test later in Rust , Go and node-Javascript)

image
image

EUC was used to solve computer coding in the early stage, if they can't convert with unicode(utf-8) equivalent, it seems incredible to me.

@TakWolf
Copy link
Author

TakWolf commented Mar 14, 2023

My opinion, encoding is a storage structure, data should not be changed through coders but keep it as it is.
So for b8choi's proposal 2 and 3, they are not suitable.
If the char is in the scope, coder should not throw a exception. And coder should not changed the data.
(Imagine that you just browse the ksx file in a utf8 system, the data is changed, this is not reasonable)
Ksx can be fully mapped to unicode, convert the hangul is meaningless. There is also a Hangul Filler in Unicode. Should we convert all the Hangul Filler combin chars to hangul in utf-8 data layer?

An evidence that may support my opinion, there is a Emoji Combinations in Unicode.
For example: 👩‍❤️‍💋‍👨 = 👩+❤️+‍💋+‍👨
This just a display, on data they are equal.

Hangul Filler ’s role should be similar to Ideographic Description Characters only effect on display layer.
If the text engine support Hangul Filler, it will display hangul only on display. If not display as char is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants