-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
euc_kr char '0x3164' decode('ksx1001') cause UnicodeDecodeError #101863
Comments
@ezio-melotti (as an active Unicode expert) |
I will take a look at this issue as the internal OSS sprint issue. |
I will try to resolve this issue. |
I'm not a python expert.
I found the coding logic maybe here: https://github.com/python/cpython/tree/main/Lib/encodings https://github.com/python/cpython/blob/main/Lib/encodings/euc_kr.py#L7-L10 that register a codec named 'euc_kr' to provide and https://github.com/python/cpython/blob/main/Modules/cjkcodecs/_codecs_kr.c#L7-L9 and https://github.com/python/cpython/blob/main/Modules/cjkcodecs/mappings_kr.h It seems that some arrays are provided here to encode/decode query.
and https://github.com/python/cpython/blob/main/Tools/unicode/genmap_korean.py#L12-L58 I don't understand why the Is Is CP949 equals to ksx1001? |
Not equals |
Because CP949 is a kind of extended version of KSX1001 |
iterate unicode : def main():
encode_count = 0
decode_count = 0
for code_point in range(0, 0x10FFFF + 1):
c = chr(code_point)
try:
raw = c.encode('ksx1001')
encode_count += 1
try:
raw.decode('ksx1001')
decode_count += 1
except:
pass
except:
pass
print(f'encode: {encode_count}')
print(f'decode: {decode_count}')
iterate ksx1001 : def main():
decode_count = 0
for zone_1 in range(1, 94 + 1):
for zone_2 in range(1, 94 + 1):
try:
bytes([zone_1 + 0xA0, zone_2 + 0xA0]).decode('ksx1001')
decode_count += 1
except UnicodeDecodeError:
pass
print(f'decode: {decode_count}')
|
def main():
not_2_bytes_alphabet = list()
not_2_bytes_count = 0
ksx1001_other_count = 0
ksx1001_syllable_count = 0
ksx1001_word_count = 0
ksx1001_but_not_a_char_count = 0
for code_point in range(0, 0x10FFFF + 1):
c = chr(code_point)
try:
raw = c.encode('ksx1001')
except UnicodeEncodeError:
continue
if len(raw) != 2:
not_2_bytes_count += 1
not_2_bytes_alphabet.append(c)
continue
zone_1 = raw[0] - 0xA0
if 1 <= zone_1 <= 12:
ksx1001_other_count += 1
elif 16 <= zone_1 <= 40:
ksx1001_syllable_count += 1
elif 42 <= zone_1 <= 93:
ksx1001_word_count += 1
else: # zone_1 is 13~15, 41, 94
ksx1001_but_not_a_char_count += 1
print(f'ksx other + syllable + word :{ksx1001_other_count + ksx1001_syllable_count + ksx1001_word_count}')
print(f'other: {ksx1001_other_count}')
print(f'syllable: {ksx1001_syllable_count}')
print(f'word: {ksx1001_word_count}')
print(f'in ksx but not a char: {ksx1001_but_not_a_char_count}')
print(f'not 2 len bytes: {not_2_bytes_count}')
print(f'not 2 len bytes + syllable: {not_2_bytes_count + ksx1001_syllable_count}')
print('unicode all syllable: 11172')
print(''.join(not_2_bytes_alphabet))
not_2_bytes_alphabet:
|
From my investigation so far, there are some unclear aspects of the decoding process related to Korean characters encoded in EUC-KR. FYI, the KS X 1001 standard defines the Korean character set, while EUC-KR is an encoding method for this set. In Python, "ksx1001" is an alias of "euc_kr". Korean alphabet, also known as Hangul, expresses one syllable consisting of consonants and vowels as one letter. The KS X 1001 standard defines a precomposed set of characters that correspond to one syllable rather than individual consonants and vowels. However, to express a character not included in this set, a method of composing consonants and vowels is defined. The "Hangul Filler" character is used for this purpose, and it, along with the following three consonants and vowels, are arranged to express one character in a total of 8 bytes. The KS X 1001 standard does not provide any specific guidelines other than the method for expressing one character using the "Hangul Filler".
The main issue is whether to allow an independent "Hangul Filler" that is not followed by valid consonants and vowels to compose a single character. Currently, the EUC-KR encoder does not validate strings containing "Hangul Filler" and just convert "Hangul Filler" of KS X 1001 to the corresponding Unicode "Hangul Filler". I think there are three possible solutions to this issue, which need to be discussed:
For more information, please refer to the KS X 1001 specification and a detailed description of "Hangul Filler" character. |
I prefer supporting the round-trip approach since we already support encoding. |
Would you like to submit the patch anyway? We may need to analyze the side effect when we support the decoding for the character. |
* main: (21 commits) pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472) pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455) pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467) pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667) pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664) pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665) pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445) pythonGH-102341: Improve the test function for pow (python#102342) Fix unused classes in a typing test (pythonGH-102437) pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318) pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426) Move around example in to_bytes() to avoid confusion (python#101595) pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421) pythongh-96821: Add config option `--with-strict-overflow` (python#96823) pythongh-101992: update pstlib module documentation (python#102133) pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699) pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417) pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303) pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180) pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896) ...
* main: (37 commits) pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives in sub interpreters module (python#102472) pythongh-95672: Fix versionadded indentation of get_pagesize in test.rst (pythongh-102455) pythongh-102416: Do not memoize incorrectly loop rules in the parser (python#102467) pythonGH-101362: Optimise PurePath(PurePath(...)) (pythonGH-101667) pythonGH-101362: Check pathlib.Path flavour compatibility at import time (pythonGH-101664) pythonGH-101362: Call join() only when >1 argument supplied to pathlib.PurePath() (python#101665) pythongh-102444: Fix minor bugs in `test_typing` highlighted by pyflakes (python#102445) pythonGH-102341: Improve the test function for pow (python#102342) Fix unused classes in a typing test (pythonGH-102437) pythongh-101979: argparse: fix a bug where parentheses in metavar argument of add_argument() were dropped (python#102318) pythongh-102356: Add thrashcan macros to filter object dealloc (python#102426) Move around example in to_bytes() to avoid confusion (python#101595) pythonGH-97546: fix flaky asyncio `test_wait_for_race_condition` test (python#102421) pythongh-96821: Add config option `--with-strict-overflow` (python#96823) pythongh-101992: update pstlib module documentation (python#102133) pythongh-63301: Set exit code when tabnanny CLI exits on error (python#7699) pythongh-101863: Fix wrong comments in EUC-KR codec (pythongh-102417) pythongh-102302 Micro-optimize `inspect.Parameter.__hash__` (python#102303) pythongh-102179: Fix `os.dup2` error reporting for negative fds (python#102180) pythongh-101892: Fix `SystemError` when a callable iterator call exhausts the iterator (python#101896) ...
Since the decoding implementation for this case is undefined and difficult to support decoding for this edge case due to streaming issues. If someone think that this is a bug please let me know. |
So far, I still haven't got an explanation that At least in my understanding, the encoder should be idempotent (That if some char can be encoded, it should be decoded). I also deal with other encoding standards ( gb2312, shitf-jis, big5), that have not the similar behaviors (They are all idempotent). And this problem does not exist in Java. (I will test later in Rust , Go and node-Javascript) EUC was used to solve computer coding in the early stage, if they can't convert with unicode(utf-8) equivalent, it seems incredible to me. |
My opinion, encoding is a storage structure, data should not be changed through coders but keep it as it is. An evidence that may support my opinion, there is a Emoji Combinations in Unicode. Hangul Filler ’s role should be similar to Ideographic Description Characters only effect on display layer. |
Bug report
char '0x3164' can be
encode('ksx1001')
, but can notdecode('ksx1001')
The char is
Hangul Compatibility Jamo -> Hangul Filler
https://unicode-table.com/en/3164/
The following code is get the zone in ks-x-1001:
https://en.wikipedia.org/wiki/KS_X_1001#Hangul_Filler
other chars in ksx1001 encode an decode is ok, but only this.
Your environment
Linked PRs
The text was updated successfully, but these errors were encountered: