Condition for handling malformed UTF-8; also an interface to iconv #4837
Labels
A-Unicode
Area: Unicode
C-enhancement
Category: An issue proposing an enhancement or a PR with one.
E-easy
Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
Currently even this simple `cat` program: ... fails on broken or invalid UTF-8 strings (or on input in some other character encoding, as this example illustrates):
...because the byte sequence is assumed to be UTF-8 (which it is not). But there is currently no standard way to repair broken UTF-8 strings by replacing the offending substrings with valid UTF-8, so this kind of bug is hard to fix.
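To make the requested behaviour concrete, here is a minimal sketch in Python rather than a proposed interface. It decodes a byte string that is not valid UTF-8 and substitutes U+FFFD (the Unicode replacement character) for each malformed sequence. The helper name `decode_lossy` and the sample bytes are assumptions made up for illustration; only `bytes.decode` and its `errors` argument are real Python API.

```python
# Sample input (an assumption for illustration): "café" encoded as Latin-1.
# As UTF-8 it is malformed, because 0xE9 opens a three-byte sequence
# that is never completed.
raw = b"caf\xe9\n"


def decode_lossy(data: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, replacing malformed
    sequences with U+FFFD instead of failing."""
    return data.decode("utf-8", errors="replace")


# Strict decoding rejects the input outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lossy decoding always produces a valid Unicode string.
repaired = decode_lossy(raw)
```

A handler-based design, where the caller chooses what to do with the offending bytes (fail, replace, skip), would cover the same ground; the replacement strategy is just the simplest one to show.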
This issue is ultimately tied to general character encoding handling (a libiconv binding, perhaps?) and to a strict distinction between byte sequences and Unicode (UTF-8) strings. I found Python's approach reasonable (`bytes` and `str` are separate types, converted to each other via the `encode` and `decode` methods; a normal file `open` reads bytes, while `codecs.open` with an encoding converts them to `str`), but I'm really not sure what the actual interface should look like.
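For reference, the Python model described above looks roughly like the sketch below. It is only an illustration of the bytes/str split, not a suggestion for the final interface. It assumes Python 3, where the built-in `open` accepts an `encoding` argument and so plays the role attributed to `codecs.open` above; the sample text and the temporary file are made up for the example.

```python
import os
import tempfile

# str is a sequence of Unicode code points; bytes is raw data.
text = "na\u00efve caf\u00e9"
data = text.encode("iso-8859-1")      # str -> bytes via encode

# Write the Latin-1 bytes to a throwaway temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Binary mode returns bytes; no encoding is assumed.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding decodes to str while reading.
with open(path, "r", encoding="iso-8859-1") as f:
    decoded = f.read()

os.remove(path)

# bytes -> str via decode gives the same result as the decoding reader.
round_trip = raw.decode("iso-8859-1")
```

The point of the split is that the encoding is named explicitly at the boundary where bytes become text, so nothing downstream has to guess whether a value is UTF-8.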