Reconsider UTF-32 support #545
CC @hsivonen
If there is interest in interfacing with Python at that level instead of going via UTF-8, I guess that's a use case, then. Note: Python doesn't guarantee UTF-32 validity. The 32-bit-code-unit strings can contain lone surrogates, so if the use case is interfacing with Python without UTF-8 conversion, ICU4X would need to check that each code unit is in the valid range for a Rust `char`.
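To make the validity point concrete, here is a small stdlib-only sketch showing that a Python `str` can hold a lone surrogate, and what a per-code-unit check would look like (the `all_scalar_values` helper is hypothetical for illustration, not an ICU4X or PyO3 API):

```python
# A Python str may contain lone surrogates, which are not valid
# Unicode scalar values (and not representable as a Rust char).
s = "a\ud800b"  # lone high surrogate, allowed in a str literal
assert len(s) == 3

# UTF-8 encoding rejects the lone surrogate:
try:
    s.encode("utf-8")
    valid_utf8 = True
except UnicodeEncodeError:
    valid_utf8 = False

# The per-code-unit check an FFI layer would need to perform
# (hypothetical helper; surrogates occupy U+D800..U+DFFF):
def all_scalar_values(text):
    return all(not (0xD800 <= ord(c) <= 0xDFFF) for c in text)
```

Here `valid_utf8` ends up `False`, and `all_scalar_values(s)` is `False` while it is `True` for any well-formed string.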
I would hypothetically be interested in writing a Python binding once icu4x has a C API, as an alternative to the very un-Pythonic and under-documented PyICU. (I'm the author of the PyICU cheat sheet, which is afaik the only API documentation specific to ICU in Python; otherwise, you're just referred to the C++ API and left to work out how it maps onto Python yourself.)
Author of PyICU here: if you find PyICU very unpythonic, please provide concrete examples of how you're doing something with PyICU and how you'd suggest it be done instead in order to be more Pythonic. I'm happy to either fix actual un-Pythonic examples of PyICU use cases or show you how it's done. Please be very specific; there is already a lot of built-in Python iterator support, for example, that you may just not know about. It's OK to ask and suggest improvements (!)
For example, from your cheat sheet, you seem to not know that BreakIterator is a Python iterator.
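A sketch of the pattern being described, using a hardcoded boundary list in place of a live PyICU `BreakIterator` (which yields boundary offsets when iterated), so it runs without PyICU installed:

```python
# Iterating a word BreakIterator yields boundary offsets, not words.
text = "Hello world"
boundaries = [5, 6, 11]  # offsets a word BreakIterator would yield

# Turning boundary offsets into the actual segments:
starts = [0] + boundaries[:-1]
segments = [text[a:b] for a, b in zip(starts, boundaries)]
# segments == ['Hello', ' ', 'world']
```

With real PyICU, the `boundaries` list would come from iterating the break iterator directly rather than being hardcoded.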
Yes, I understand: you'd prefer the actual words to be returned, but that's not how ICU designed the BreakIterator; they chose to give you boundaries instead.
We will revisit string encodings as we approach ICU4X v1.
Discussion:
This should not be taken as an endorsement of UTF-32, but as a matter of how hard things would be for the collator specifically:
The collator and the decomposing normalizer consume an iterator over `char`.
Since most strings don't contain supplementary-plane characters, supporting UTF-32 wouldn't really help: if most Python strings were converted to UTF-32 at the ICU4X API boundary, they might as well be converted to UTF-8, unless indices are returned. Indices are relevant to the segmenter; in that case, it might actually help Python to convert to UTF-32 and then segment that. Other than that, the case where avoiding conversion to UTF-8 might make sense is the collator, which does a lot of string reading without modification. However, for the collator to operate without creating (converted) copies of the Python string data, there would need to be six comparison methods, one for each unordered pairing of the three internal representations (Latin-1, UTF-16, and UTF-32).
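The arithmetic behind that count (nine ordered pairings of three representations, collapsing to six once mirror cases are shared) can be sanity-checked in a few lines of Python:

```python
from itertools import combinations_with_replacement, product

# The three internal string representations Python can hand over:
reprs = ("latin1", "utf16", "utf32")

ordered = list(product(reprs, repeat=2))                    # 9 ordered pairings
unordered = list(combinations_with_replacement(reprs, 2))   # 6 after mirroring
```

`ordered` has nine entries and `unordered` six, matching the comment's counting.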
The remaining three of the nine cases are mirrors of the last three, so there's no point in generating code for those separately.

Note that a surrogate pair in a Python string has the semantics of a surrogate followed by another surrogate; the result does not have supplementary-plane semantics. I haven't checked whether surrogates promote the string to 32-bit code units, or whether the 16-bit-code-unit representation can contain surrogates that don't have UTF-16 semantics. That is, it's unclear to me whether item 2 can reuse the UTF-16 to UTF-16 comparison specialization.

Note that the raw Python data representation is available via PyO3 "on non-Py_LIMITED_API and little-endian only". If someone really cares, it would make sense to benchmark the collator with these six variants (out-of-repo) against converting to UTF-8 and then using the UTF-8 comparison path.
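The surrogate-pair point is easy to verify in stdlib Python: a pair of surrogate escapes in a literal stays two separate code points and never equals the corresponding astral character:

```python
# A surrogate pair written into a Python str stays two code points;
# it does not combine into one supplementary-plane character.
pair = "\ud83d\ude00"    # high surrogate + low surrogate
real = "\U0001f600"      # the actual astral code point U+1F600

assert len(pair) == 2
assert len(real) == 1
assert pair != real
```

This is unlike UTF-16 semantics, where the same two code units would decode to U+1F600.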
string_representation.md:
There is one significant use of UTF-32 in the real world: Python’s so-called ‘flexible string representation’. See PEP 393. The short version: Python internally stores strings as Latin-1 if they only contain characters ≤ U+00FF; as UTF-16 (guaranteed valid, fwiw) if they contain only characters in the BMP; or otherwise as UTF-32. This is intended to provide the most efficient representation for a majority of strings while retaining O(1) string indexing — it’s much like what the document says about what SpiderMonkey and V8 do, but since Python string indexing, unlike JS string indexing, returns real codepoints and not UTF-16 code units, it adds an extra upgrade to UTF-32.
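On CPython, the three storage widths of PEP 393 are observable via `sys.getsizeof`; the `per_char_bytes` helper below is purely illustrative (the size deltas are a CPython implementation detail, not a language guarantee):

```python
import sys

def per_char_bytes(ch):
    # Storage per character, measured by differencing two lengths of the
    # same repeated character so the object header cancels out.
    # (CPython implementation detail, per PEP 393.)
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

latin1 = per_char_bytes("\u00e9")      # U+00E9: fits the 1-byte (Latin-1) form
bmp    = per_char_bytes("\u4e2d")      # BMP character: 2-byte form
astral = per_char_bytes("\U0001f600")  # supplementary plane: 4-byte form
# (latin1, bmp, astral) == (1, 2, 4) on CPython
```

The widest character in the string determines the representation for the whole string, which is exactly the "upgrade" behavior described above.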
In the Scheme world, R7RS Large can reasonably be expected to require that codepoint indexing into strings (or some variant of strings — it’s possible we’ll end up with a string/text split like Haskell’s) be O(1), so I expect UTF-32 or Python-style flexible string representation to become common in that context, too.
(Also, before flexible string representation was introduced into Python, UTF-32 was used for all strings.)