Unicode Support and UTF Encoding #164
Comments
Currently character strings (as opposed to base strings) are UTF-32. It looks like we have UTF-8 and UCS-4 (which is the same as UTF-32, I think) as external formats. We also have UCS-2, which I think is not actually equivalent to UTF-16. We don't have any support for Unicode algorithms like normal forms, etc. We do not handle some character functions correctly, e.g. |
A potentially crazy idea: integrate babel into the build/bootstrap process and use it? https://github.com/cl-babel/babel It only covers the external format reading/writing/conversion, though. |
Constant-time access to code points isn't really that useful. In most situations where you would think it's what you want, it's actually not, because code points don't correspond exactly to anything meaningful, such as user-facing characters. |
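To make the point about code points versus user-facing characters concrete, here is a minimal Python sketch (Python is used only for illustration; nothing here is Clasp code). It uses the standard `unicodedata` module and assumes the example character "é", which can be written either as one precomposed code point or as a base letter plus a combining accent.

```python
import unicodedata

# "é" written as a base letter followed by a combining acute accent (U+0301).
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# len() counts code points, not user-facing characters.
code_points_decomposed = len(decomposed)
code_points_composed = len(composed)

# Indexing by code point returns only the base letter, dropping the accent.
first = decomposed[0]
```

Both strings display as one character, yet they differ in length, and indexing the decomposed form by code point splits the character apart. This is also where normal forms come in: two strings can only be compared reliably after normalizing them to the same form.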
It is in the context of Common Lisp because most sequence functions use |
I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden? |
Yes. Strings are vectors, and violating the assumption that they can be accessed in constant time per index will lead to lots of problems. You'd have to devise a separate string type entirely, with an API that does not follow the vectors one. I think @Bike did make a library for utf-8 strings using the extensible sequences API as a proof of concept once, for instance. |
In the "sequences" extension to the language, you can implement an iteration protocol which is then used by standard sequence functions (map, reduce, etc.). But strings are vectors, and the iterator implementation for vectors does use indices. As Shinmera said, I did write a library to use UTF-8 encoded strings, which defines iteration without relying on indices so much. For example here is the "next" function: https://github.com/Bike/utf8string/blob/master/utf8string.lisp#L456-L464 Could I ask why this issue has been revived? I mean, we never closed it, because we lack some aspects of Unicode support (like anything analogous to |
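As a rough illustration of iterating without indices, the following Python sketch decodes UTF-8 one code point at a time. It is not the code from the linked library; `next_code_point` and `code_points` are hypothetical names, and the sketch assumes the caller always starts at a code point boundary. It checks lead and continuation bytes but does not reject overlong encodings or surrogates.

```python
from typing import Iterator, Tuple


def next_code_point(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode the code point starting at byte offset pos.

    Returns (code_point, next_pos). Hypothetical helper for illustration.
    """
    lead = data[pos]
    if lead < 0x80:                      # 0xxxxxxx: one byte (ASCII)
        return lead, pos + 1
    elif lead >> 5 == 0b110:             # 110xxxxx: two bytes
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:            # 1110xxxx: three bytes
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:           # 11110xxx: four bytes
        length, cp = 4, lead & 0x07
    else:
        raise ValueError(f"invalid UTF-8 lead byte at offset {pos}")
    if pos + length > len(data):
        raise ValueError(f"truncated sequence at offset {pos}")
    for byte in data[pos + 1 : pos + length]:
        if byte >> 6 != 0b10:            # continuation bytes are 10xxxxxx
            raise ValueError(f"invalid continuation byte after offset {pos}")
        cp = (cp << 6) | (byte & 0x3F)
    return cp, pos + length


def code_points(data: bytes) -> Iterator[int]:
    """Walk the buffer front to back; the iterator state is a byte offset."""
    pos = 0
    while pos < len(data):
        cp, pos = next_code_point(data, pos)
        yield cp


decoded = "".join(chr(cp) for cp in code_points("aé€😀".encode("utf-8")))
```

The iterator state is a byte offset rather than an element index, so sequential traversal stays cheap, while reaching the n-th code point still requires walking from the start. That is why such a type cannot simply stand in for a standard vector-backed string.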
Unicode is the de facto standard encoding for text nowadays. As such, Clasp must support it in order to be able to run a lot of useful software. As an initial suggestion, using UTF-32 internally for `string` would be a good choice, since it fits any Unicode code point into a single character and thus allows constant-time access on strings. The size should not be a problem on modern systems. For external formats, UTF-8 and UTF-16 support should also be added.

Since Clasp's main purpose is interaction with C++ libraries, a variety of support functions and mechanisms might have to be added to ease the conversion and sharing of string data between Clasp and external or bound libraries. This might necessitate supporting different string representation formats internally to allow relatively efficient handling of strings without having to convert every time the Clasp/library boundary is crossed.
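The trade-off between the three encodings can be sketched in a few lines of Python (for illustration only; this is not Clasp code). The helper `utf32_ref` is a hypothetical name, and the sample text is an assumption chosen so that its code points need 1, 2, 3 and 4 bytes in UTF-8.

```python
import struct

text = "aé€😀"  # code points needing 1, 2, 3 and 4 bytes in UTF-8

# The "-le" codecs write little-endian code units without a byte order mark.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # "😀" becomes a surrogate pair (4 bytes)
utf32 = text.encode("utf-32-le")   # every code point takes exactly 4 bytes

sizes = (len(utf8), len(utf16), len(utf32))


def utf32_ref(buffer: bytes, index: int) -> str:
    """Constant-time access: code point i lives at byte offset 4 * i."""
    (cp,) = struct.unpack_from("<I", buffer, 4 * index)
    return chr(cp)


last = utf32_ref(utf32, 3)
```

Only the UTF-32 buffer supports the fixed `4 * index` offset calculation. In UTF-8 and UTF-16 the width of each code point varies, so a library that hands over either format requires a decoding pass, or a separate internal representation, before indexed access is possible.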