rename CharString -> UTF32String (fix #4943) #4946
I don't think it makes sense to store BOMs in memory; they are only needed for data exchange, and even then they're not always used.
That wasn't actually my concern. My concern was that a Char only corresponds to a UTF-32 character when the endianness matches the platform. Thus, in theory we need both big-endian and little-endian variants of both UTF-16 and UTF-32 – and representing the data of a non-native UTF-32 string as an array of Chars doesn't make any sense.
@StefanKarpinski, the statement that "a Mainly this just seems like a serialization/deserialization issue. There should be a method to serialize/deserialize UTF-16 and UTF-32 with the BOM prepended. When it deserializes a stream and detects an endianness mismatch in the BOM, it just needs to swap the byte order as it loads. This is part of the standard, as I understand it. And/or we could include the BOM by default when we convert a UTF-32 string to an (Similarly for UTF-16.)
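A minimal sketch of the BOM-based deserialization described in this comment, written in Python purely for illustration (it is not Julia's API, and `load_utf32` and `dump_utf32` are hypothetical names). It assumes the input is a complete UTF-32 byte buffer: the loader reads the 32-bit units in native order, and if the first unit is a byte-reversed BOM it swaps every unit before dropping the BOM.

```python
import struct

BOM = 0x0000FEFF
SWAPPED_BOM = 0xFFFE0000  # how U+FEFF appears when read in the wrong byte order


def _bswap32(u):
    """Reverse the four bytes of a 32-bit unsigned integer."""
    return int.from_bytes(u.to_bytes(4, "little"), "big")


def load_utf32(data):
    """Hypothetical loader: return native-order code points, without the BOM."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    n = len(data) // 4
    units = list(struct.unpack(f"={n}I", data))  # "=" selects native byte order
    if units and units[0] == SWAPPED_BOM:
        # Endianness mismatch detected via the BOM: swap every unit on load.
        units = [_bswap32(u) for u in units]
    if units and units[0] == BOM:
        units = units[1:]  # the BOM is only for exchange, not kept in memory
    return units


def dump_utf32(code_points):
    """Hypothetical serializer: prepend a BOM and write in native byte order."""
    return struct.pack(f"={len(code_points) + 1}I", BOM, *code_points)
```

With this shape, the in-memory representation is always native order, and the byte order of external data only matters at the load boundary.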
We can certainly encode them any way we want – and native byte order is the obvious choice. However, when reading in UTF-32 data from an external source, it may be in either byte order. If we're going to assume that UTF-32 strings are always in native byte order, then that implies that we will put the bytes into native byte order before using them in any way. That's certainly possible, but it's a fairly significant design decision about how our I/O is going to work. Also, if we're going to potentially be byte-swapping every UTF-16 and UTF-32 string we read, maybe we should just transcode them to UTF-8 and always deal with UTF-8 internally – although transcoding is significantly more complicated since it can't always be done in-place, whereas byte swapping obviously can.
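To illustrate the in-place point, here is a small Python sketch (an illustration only, with hypothetical helper names `swap_in_place` and `to_utf8`). It assumes the platform's `array("I")` type is 4 bytes wide. Byte swapping leaves the buffer the same size, whereas transcoding to UTF-8 yields a buffer of a different length, so it generally needs a new allocation.

```python
from array import array


def swap_in_place(units):
    """Reverse the byte order of every 32-bit unit without reallocating."""
    if units.itemsize != 4:
        raise TypeError("expected an array of 4-byte units")
    units.byteswap()  # mutates the existing buffer


def to_utf8(units):
    """Transcode native-order UTF-32 code points to a new UTF-8 byte string."""
    return "".join(map(chr, units)).encode("utf-8")


# Code points for "H", "é", "€", and an emoji: 1, 2, 3, and 4 bytes in UTF-8.
units = array("I", [0x48, 0xE9, 0x20AC, 0x1F600])
utf32_size = len(units) * units.itemsize  # 16 bytes, unchanged by swapping
utf8_size = len(to_utf8(units))           # 10 bytes, a different length
```

The variable-width output is what makes transcoding harder to do in place: the result can be shorter or longer per character than the 4-byte input unit.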
Better to design it so byte swapping is sometimes needed, since that probably won't be the common case.
For a lot of those cases, it probably makes sense to have a byte-swapping I/O "layer" so that each operation doesn't have to worry about it and reading a
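One way such a byte-swapping I/O layer might look, sketched in Python for illustration only. The class name `UTF32Reader` and its interface are hypothetical, and the sketch assumes the caller already knows the stream's byte order (for example, from a BOM). Callers of `read_char` never deal with byte order themselves.

```python
import io
import struct


class UTF32Reader:
    """Hypothetical wrapper that yields characters from a UTF-32 byte stream."""

    def __init__(self, stream, byteorder):
        if byteorder not in ("little", "big"):
            raise ValueError("byteorder must be 'little' or 'big'")
        self._stream = stream
        # Decode with an explicit byte order, whatever the host's order is.
        self._fmt = "<I" if byteorder == "little" else ">I"

    def read_char(self):
        """Return the next character, or None at end of stream."""
        raw = self._stream.read(4)
        if not raw:
            return None
        if len(raw) != 4:
            raise ValueError("truncated UTF-32 code unit")
        (code_point,) = struct.unpack(self._fmt, raw)
        return chr(code_point)


reader = UTF32Reader(io.BytesIO("hé".encode("utf-32-be")), "big")
first = reader.read_char()
```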
And yes, that's fair enough since this probably won't be common. But I did think it was worth mentioning at least.
As explained in #4943, this deprecates CharString in favor of UTF32String.