rename CharString -> UTF32String (fix #4943) #4946

Merged on Nov 27, 2013 (1 commit)


stevengj
Member

As explained in #4943, this deprecates CharString in favor of UTF32String.

@JeffBezanson
Member

I don't think it makes sense to store BOMs in memory --- they are only needed for data exchange, and even then they're not always used.

JeffBezanson added a commit that referenced this pull request Nov 27, 2013
rename CharString -> UTF32String (fix #4943)
JeffBezanson merged commit 532f639 into JuliaLang:master on Nov 27, 2013
@StefanKarpinski
Member

That wasn't actually my concern. My concern was that a Char only corresponds to a UTF-32 character when the endianness matches the platform. Thus, in theory we need both big and little variants of both UTF-16 and UTF-32 – and representing the data of a non-native UTF-32 string as an array of Chars doesn't make any sense.

@stevengj
Member Author

@StefanKarpinski, the statement that "a Char only corresponds to a UTF-32 character when the endianness matches the platform" doesn't make sense to me. If UTF-32 strings (arrays of Unicode codepoints) are allowed to be encoded in either byte order, then we are free to encode them in the native byte order since that is most convenient for us. The main question is when we should prepend a BOM.

Mainly this just seems like a serialization/deserialization issue. There should be a method to serialize/deserialize UTF-16 and UTF-32 with the BOM prepended. When it deserializes a stream and detects an endianness mismatch in the BOM, it just needs to swap the byte order as it loads. This is part of the standard, as I understand it.

And/or we could include the BOM by default when we convert a UTF-32 string to an Array{Uint32} or Ptr{Uint32}.

(Similarly for UTF-16.)
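
To make the serialize/deserialize idea above concrete, here is a minimal sketch in Python (used only for illustration; it is not the Julia implementation). The function names `encode_utf32` and `decode_utf32` are hypothetical. The sketch assumes that data without a BOM is already in native byte order, which is one possible policy and not something the thread settles.

```python
import struct
import sys

BOM = 0x0000FEFF  # U+FEFF, written as the first 32-bit unit when serializing


def encode_utf32(codepoints, byteorder=sys.byteorder):
    """Serialize code points as UTF-32 in the given byte order, BOM first."""
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(f"{prefix}{len(codepoints) + 1}I", BOM, *codepoints)


def decode_utf32(data):
    """Deserialize UTF-32 bytes into a list of code points.

    A leading BOM selects the byte order and is dropped from the result,
    so the in-memory form never carries a BOM.
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    # Assumption (not from the thread): no BOM means native byte order.
    prefix = "<" if sys.byteorder == "little" else ">"
    if data[:4] == b"\xff\xfe\x00\x00":
        prefix, data = "<", data[4:]
    elif data[:4] == b"\x00\x00\xfe\xff":
        prefix, data = ">", data[4:]
    count = len(data) // 4
    return list(struct.unpack(f"{prefix}{count}I", data))
```

Whichever byte order the writer used, the reader ends up with the same list of code points, which is the property the comment above relies on.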

@StefanKarpinski
Member

We can certainly encode them any way we want – and native byte order is the obvious choice. However, when reading in UTF-32 data from an external source, it may be in either byte order. If we're going to assume that UTF-32 strings are always in native byte order, then that implies that we will put the bytes into native byte order before using them in any way. That's certainly possible, but it's a fairly significant design decision about how our I/O is going to work. Also, if we're going to potentially be byte-swapping every UTF-16 and UTF-32 string we read, maybe we should just transcode them to UTF-8 and always deal with UTF-8 internally – although transcoding is significantly more complicated since it can't always be done in-place, whereas byte swapping obviously can.
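
The contrast drawn above between byte swapping and transcoding can be seen in a small Python sketch (illustrative only, not Julia). It uses the standard `array` module, whose `byteswap()` method swaps each item inside the existing buffer; the sample string is an arbitrary choice.

```python
from array import array

# Pick a typecode that is exactly 4 bytes wide on this platform.
CODE = next(c for c in "ILH" if array(c).itemsize == 4)

# Code points of "A", "é" and U+1F600, stored as native-order 32-bit units.
units = array(CODE, [0x41, 0xE9, 0x1F600])
size_before = len(units) * units.itemsize

units.byteswap()            # in place: same buffer, same size
swapped = units.tolist()
size_after = len(units) * units.itemsize
units.byteswap()            # swapping twice restores the original values

# Transcoding to UTF-8 produces a buffer of a different length
# (1 + 2 + 4 bytes here), so it cannot reuse the UTF-32 storage as-is.
utf8 = "".join(map(chr, units)).encode("utf-8")
```

Here the swapped array still occupies 12 bytes, while the UTF-8 form of the same three characters occupies 7.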

@JeffBezanson
Member

Better to design it so byte swapping is sometimes needed, since that probably won't be the common case.
Byte swapping when reading from a stream is a general concern, applying to all data types wider than a byte. To the extent that something like read(io, Uint32) exists, byte order is an issue.

@StefanKarpinski
Member

For a lot of those cases, it probably makes sense to have a byte-swapping I/O "layer" so that each operation doesn't have to worry about it and reading a Uint32 will just work.
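
One way to picture such a layer is a thin wrapper around a stream that knows the byte order of the source, so callers just ask for an integer. The Python sketch below is illustrative only; `ByteOrderReader` and `read_uint32` are hypothetical names, not an existing Julia or Python API.

```python
import io
import struct


class ByteOrderReader:
    """Hypothetical wrapper that reads integers in a declared byte order."""

    def __init__(self, stream, byteorder):
        if byteorder not in ("little", "big"):
            raise ValueError("byteorder must be 'little' or 'big'")
        self._stream = stream
        self._prefix = "<" if byteorder == "little" else ">"

    def read_uint32(self):
        """Read 4 bytes and return them as an unsigned 32-bit integer."""
        raw = self._stream.read(4)
        if len(raw) != 4:
            raise EOFError("expected 4 bytes")
        return struct.unpack(self._prefix + "I", raw)[0]


# Two big-endian 32-bit values: U+0041 and U+1F600.
reader = ByteOrderReader(io.BytesIO(b"\x00\x00\x00\x41\x00\x01\xf6\x00"), "big")
first = reader.read_uint32()
second = reader.read_uint32()
```

The caller gets the same integers on any platform, because the conversion from the source byte order happens once, inside the wrapper.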

@StefanKarpinski
Member

And yes, that's fair enough since this probably won't be common. But I did think it was worth mentioning at least.
