rename CharString -> UTF32String (fix #4943) #4946

stevengj · 2013-11-26T20:50:08Z

As explained in #4943, this deprecates CharString in favor of UTF32String.

JeffBezanson · 2013-11-27T01:45:29Z

I don't think it makes sense to store BOMs in memory --- they are only needed for data exchange, and even then they're not always used.

StefanKarpinski · 2013-11-27T01:53:55Z

That wasn't actually my concern. My concern was that a Char only corresponds to a UTF-32 character when the endianness matches the platform. Thus, in theory we need both big and little variants of both UTF-16 and UTF-32 – and representing the data of a non-native UTF-32 string as an array of Chars doesn't make any sense.
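To make the endianness point concrete, here is a minimal sketch in Python (not Julia, and not code from this PR). The helper names `interpret_utf32_unit` and `is_valid_code_point` are hypothetical. It shows the same four bytes giving a valid code point in one byte order and a meaningless value in the other.

```python
def interpret_utf32_unit(raw: bytes, byteorder: str) -> int:
    """Interpret exactly four bytes as one UTF-32 code unit in the given byte order."""
    if len(raw) != 4:
        raise ValueError("a UTF-32 code unit is exactly 4 bytes")
    return int.from_bytes(raw, byteorder)


def is_valid_code_point(value: int) -> bool:
    """True for Unicode scalar values: 0..0x10FFFF, excluding surrogates."""
    return 0 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


raw = "A".encode("utf-32-le")                    # bytes 41 00 00 00
as_little = interpret_utf32_unit(raw, "little")  # 0x41, the letter "A"
as_big = interpret_utf32_unit(raw, "big")        # 0x41000000, not a code point
```

Read with the wrong byte order, the value falls outside the Unicode range, which is why an array of native-order characters cannot directly hold non-native UTF-32 data.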

stevengj · 2013-11-27T02:07:17Z

@StefanKarpinski, the statement that "a Char only corresponds to a UTF-32 character when the endianness matches the platform" doesn't make sense to me. If UTF-32 strings (arrays of Unicode codepoints) are allowed to be encoded in either byte order, then we are free to encode them in the native byte order, since that is most convenient for us. The main question is when we should prepend a BOM.

Mainly this just seems like a serialization/deserialization issue. There should be a method to serialize/deserialize UTF-16 and UTF-32 with the BOM prepended. When it deserializes a stream and detects an endianness mismatch in the BOM, it just needs to swap the byte order as it loads. This is part of the standard, as I understand it.

And/or we could include the BOM by default when we convert a UTF-32 string to an Array{Uint32} or Ptr{Uint32}.

(Similarly for UTF-16.)
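A rough sketch of the deserialization step described above, written in Python for illustration only. The function name `decode_utf32` is an assumption, not an API from this PR. It reads the code units in native order, swaps every unit if the BOM arrives byte-reversed, and strips the BOM.

```python
import struct

BOM = 0x0000FEFF          # U+FEFF read in the correct byte order
SWAPPED_BOM = 0xFFFE0000  # U+FEFF read in the opposite byte order


def decode_utf32(data: bytes) -> list:
    """Return native-order code points from UTF-32 bytes.

    If the data starts with a byte-reversed BOM, every unit is swapped.
    A leading BOM is stripped. Data without a BOM is assumed to be in
    native order (an assumption of this sketch).
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data length must be a multiple of 4")
    count = len(data) // 4
    # "=" selects native byte order with standard sizes (I = 4 bytes).
    units = list(struct.unpack("={}I".format(count), data))
    if units and units[0] == SWAPPED_BOM:
        # Endianness mismatch: reverse the bytes of each code unit.
        units = [int.from_bytes(u.to_bytes(4, "little"), "big") for u in units]
    if units and units[0] == BOM:
        units = units[1:]
    return units
```

For example, `decode_utf32("\ufeffhi".encode("utf-32-be"))` and `decode_utf32("\ufeffhi".encode("utf-32-le"))` should both yield the code points of "hi", whatever the platform's byte order.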

StefanKarpinski · 2013-11-27T02:23:18Z

We can certainly encode them any way we want – and native byte order is the obvious choice. However, when reading in UTF-32 data from an external source, it may be in either byte order. If we're going to assume that UTF-32 strings are always in native byte order, then that implies that we will put the bytes into native byte order before using them in any way. That's certainly possible, but it's a fairly significant design decision about how our I/O is going to work. Also, if we're going to potentially be byte-swapping every UTF-16 and UTF-32 string we read, maybe we should just transcode them to UTF-8 and always deal with UTF-8 internally – although transcoding is significantly more complicated since it can't always be done in-place, whereas byte swapping obviously can.
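The in-place versus not-in-place distinction can be sketched as follows, in Python for illustration. The helper names are hypothetical. Byte swapping keeps the buffer the same size, while transcoding to UTF-8 produces a buffer of a different length.

```python
import array


def swap_utf32_in_place(units: array.array) -> None:
    """Reverse the bytes of every 4-byte code unit without reallocating."""
    if units.itemsize != 4:
        raise ValueError("expected an array of 4-byte code units")
    units.byteswap()


def transcode_utf32_to_utf8(units) -> bytes:
    """Build a new UTF-8 buffer; its size depends on the code points."""
    return "".join(chr(u) for u in units).encode("utf-8")


# Typecode "I" is a 4-byte unsigned int on common platforms (assumed here).
units = array.array("I", [0x68, 0xE9, 0x1F600])
utf8 = transcode_utf32_to_utf8(units)  # 1 + 2 + 4 = 7 bytes instead of 12
swap_utf32_in_place(units)             # still 3 units, 12 bytes
```

Swapping twice restores the original values, so the operation is cheap to undo. The UTF-8 result needs a separately sized buffer, which is the extra complication mentioned above.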

JeffBezanson · 2013-11-27T02:31:43Z

Better to design it so that byte swapping is needed only occasionally, since non-native byte order probably won't be the common case.
Byte swapping when reading from a stream is a general concern, applying to all data types wider than a byte. To the extent that something like read(io, Uint32) exists, byte order is an issue.

StefanKarpinski · 2013-11-27T02:39:07Z

For a lot of those cases, it probably makes sense to have a byte-swapping I/O "layer" so that each operation doesn't have to worry about it and reading a Uint32 will just work.
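One possible shape for such a layer, sketched in Python for illustration. The class `ByteOrderReader` and its method are hypothetical and are not part of Julia's I/O API. The wrapper is told the stream's byte order once, and every multi-byte read then returns a native integer.

```python
import io
import struct


class ByteOrderReader:
    """Wrap a binary stream and decode integers in a fixed byte order."""

    def __init__(self, stream, byteorder: str):
        if byteorder not in ("little", "big"):
            raise ValueError("byteorder must be 'little' or 'big'")
        self._stream = stream
        # struct prefix: "<" little-endian, ">" big-endian
        self._prefix = "<" if byteorder == "little" else ">"

    def read_uint32(self) -> int:
        """Read one 32-bit unsigned integer, swapping bytes if needed."""
        raw = self._stream.read(4)
        if len(raw) != 4:
            raise EOFError("stream ended before a full 32-bit value")
        return struct.unpack(self._prefix + "I", raw)[0]


reader = ByteOrderReader(io.BytesIO(b"\x00\x00\xfe\xff\x00\x00\x00\x68"), "big")
first = reader.read_uint32()   # 0xFEFF, the BOM
second = reader.read_uint32()  # 0x68, the letter "h"
```

Callers never handle byte order themselves, which matches the idea that reading a 32-bit integer should just work.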

StefanKarpinski · 2013-11-27T02:39:54Z

And yes, that's fair enough since this probably won't be common. But I did think it was worth mentioning at least.

rename CharString -> UTF32String (fix JuliaLang#4943)

ce40f89

JeffBezanson added a commit that referenced this pull request Nov 27, 2013

Merge pull request #4946 from stevengj/utf32

532f639

rename CharString -> UTF32String (fix #4943)

JeffBezanson merged commit 532f639 into JuliaLang:master Nov 27, 2013
