Define more conversions to UTF8String #13588

malmaud · 2015-10-13T18:41:24Z

stevengj · 2015-10-13T18:50:22Z

base/unicode/utf8.jl

@@ -312,6 +312,11 @@ function convert(::Type{UTF8String}, a::Vector{UInt8}, invalids_as::AbstractStri
 end
 convert(::Type{UTF8String}, s::AbstractString) = utf8(bytestring(s))

+function convert{T<:Union{UInt16, UInt32}}(::Type{UTF8String}, a::Vector{T}, len=length(a)-1)


utf16(a::Vector{UInt16}) does not assume that a is NUL-terminated (if a ends with a NUL codeunit, then the resulting string contains a NUL). So, it would be more consistent to use len=length(a) here, even if that means you need to do UTF8String(buf, length(buf)-1) at most of the call sites.

It would be a bigger change, but would now be the right time to revisit the nul termination of utf16 and utf32 strings?
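
A minimal sketch of the two conventions discussed in this review comment, written against later Julia releases: it uses `transcode`, which is not part of this PR, in place of the `UTF8String`/`utf16` API, and `buf` is a hypothetical buffer. It only illustrates that converting the whole vector keeps a trailing NUL code unit in the string, while dropping the terminator is left to the caller.

```julia
# Hypothetical buffer, as a C API returning NUL-terminated UTF-16 might fill it.
buf = UInt16[0x0068, 0x0069, 0x0000]   # "hi" followed by a NUL code unit

# Converting the whole vector (the len = length(a) convention)
# keeps the NUL as an ordinary character of the result.
with_nul = transcode(String, buf)

# Under that convention the caller strips the terminator explicitly.
without_nul = transcode(String, buf[1:end-1])
```

With the `len=length(a)-1` default proposed in the diff, the second form would be the implicit behaviour instead.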

stevengj · 2015-10-13T18:51:43Z

Needs a test case, and probably a doc update since this is a user-visible function.

tkelman · 2015-10-13T21:05:28Z

Last time around we argued pretty strongly against conversions between string types and uint vectors with different encodings. I don't see what's changed.

stevengj · 2015-10-13T21:27:33Z

@tkelman, can you cite the relevant issue where this discussion took place?

We noticed a lot of cases where we were doing utf8(UTF16String(data)) and it would be better to avoid the intermediate allocation of a UTF16String.

tkelman · 2015-10-14T06:02:55Z

It was during one or more of the big messy unicode overhaul PRs, IIRC. UTF16String(data::Vector{UInt16}) should be a cheap non-copying wrapper, shouldn't it?

ScottPJones · 2015-10-14T11:18:20Z

The problem with UTF16String (as I discussed back in the reviews of my PRs) is that it requires a terminating 0x0000 word, forcing a copy just to add it in many cases.
The reason I had the encode_to_utf8 and encode_to_utf16 functions was to handle the inconsistencies present in the Julia string support (visible \0 or not, UInt* or Char; at least the Char one has been fixed as I had suggested).
I don't recall who, besides @tkelman, was opposed to having the methods that took a Vector{UInt16} or Vector{UInt32} (i.e. without a trailing NUL word).
I still think, as this use case shows, that it is very important functionality and should be in Base.
(We are also using it in our code, and have to call the functions I added in Base, i.e. Base.encode_to_utf16, directly, since Tony removed the convert methods I had in my PR, which this PR seems to just re-add.)

tkelman · 2015-10-14T12:32:16Z

If the underlying issue is trailing nuls, let's address the underlying issue now that we're in a dev period. A vector of unsigned integers is not the same thing as a string (the latter is usually represented by the former, but that is an implementation detail, not a fundamental invariant) and it's not a good idea IMO to define conversions that conflate encoding and storage by pretending otherwise. I seem to recall Jameson and a few others being of a similar opinion at the time.

StefanKarpinski · 2015-10-14T12:38:51Z

IMO, ensuring that trailing NUL bytes are present should be done on-demand when strings are passed to C. We now have the machinery in place to make that happen at the ccall entry point.

ScottPJones · 2015-10-14T12:54:58Z

@StefanKarpinski Yes, that was a very good addition. The problem now is to go through a lot of code in Base and packages, and ensure that any ccalls that really need a NUL-terminated C string use Cstring or Cwstring instead of Ptr{UInt8} or Ptr{UInt16}.
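
A minimal sketch of what declaring a ccall argument as Cstring does, assuming a later Julia release and a C runtime that exports `strlen`. The thread predates those releases, so the behaviour in the development version under discussion may differ. The variable names are illustrative only.

```julia
# Declaring the argument as Cstring asks ccall to check for embedded NULs
# and to pass a pointer to NUL-terminated data.
n = ccall(:strlen, Csize_t, (Cstring,), "hello")

# A string with an embedded NUL is rejected rather than silently truncated.
rejected = try
    ccall(:strlen, Csize_t, (Cstring,), "he\0llo")
    false
catch err
    err isa ArgumentError
end
```

Declaring the same argument as Ptr{UInt8} would skip the embedded-NUL check, which is why the call sites need auditing.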

stevengj · 2015-10-14T16:18:48Z

Actually, I think in all the cases where we have utf8(UTF16String(data)), the data is NUL-terminated, so UTF16String won't make a copy, and hence you're right that this PR is probably superfluous.

stevengj · 2015-10-14T16:20:26Z

@StefanKarpinski, we certainly have the machinery to add trailing NUL words when a string is passed as Cwstring. But this would entail making a copy of the string at every call site. It doesn't seem worth it.

malmaud · 2015-10-14T16:21:27Z

Alright, I'm just going to close this. If we want to have a separate PR that revisits null-terminated UInt16 considerations, that would make sense but it's probably not something I can take point on.

StefanKarpinski · 2015-10-14T20:06:11Z

@stevengj: my hypothesis is that every C API that takes NUL-terminated data only deals with small strings so the copying would have negligible impact. It would also be avoidable if the NUL already exists, so you could avoid it by just pre-converting to NUL-terminated form.

Define more conversions to UTF8String

6d0bb26

stevengj reviewed Oct 13, 2015

stevengj added the unicode label (Related to unicode characters and encodings) Oct 13, 2015

malmaud closed this Oct 14, 2015

tkelman deleted the jmm/utf8constructors branch October 14, 2015 16:59
