Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define more conversions to UTF8String #13588

Closed
wants to merge 1 commit into from
Closed

Conversation

malmaud
Copy link
Contributor

@malmaud malmaud commented Oct 13, 2015

@@ -312,6 +312,11 @@ function convert(::Type{UTF8String}, a::Vector{UInt8}, invalids_as::AbstractStri
end
convert(::Type{UTF8String}, s::AbstractString) = utf8(bytestring(s))

function convert{T<:Union{UInt16, UInt32}}(::Type{UTF8String}, a::Vector{T}, len=length(a)-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf16(a::Vector{UInt16}) does not assume that a is NUL-terminated (if a ends with a NUL codeunit, then the resulting string contains a NUL). So, it would be more consistent to use len=length(a) here, even if that means you need to do UTF8String(buf, length(buf)-1) at most of the call sites.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a bigger change, but would now be the right time to revisit the nul termination of utf16 and utf32 strings?

@stevengj
Copy link
Member

Needs a test case, and probably a doc update since this is a user-visible function.

@stevengj stevengj added the unicode Related to unicode characters and encodings label Oct 13, 2015
@tkelman
Copy link
Contributor

tkelman commented Oct 13, 2015

Last time around we argued pretty strongly against conversions between string types and uint vectors with different encodings. I don't see what's changed.

@stevengj
Copy link
Member

@tkelman, can you cite the relevant issue where this discussion took place?

We noticed a lot of cases where we were doing utf8(UTF16String(data)) and it would be better to avoid the intermediate allocation of a UTF16String.

@tkelman
Copy link
Contributor

tkelman commented Oct 14, 2015

It was during one or more of the big messy unicode overhaul PR's IIRC. UTF16String(data::Vector{UInt16}) should be a cheap non-copying wrapper, shouldn't it?

@ScottPJones
Copy link
Contributor

The problem (as I discussed back in the reviews of my PRs) with UTF16String, is that it requires a terminating 0x0000 word, forcing a copy just to add it in many cases.
The reason I had encode_to_utf8 and encode_to_utf16 functions was to handle the inconsistencies present in the Julia string support (visible \0 or not, UInt* or Char - at least the Char one has been fixed as I had suggested).
I don't recall who was opposed to having the methods that took a Vector{UInt16} or Vector{UInt32} (i.e without a trailing NUL word) besides @tkelman.
I still think, as this use case shows, that it is very important functionality, and should be in base
(we also are using it in our code, having to call the functions I added in Base, i.e. Base.encode_to_utf16 directly, since Tony removed the convert methods I had in my PR (which this PR seems to just re-add).

@tkelman
Copy link
Contributor

tkelman commented Oct 14, 2015

If the underlying issue is trailing nuls, let's address the underlying issue now that we're in a dev period. A vector of unsigned integers is not the same thing as a string (the latter is usually represented by the former, but that is an implementation detail, not a fundamental invariant) and it's not a good idea IMO to define conversions that conflate encoding and storage by pretending otherwise. I seem to recall Jameson and a few others being of a similar opinion at the time.

@StefanKarpinski
Copy link
Member

IMO, ensuring that trailing NUL bytes are present should be done on-demand when strings are passed to C. We now have the machinery in place to make that happen at the ccall entry point.

@ScottPJones
Copy link
Contributor

@StefanKarpinski Yes, that was a very good addition. The problem now is to go through a lot of code in Base and packages, and ensure that any ccall's that really need a nul-terminated C string use Cstring or Cwstring instead of Ptr{UInt8}, Ptr{UInt16}.

@stevengj
Copy link
Member

Actually, I think in all the cases where we have utf8(UTF16String(data)), the data is NUL-terminated, so UTF16String won't make a copy, and hence you're right that this PR is probably superfluous.

@stevengj
Copy link
Member

@StefanKarpinski, we certainly have the machinery to add trailing NUL words when a string is passed as Cwstring. But this would entail making a copy of the string at every call site. It doesn't seem worth it.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 14, 2015

Alright, I'm just going to close this. If we want to have a separate PR that revisits null-terminated UInt16 considerations, that would make sense but it's probably not something I can take point on.

@malmaud malmaud closed this Oct 14, 2015
@tkelman tkelman deleted the jmm/utf8constructors branch October 14, 2015 16:59
@StefanKarpinski
Copy link
Member

@stevengj: my hypothesis is that every C API that takes NUL-terminated data only deals with small strings so the copying would have negligible impact. It would also be avoidable if the NUL already exists, so you could avoid it by just pre-converting to NUL-terminated form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants