-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define more conversions to UTF8String #13588
Conversation
@@ -312,6 +312,11 @@ function convert(::Type{UTF8String}, a::Vector{UInt8}, invalids_as::AbstractStri | |||
end | |||
convert(::Type{UTF8String}, s::AbstractString) = utf8(bytestring(s)) | |||
|
|||
function convert{T<:Union{UInt16, UInt32}}(::Type{UTF8String}, a::Vector{T}, len=length(a)-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utf16(a::Vector{UInt16})
does not assume that a
is NUL-terminated (if a
ends with a NUL codeunit, then the resulting string contains a NUL). So, it would be more consistent to use len=length(a)
here, even if that means you need to do UTF8String(buf, length(buf)-1)
at most of the call sites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a bigger change, but would now be the right time to revisit the nul termination of utf16 and utf32 strings?
Needs a test case, and probably a doc update since this is a user-visible function. |
Last time around we argued pretty strongly against conversions between string types and uint vectors with different encodings. I don't see what's changed. |
@tkelman, can you cite the relevant issue where this discussion took place? We noticed a lot of cases where we were doing |
It was during one or more of the big messy unicode overhaul PR's IIRC. |
The problem (as I discussed back in the reviews of my PRs) with UTF16String, is that it requires a terminating 0x0000 word, forcing a copy just to add it in many cases. |
If the underlying issue is trailing nuls, let's address the underlying issue now that we're in a dev period. A vector of unsigned integers is not the same thing as a string (the latter is usually represented by the former, but that is an implementation detail, not a fundamental invariant) and it's not a good idea IMO to define conversions that conflate encoding and storage by pretending otherwise. I seem to recall Jameson and a few others being of a similar opinion at the time. |
IMO, ensuring that trailing NUL bytes are present should be done on-demand when strings are passed to C. We now have the machinery in place to make that happen at the ccall entry point. |
@StefanKarpinski Yes, that was a very good addition. The problem now is to go through a lot of code in Base and packages, and ensure that any ccall's that really need a nul-terminated C string use Cstring or Cwstring instead of Ptr{UInt8}, Ptr{UInt16}. |
Actually, I think in all the cases where we have |
@StefanKarpinski, we certainly have the machinery to add trailing NUL words when a string is passed as |
Alright, I'm just going to close this. If we want to have a separate PR that revisits null-terminated UInt16 considerations, that would make sense but it's probably not something I can take point on. |
@stevengj: my hypothesis is that every C API that takes NUL-terminated data only deals with small strings so the copying would have negligible impact. It would also be avoidable if the NUL already exists, so you could avoid it by just pre-converting to NUL-terminated form. |
cf #13582 (comment)