Fix #10959 bugs with UTF-32 conversions #11607
# Get rest of character ch from 3-byte UTF-8 sequence in dat
@inline function get_utf8_3byte(dat, pos, ch)
    @inbounds return ((ch & 0xf) << 12) | (UInt32(dat[pos-1] & 0x3f) << 6) | (dat[pos] & 0x3f)
Probably, the `@inbounds` declaration should be in the caller? From inside the function, you cannot be sure that the index is correct, and from the caller, you don't know that the function assumes this.
(Also, missing new line before next function.)
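To make the suggestion concrete, here is a minimal sketch, assuming a hypothetical helper `decode_utf8_3byte` and a hypothetical caller `first_char_3byte` (neither name is from the PR). The helper uses ordinary bounds-checked indexing, and the length check lives in the caller, which is the only place that knows the indices are valid. How far `@inbounds` reaches into an inlined helper depends on the Julia version, so it is left out here.

```julia
# Hypothetical helper: combine the lead byte `ch` of a 3-byte UTF-8 sequence
# with the two continuation bytes that end at index `pos`.
# Indexing is bounds-checked; the helper makes no assumption about `pos`.
decode_utf8_3byte(dat, pos, ch) =
    (UInt32(ch & 0x0f) << 12) | (UInt32(dat[pos-1] & 0x3f) << 6) | UInt32(dat[pos] & 0x3f)

# Hypothetical caller: validates the length and the lead byte before decoding.
function first_char_3byte(dat::Vector{UInt8})
    length(dat) >= 3 || throw(ArgumentError("need at least 3 bytes"))
    ch = dat[1]
    (ch & 0xf0) == 0xe0 || throw(ArgumentError("not a 3-byte lead byte"))
    return decode_utf8_3byte(dat, 3, ch)
end

euro = UInt8[0xe2, 0x82, 0xac]                # UTF-8 encoding of U+20AC
@assert first_char_3byte(euro) == 0x000020ac  # decoded code point
```

Keeping the check next to the code that establishes the invariant means a later reader of the helper does not have to guess what the caller guaranteed.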
I didn't know I could do that... plus I'd intended those to be private functions, only called inside the conversion or validation code.
> I'd intended those to be private functions, only called inside the conversion or validation code.

Yeah, but if you can make it safer, better do it. Heard of this recent story about the `sector_div` kernel macro doing things people wouldn't expect when they were not experts of that part of the code? http://lwn.net/Articles/645720/
Yep, I changed those, thanks...
Just wondering, is there any really good reason we absolutely need a UTF32 type in base instead of as a package? Is it actually commonly used at all?
@tkelman, if I remember correctly, there are some OSX libraries that only deal with UTF32 (iODBC is one example), so I think that was one of the motivations for including it (plus the fact that we already support UTF8 and UTF16, and there are no other packages for different string encodings yet).
For any library built with a C compiler where wchar_t is defined as char32_t... you absolutely need UTF32 support in Base.
Not if we don't use any of those libraries for any other code in base. My question is could you copy-paste all code related to UTF32 types, move it to a package, and have everything else continue to work equally well? My hypothesis is yes (grepping base and my
This does not answer Tony's question:
So, how many Julia packages use it? Edit: Sorry, the last response was not there when writing this.
@tkelman I'm afraid that is totally off topic.
This PR is still nearly as large as the original one. It'll get smaller if the separate validation and UTF16 PRs get merged and this one gets rebased to not count those changes, but for the UTF32 type in particular the benefit vs code size tradeoff is really not self-evident. But looking now at how the
@tkelman This isn't anything as large! Just two files modified and 246 net new lines, over half of those documentation and 50 of them tests. This PR is set up so that this part can be reviewed without having to wait until first #11575 and then #11551 are merged in, since I was told to split things into smaller, more manageable chunks.
The updates to the comments are now in, as requested.
OK, this has now been rebased since #11551 just got merged in. Please take a look! Thanks!
### Throws:
* `UnicodeError`
"
function convert(::Type{UTF8String}, dat::Vector{UInt32})
This method doesn't belong here; it has all the same issues that `convert(::Type{UTF8String}, dat::Vector{UInt16})` had in the other PR.
@tkelman Those have been removed; hopefully the added `reinterpret` and new variable won't slow things down. Net zero lines saved: the two one-line `convert` methods using `reinterpret` were removed, and one line was added to each of two methods.
### Throws:
* `UnicodeError`
"
function convert(::Type{UTF8String}, chrs::Vector{Char})
Seems as though this method belongs in `utf8.jl`, and `convert(::Type{UTF16String}, chrs::Vector{Char})` belongs in `utf16.jl`?
They depend on knowledge of UTF-32 encoding, hence here is best.
Doesn't sound like a good argument to me. Better organize code logically based on the types it acts on, rather than by required skills. You typically don't think "I'm good at UTF-32, I'm going to work on all places where it's used in Julia", but rather "I need to fix a bug with `UTF8String`, let's see where this code lives".
I didn't mean a human knowledge of UTF-32 encoding. I meant that, logically, `UTF32String` is essentially a `Vector{Char}`, so the functions dealing with them seemed to belong here.
The functions to deal with these were also here originally, and I didn't think I should change that organization. If you all feel that reorganization is necessary, that is outside the scope of this PR.
This passed 2 out of 3 on Travis; the one that failed reported something about a prebuild rebuild expiring, which doesn't seem to have anything to do with the change.
Hm, assertion failure on OSX Travis.
Do you think that has anything to do with the change? This had been passing builds on all platforms, and I thought there had been other changes in the low-level code recently.
Have the issues with Travis been resolved, so that the testing can be restarted on this?
Bump. This has passed all tests, and I've moved things as @nalimilan suggested. This would also add some more coverage for string testing.
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
Ugh! This failed because spawn doesn't seem to be working on a JULIA_CPU_CORES == 1 build
Ok, all green again!
Bump: this has been essentially unchanged for over a month (just removal of a method, and moving two methods to utf8.jl and utf16.jl). The external problems causing Travis or Appveyor problems also seem to have been fixed.
Bump: Any further issues with getting these bug fixes and increased testing in?
You still never really addressed the initial feedback of having too many specialized code paths for a not-very-convincing benefit. Many many times it has been asked whether you had a representative application or stand-in skeleton of one that spends a substantial amount of time in conversion to or from UTF32, and whether you can accomplish the bug fixes in a generic way without needing separate methods for every possible combination.
If you just search for
### Returns:
* `UTF8String`
"
function encode_to_utf8{T<:Union{UInt16, UInt32}}(::Type{T}, dat, len)
@ScottPJones do you have plans to use this function somewhere? Otherwise why not include it as part of the convert method?
Never mind, I see it is used in the utf32 conversion code.
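For readers following along, the kind of validating conversion under discussion can be sketched roughly as below. This is not the PR's `encode_to_utf8`: the function name `utf32_to_utf8_bytes` is hypothetical, and `ArgumentError` stands in for the `UnicodeError` the PR throws. It encodes UTF-32 code units as UTF-8 bytes and rejects surrogates and values above U+10FFFF.

```julia
# Hypothetical sketch: encode UTF-32 code units as UTF-8 bytes with validation.
function utf32_to_utf8_bytes(dat::Vector{UInt32})
    out = UInt8[]
    for c in dat
        if c < 0x80                      # 1 byte: ASCII
            push!(out, c % UInt8)
        elseif c < 0x800                 # 2 bytes
            push!(out, 0xc0 | ((c >> 6) % UInt8),
                       0x80 | ((c & 0x3f) % UInt8))
        elseif c < 0x10000               # 3 bytes, surrogates are invalid
            0xd800 <= c <= 0xdfff &&
                throw(ArgumentError("surrogate code point in UTF-32 input"))
            push!(out, 0xe0 | ((c >> 12) % UInt8),
                       0x80 | (((c >> 6) & 0x3f) % UInt8),
                       0x80 | ((c & 0x3f) % UInt8))
        elseif c <= 0x10ffff             # 4 bytes
            push!(out, 0xf0 | ((c >> 18) % UInt8),
                       0x80 | (((c >> 12) & 0x3f) % UInt8),
                       0x80 | (((c >> 6) & 0x3f) % UInt8),
                       0x80 | ((c & 0x3f) % UInt8))
        else
            throw(ArgumentError("code point above U+10FFFF"))
        end
    end
    return out
end

# 'A', U+00E9, U+20AC, U+1F600
bytes = utf32_to_utf8_bytes(UInt32[0x41, 0xe9, 0x20ac, 0x1f600])
@assert bytes == UInt8[0x41, 0xc3, 0xa9, 0xe2, 0x82, 0xac, 0xf0, 0x9f, 0x98, 0x80]
```

A single loop like this handles every input width generically; the PR instead counts the output length up front during validation, which this sketch skips for brevity.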
@tkelman I don't find that argument very convincing. If we are going to support multiple string types, there is unavoidable complexity in inter-converting between multiple representations. lgtm
Fix #10959 bugs with UTF-32 conversions
Fine (we've managed so far with quite a bit less complexity than this...), though the vector-to-string conversions with mismatched element types are breaking the abstraction and should be deleted. I also couldn't find a single instance of
Yes, but instead of the back and forth why don't we just fix it now and move on?
I'm about to.
Added new `convert` methods that use the `check_string` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR #11551 and #11575