Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #10959 bugs with UTF-32 conversions #11607

Merged
merged 4 commits into from
Jul 19, 2015

Conversation

ScottPJones
Copy link
Contributor

Added new convert methods that use the check_string function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR #11551 and #11575


# Get rest of character ch from 3-byte UTF-8 sequence in dat
@inline function get_utf8_3byte(dat, pos, ch)
@inbounds return ((ch & 0xf) << 12) | (UInt32(dat[pos-1] & 0x3f) << 6) | (dat[pos] & 0x3f)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, the @inbounds declaration should be in the caller? From inside the function, you cannot be sure that the index is correct, and from the caller, you don't know that the function assumes this.

(Also, missing new line before next function.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know I could do that... plus I'd intended those to be private functions, only called inside the conversion or validation code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd intended those to be private functions, only called inside the conversion or validation code.

Yeah, but if you can make it safer, better do it. Heard of this recent story about the sector_div kernel macro doing things people wouldn't expect when they were not experts of that part of the code? http://lwn.net/Articles/645720/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I changed those, thanks...

@tkelman tkelman added the unicode Related to unicode characters and encodings label Jun 7, 2015
@tkelman
Copy link
Contributor

tkelman commented Jun 7, 2015

Just wondering, is there any really good reason we absolutely need a UTF32 type in base instead of as a package? Is it actually commonly used at all?

@quinnj
Copy link
Member

quinnj commented Jun 7, 2015

@tkelman, if I remember correctly there are some OSX libraries that only deal with UTF32, (I know iODBC is that way as an example), so I think that was one of the motivations for including (plus the fact that we already support UTF8 and UTF16 and there are no other packages for different string encodings yet).

@ScottPJones
Copy link
Contributor Author

Any library where the C compiler uses wchar_t is defined as char32_t... you absolutely need UTF32 support in Base.
Note, nothing fancy needs to be in Base, for UTF16 or UTF32, just basic string operations, conversions, and validation...

@tkelman
Copy link
Contributor

tkelman commented Jun 7, 2015

Any library where the C compiler uses wchar_t is defined as char32_t... you absolutely need UTF32 support in Base.

Not if we don't use any of those libraries for any other code in base. My question is could you copy-paste all code related to UTF32 types, move it to a package, and have everything else continue to work equally well? My hypothesis is yes (grepping base and my ~/.julia for UTF32String shows very very few matches). And it would only be a disruptive change for packages that need to interface to such libraries, of which there don't look to be very many. If this type is rarely used in practice, I'm not sure it's worth hundreds of lines of specialized code to deal with it.

@tknopp
Copy link
Contributor

tknopp commented Jun 7, 2015

This does not answer Tonys question:

Is it actually commonly used at all?

So, how many Julia packages use it?

Edit: Sorry the last response was not there when writing this.

@ScottPJones
Copy link
Contributor Author

@tkelman I'm afraid that is totally off topic. UTF32String is currently in Base, and if you want to see about having moved out of base, you should raise an issue about it.
What does whether or not core code calls a library using wchar_t, have to do with the fact that being able to access C/C++ libraries is part of the core functionality of Julia?

@tkelman
Copy link
Contributor

tkelman commented Jun 7, 2015

This PR is still nearly as large as the original one. It'll get smaller if the separate validation and UTF16 PR's get merged and this one gets rebased to not count those changes, but for the UTF32 type in particular the benefit vs code size tradeoff is really not self-evident.

But looking now at how the Cwstring type works, it would probably be messy and not worth trying to move UTF32String out of base. So nevermind on that, but this PR could still stand to aim for generality over verbosity in the code.

@ScottPJones
Copy link
Contributor Author

@tkelman This isn't anything as large! Just two files modified, 246 net new lines, over half of those documentation, and 50 were testing, This PR is set up so that any review (of this part) can be done, without having to wait until first #11575 and then #11551 are merged in, and since I was told to split things into smaller, more manageable chunks.

@ScottPJones
Copy link
Contributor Author

The updates to the comments are now in, as requested.

@ScottPJones
Copy link
Contributor Author

Updated again, ready to go after #11575 and #11551 get merged (if they do!)

@ScottPJones
Copy link
Contributor Author

OK, this has now been rebased since #11551 just got merged in. Please take a look! Thanks!

### Throws:
* `UnicodeError`
"
function convert(::Type{UTF8String}, dat::Vector{UInt32})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method doesn't belong here, it has all the same issues that convert(::Type{UTF8String}, dat::Vector{UInt16}) had in the other PR

@ScottPJones
Copy link
Contributor Author

@tkelman Those have been removed, hopefully the added reinterpret and new variable won't slow things down. 0 lines saved, 2 one line convert with reinterpret removed, 1 line added to 2 methods.

### Throws:
* `UnicodeError`
"
function convert(::Type{UTF8String}, chrs::Vector{Char})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems as though this method belongs in utf8.jl, and convert(::Type{UTF16String}, chrs::Vector{Char}) belongs in utf16.jl ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They depend on knowledge of UTF-32 encoding, hence here is best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't sound like a good argument to me. Better organize code logically based on the types it acts on, rather than by required skills. You typically don't think "I'm good at UTF-32, I'm going to work on all places where it's used in Julia", but rather "I need to fix a bug with UTF8String, let's see where this code lives".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean a human knowledge of UTF-32 encoding, I meant that logically, UTF32String is essentially a Vector{Char}, so the functions dealing with them seemed to belong here.
The functions to deal with these were here originally also, I didn't think I should change that organization. If you all feel that reorganization is necessary, that is outside the scope of this PR.

@ScottPJones
Copy link
Contributor Author

This did pass 2 out of 3 on travis, something about prebuild rebuild expired on the one that failed, doesn't seem to have anything to do with the change.

@tkelman
Copy link
Contributor

tkelman commented Jul 1, 2015

hm, assertion failure on osx travis Assertion failed: (bp), function jl_deserialize_value, file dump.c, line 1019. /Users/travis/build.sh: line 41: 10005 Abort trap: 6 /tmp/julia/bin/julia -J local.ji -e 'true'

@ScottPJones
Copy link
Contributor Author

Do you think that has anything to do with the change? This had been having passing builds on all platforms, and there have been other changes I thought in the low-level code recently.
I'm testing on a Mac myself, and haven't seen any assertion failures.

@ScottPJones
Copy link
Contributor Author

Have the issues with travis been resolved, so that the testing can be restarted on this?
The one test out of 4 that failed had to so with dealing with the .ji files, where there has been a lot of changes recently, it doesn't seem likely that it has anything to do with UTF-32 conversions.

@ScottPJones
Copy link
Contributor Author

Bump. This has passed all tests, I've moved things how @nalimilan suggested, this would also add some more coverage for string testing.

Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
@ScottPJones
Copy link
Contributor Author

Ugh! This failed because spawn doesn't seem to be working on a JULIA_CPU_CORES == 1 build

@ScottPJones
Copy link
Contributor Author

Ok, all green again!

@ScottPJones
Copy link
Contributor Author

Bump: this has been essentially unchanged for over a month (just removal of a method, and moving two methods to utf8.jl and utf16.jl). The external problems causing Travis or Appveyor problems also seem to have been fixed.

@ScottPJones
Copy link
Contributor Author

Bump: Any further issues with getting these bug fixes and increased testing in?

@tkelman
Copy link
Contributor

tkelman commented Jul 19, 2015

You still never really addressed the initial feedback of having too many specialized code paths for a not-very-convincing benefit. Many many times it has been asked whether you had a representative application or stand-in skeleton of one that spends a substantial amount of time in conversion to or from UTF32, and whether you can accomplish the bug fixes in a generic way without needing separate methods for every possible combination.

@ScottPJones
Copy link
Contributor Author

If you just search for UTF32String, you won't find that much, but there are many calls that use Cwstring.
It has already been discussed here that Cwstring is UTF32String for non-Windows platforms, which seems to me to make handling UTF32String correctly fairly important.
This already is about as generic as it could be, fixes bugs in code that is already in Base, and adds much needed unit testing to show that those bugs are indeed fixed.

### Returns:
* `UTF8String`
"
function encode_to_utf8{T<:Union{UInt16, UInt32}}(::Type{T}, dat, len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottPJones do you have plans to use this function somewhere? Otherwise why not include it as part of the convert method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I see it is used in the uft32 conversion code.

@jakebolewski
Copy link
Member

@tkelman I don't find that argument very convincing. If we are going to support multiple string types, there is unavoidable complexity in inter-converting between multiple representations.

lgtm

jakebolewski added a commit that referenced this pull request Jul 19, 2015
@jakebolewski jakebolewski merged commit c08b1bb into JuliaLang:master Jul 19, 2015
@tkelman
Copy link
Contributor

tkelman commented Jul 19, 2015

Fine (we've managed so far with quite a bit less complexity than this...), though the vector-to-string conversions with mismatched element types are breaking the abstraction and should be deleted. I also couldn't find a single instance of Cwstring in any package that I use, aside from Compat.

@jakebolewski
Copy link
Member

Yes, but instead of the back and forth why don't we just fix it now and move on?

@tkelman
Copy link
Contributor

tkelman commented Jul 19, 2015

I'm about to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants