Fix #10959 bugs with UTF-32 conversions #11607

ScottPJones · 2015-06-07T09:29:58Z

Added new convert methods that use the check_string function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR #11551 and #11575
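
For context on what "validating input" means for UTF-32 data, here is a minimal sketch. It is not the PR's `check_string` implementation. It assumes the strict Unicode rule that a UTF-32 code unit must be a scalar value: at most 0x10FFFF and outside the surrogate range 0xD800 to 0xDFFF. The function name `is_valid_utf32_sketch` is hypothetical, and the code targets current Julia syntax rather than the 0.4-era syntax used in this PR.

```julia
# Hypothetical validator for illustration only (not the PR's check_string).
# Returns true when every code unit is a Unicode scalar value.
function is_valid_utf32_sketch(units::AbstractVector{UInt32})
    for u in units
        # Reject values above the Unicode range and UTF-16 surrogate code points.
        if u > 0x0010ffff || (0x0000d800 <= u <= 0x0000dfff)
            return false
        end
    end
    return true
end

is_valid_utf32_sketch(UInt32[0x61, 0x20ac, 0x1f600])  # 'a', '€', '😀'
is_valid_utf32_sketch(UInt32[0xd800])                  # lone surrogate
is_valid_utf32_sketch(UInt32[0x110000])                # beyond U+10FFFF
```

A real validator may also want options such as accepting surrogates for lossy round trips; that choice is left out of this sketch.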

nalimilan · 2015-06-07T10:39:14Z

base/utf16.jl

+
+# Get rest of character ch from 3-byte UTF-8 sequence in dat
+@inline function get_utf8_3byte(dat, pos, ch)
+    @inbounds return ((ch & 0xf) << 12) | (UInt32(dat[pos-1] & 0x3f) << 6) | (dat[pos] & 0x3f)


Probably, the @inbounds declaration should be in the caller? From inside the function, you cannot be sure that the index is correct, and from the caller, you don't know that the function assumes this.

(Also, missing new line before next function.)

I didn't know I could do that... plus I'd intended those to be private functions, only called inside the conversion or validation code.

I'd intended those to be private functions, only called inside the conversion or validation code.

Yeah, but if you can make it safer, better do it. Heard of this recent story about the sector_div kernel macro doing things people wouldn't expect when they were not experts of that part of the code? http://lwn.net/Articles/645720/

Yep, I changed those, thanks...
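
The suggestion above (let the caller own `@inbounds`) can be sketched roughly as follows. This is an illustration written against current Julia, not the code that was merged: the names `utf8_3byte_sketch` and `decode_3byte_at` are hypothetical, and `Base.@propagate_inbounds` is used so that the caller's `@inbounds` reaches the indexing inside the helper.

```julia
# Hypothetical helper: combine the lead byte `ch` of a 3-byte UTF-8 sequence
# with the two continuation bytes that end at index `pos`.
# The helper itself does not assert that the indices are valid.
Base.@propagate_inbounds function utf8_3byte_sketch(dat::AbstractVector{UInt8},
                                                    pos::Integer, ch::Integer)
    return (UInt32(ch & 0x0f) << 12) |
           (UInt32(dat[pos-1] & 0x3f) << 6) |
           UInt32(dat[pos] & 0x3f)
end

# Hypothetical caller: it checks the index range once, and only then
# opts out of bounds checking for the accesses it has just validated.
function decode_3byte_at(dat::AbstractVector{UInt8}, i::Integer)
    (firstindex(dat) <= i && i + 2 <= lastindex(dat)) ||
        throw(BoundsError(dat, i:i+2))
    @inbounds ch = dat[i]
    @inbounds return utf8_3byte_sketch(dat, i + 2, ch)
end

bytes = codeunits("€")       # U+20AC is encoded as 0xe2 0x82 0xac
decode_3byte_at(bytes, 1)    # yields the code point as a UInt32
```

With this split, calling the helper without `@inbounds` still gets normal bounds checks, so a caller that does not know the helper's assumptions cannot read out of range by accident.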

tkelman · 2015-06-07T11:07:12Z

Just wondering, is there any really good reason we absolutely need a UTF32 type in base instead of as a package? Is it actually commonly used at all?

quinnj · 2015-06-07T14:03:20Z

@tkelman, if I remember correctly there are some OSX libraries that only deal with UTF32, (I know iODBC is that way as an example), so I think that was one of the motivations for including (plus the fact that we already support UTF8 and UTF16 and there are no other packages for different string encodings yet).

ScottPJones · 2015-06-07T18:37:07Z

Any library where the C compiler uses wchar_t is defined as char32_t... you absolutely need UTF32 support in Base.
Note, nothing fancy needs to be in Base, for UTF16 or UTF32, just basic string operations, conversions, and validation...

tkelman · 2015-06-07T19:52:37Z

Any library where the C compiler uses wchar_t is defined as char32_t... you absolutely need UTF32 support in Base.

Not if we don't use any of those libraries for any other code in base. My question is could you copy-paste all code related to UTF32 types, move it to a package, and have everything else continue to work equally well? My hypothesis is yes (grepping base and my ~/.julia for UTF32String shows very very few matches). And it would only be a disruptive change for packages that need to interface to such libraries, of which there don't look to be very many. If this type is rarely used in practice, I'm not sure it's worth hundreds of lines of specialized code to deal with it.

tknopp · 2015-06-07T20:00:15Z

This does not answer Tony's question:

Is it actually commonly used at all?

So, how many Julia packages use it?

Edit: Sorry the last response was not there when writing this.

ScottPJones · 2015-06-07T20:02:42Z

@tkelman I'm afraid that is totally off topic. UTF32String is currently in Base, and if you want to see about having it moved out of Base, you should raise an issue about it.
What does whether or not core code calls a library using wchar_t, have to do with the fact that being able to access C/C++ libraries is part of the core functionality of Julia?

tkelman · 2015-06-07T20:18:05Z

This PR is still nearly as large as the original one. It'll get smaller if the separate validation and UTF16 PR's get merged and this one gets rebased to not count those changes, but for the UTF32 type in particular the benefit vs code size tradeoff is really not self-evident.

But looking now at how the Cwstring type works, it would probably be messy and not worth trying to move UTF32String out of base. So nevermind on that, but this PR could still stand to aim for generality over verbosity in the code.

ScottPJones · 2015-06-07T21:31:18Z

@tkelman This isn't anywhere near as large! Just two files modified and 246 net new lines, over half of those documentation and 50 of them tests. This PR is set up so that this part can be reviewed without having to wait until first #11575 and then #11551 are merged in, since I was told to split things into smaller, more manageable chunks.

ScottPJones · 2015-06-08T08:59:30Z

The updates to the comments are now in, as requested.

ScottPJones · 2015-06-09T01:45:35Z

Updated again, ready to go after #11575 and #11551 get merged (if they do!)

ScottPJones · 2015-07-01T05:01:30Z

OK, this has now been rebased since #11551 just got merged in. Please take a look! Thanks!

tkelman · 2015-07-01T10:50:58Z

base/utf32.jl

+### Throws:
+*   `UnicodeError`
+"
+function convert(::Type{UTF8String}, dat::Vector{UInt32})


this method doesn't belong here, it has all the same issues that convert(::Type{UTF8String}, dat::Vector{UInt16}) had in the other PR

ScottPJones · 2015-07-01T14:40:19Z

@tkelman Those have been removed; hopefully the added reinterpret and new variable won't slow things down. Net 0 lines saved: two one-line convert methods that used reinterpret were removed, and one line was added to each of two methods.

tkelman · 2015-07-01T15:35:11Z

base/utf32.jl

+### Throws:
+*   `UnicodeError`
+"
+function convert(::Type{UTF8String}, chrs::Vector{Char})


seems as though this method belongs in utf8.jl, and convert(::Type{UTF16String}, chrs::Vector{Char}) belongs in utf16.jl ?

They depend on knowledge of UTF-32 encoding, hence here is best.

Doesn't sound like a good argument to me. Better organize code logically based on the types it acts on, rather than by required skills. You typically don't think "I'm good at UTF-32, I'm going to work on all places where it's used in Julia", but rather "I need to fix a bug with UTF8String, let's see where this code lives".

I didn't mean a human knowledge of UTF-32 encoding, I meant that logically, UTF32String is essentially a Vector{Char}, so the functions dealing with them seemed to belong here.
The functions to deal with these were here originally also, I didn't think I should change that organization. If you all feel that reorganization is necessary, that is outside the scope of this PR.

ScottPJones · 2015-07-01T22:37:35Z

This did pass 2 out of 3 on travis, something about prebuild rebuild expired on the one that failed, doesn't seem to have anything to do with the change.

tkelman · 2015-07-01T22:56:25Z

hm, assertion failure on osx travis:

Assertion failed: (bp), function jl_deserialize_value, file dump.c, line 1019.
/Users/travis/build.sh: line 41: 10005 Abort trap: 6 /tmp/julia/bin/julia -J local.ji -e 'true'

ScottPJones · 2015-07-01T23:09:20Z

Do you think that has anything to do with the change? This had been having passing builds on all platforms, and there have been other changes I thought in the low-level code recently.
I'm testing on a Mac myself, and haven't seen any assertion failures.

ScottPJones · 2015-07-03T02:43:15Z

Have the issues with travis been resolved, so that the testing can be restarted on this?
The one test out of 4 that failed had to do with handling the .ji files, where there have been a lot of changes recently; it doesn't seem likely that it has anything to do with UTF-32 conversions.

ScottPJones · 2015-07-07T04:00:09Z

Bump. This has passed all tests, I've moved things how @nalimilan suggested, this would also add some more coverage for string testing.


ScottPJones · 2015-07-10T23:06:39Z

Ugh! This failed because spawn doesn't seem to be working on a JULIA_CPU_CORES == 1 build

ScottPJones · 2015-07-11T03:41:16Z

Ok, all green again!

ScottPJones · 2015-07-12T13:35:15Z

Bump: this has been essentially unchanged for over a month (just removal of a method, and moving two methods to utf8.jl and utf16.jl). The external problems causing Travis or Appveyor problems also seem to have been fixed.

ScottPJones · 2015-07-19T13:58:39Z

Bump: Any further issues with getting these bug fixes and increased testing in?

tkelman · 2015-07-19T14:38:37Z

You still never really addressed the initial feedback of having too many specialized code paths for a not-very-convincing benefit. Many many times it has been asked whether you had a representative application or stand-in skeleton of one that spends a substantial amount of time in conversion to or from UTF32, and whether you can accomplish the bug fixes in a generic way without needing separate methods for every possible combination.

ScottPJones · 2015-07-19T14:57:54Z

If you just search for UTF32String, you won't find that much, but there are many calls that use Cwstring.
It has already been discussed here that Cwstring is UTF32String for non-Windows platforms, which seems to me to make handling UTF32String correctly fairly important.
This already is about as generic as it could be, fixes bugs in code that is already in Base, and adds much needed unit testing to show that those bugs are indeed fixed.
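
To make the platform point concrete, here is a small sketch in current Julia (not the 0.4-era API discussed in this PR). It relies on `Cwchar_t` and `transcode`, both of which exist in Base today; the variable names are just for illustration. On platforms where `wchar_t` is 4 bytes the wide representation has one code unit per code point, while on Windows it is UTF-16 and non-BMP characters take two code units.

```julia
s = "a€😀"                     # 3 code points; the last one is outside the BMP

# Convert to the platform's wide-character representation.
w = transcode(Cwchar_t, s)

if sizeof(Cwchar_t) == 4
    # 4-byte wchar_t (typical on non-Windows platforms): UTF-32 style data,
    # so the number of code units equals the number of code points.
    println("UTF-32 style wide string with ", length(w), " code units")
else
    # 2-byte wchar_t (Windows): UTF-16, so the emoji needs a surrogate pair.
    println("UTF-16 style wide string with ", length(w), " code units")
end

# Round trip back to a UTF-8 String.
roundtrip = transcode(String, w)
```

Passing a `String` to a `ccall` argument declared as `Cwstring` performs a similar conversion, which is why correct UTF-32 handling matters even when `UTF32String` itself rarely appears in package code.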

jakebolewski · 2015-07-19T15:07:45Z

base/unicode/utf8.jl

+### Returns:
+* `UTF8String`
+"
+function encode_to_utf8{T<:Union{UInt16, UInt32}}(::Type{T}, dat, len)


@ScottPJones do you have plans to use this function somewhere? Otherwise why not include it as part of the convert method?

Never mind, I see it is used in the utf32 conversion code.

jakebolewski · 2015-07-19T15:15:22Z

@tkelman I don't find that argument very convincing. If we are going to support multiple string types, there is unavoidable complexity in inter-converting between multiple representations.

lgtm


tkelman · 2015-07-19T15:16:31Z

Fine (we've managed so far with quite a bit less complexity than this...), though the vector-to-string conversions with mismatched element types are breaking the abstraction and should be deleted. I also couldn't find a single instance of Cwstring in any package that I use, aside from Compat.

jakebolewski · 2015-07-19T15:18:24Z

Yes, but instead of the back and forth why don't we just fix it now and move on?

tkelman · 2015-07-19T15:21:27Z

I'm about to.

nalimilan reviewed Jun 7, 2015

tkelman added the unicode Related to unicode characters and encodings label Jun 7, 2015

ScottPJones force-pushed the spj/fixutf32 branch from ec34c71 to ce55f3f Compare June 8, 2015 07:53

ScottPJones force-pushed the spj/fixutf32 branch from ce55f3f to 183074e Compare June 8, 2015 23:54

ScottPJones mentioned this pull request Jun 9, 2015

Fix #10959, fix #11463 bugs with UTF-8 conversions #11624

Merged

ScottPJones force-pushed the spj/fixutf32 branch from 183074e to 19e853d Compare June 11, 2015 15:49

ScottPJones mentioned this pull request Jun 11, 2015

Add UTF encoding validity functions #11575

Merged

ScottPJones force-pushed the spj/fixutf32 branch 2 times, most recently from f33fd4d to 4afd98e Compare June 14, 2015 17:16

ScottPJones mentioned this pull request Jun 15, 2015

"invalid UTF-8 character index" error in writetable JuliaData/DataFrames.jl#813

Closed

ScottPJones force-pushed the spj/fixutf32 branch from 4afd98e to 1f1a6fd Compare June 22, 2015 21:42

ScottPJones force-pushed the spj/fixutf32 branch 2 times, most recently from a9a5d4b to 000e7ae Compare July 1, 2015 04:59

tkelman reviewed Jul 1, 2015

ScottPJones force-pushed the spj/fixutf32 branch from 725f9f4 to 8248581 Compare July 1, 2015 15:56

ScottPJones force-pushed the spj/fixutf32 branch from 8248581 to b6049b8 Compare July 6, 2015 21:46

ScottPJones added 4 commits July 10, 2015 15:55

Fix JuliaLang#10959 UTF-32 conversion errors

f2b83a2

Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

c09d9e7

Remove conversions from Vector{UInt32}

cab2e4c

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more …

4424f42

…logical

ScottPJones force-pushed the spj/fixutf32 branch from b6049b8 to 4424f42 Compare July 10, 2015 20:04

jakebolewski reviewed Jul 19, 2015

jakebolewski added a commit that referenced this pull request Jul 19, 2015

Merge pull request #11607 from ScottPJones/spj/fixutf32

c08b1bb

Fix #10959 bugs with UTF-32 conversions

jakebolewski merged commit c08b1bb into JuliaLang:master Jul 19, 2015
