Fix #10959, fix #11463 bugs with UTF-8 conversions #11624

ScottPJones · 2015-06-09T01:21:23Z

This change is based on #11575, #11551, and #11607.
It uses the new, more generic function is_valid_continuation instead of is_utf8_continuation and is_utf8_start, which only worked on UInt8 values.
It fixes the way a Vector{UInt8} gets converted to a UTF8String by calling unsafe_checkstring and handling things like overly long encodings (which occur in Modified UTF-8, used by Java and other systems, and CESU-8, used by Oracle, MySQL, and others).
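
For readers unfamiliar with the term: an overly long (overlong) encoding spends more bytes than the code point needs. The best-known case is Modified UTF-8, which writes U+0000 as the two bytes 0xC0 0x80. The sketch below is illustrative only. `decode2` is a hypothetical helper, not the PR's unsafe_checkstring or anything in Base, and it handles just the two-byte form.

```julia
# Hypothetical helper (not from this PR or Base): decode a two-byte UTF-8
# sequence 110xxxxx 10xxxxxx and report whether it is overlong, meaning
# the code point is below 0x80 and would have fit in a single byte.
function decode2(b1::UInt8, b2::UInt8)
    (b1 & 0xe0) == 0xc0 || throw(ArgumentError("not a two-byte lead byte"))
    (b2 & 0xc0) == 0x80 || throw(ArgumentError("not a continuation byte"))
    cp = (UInt32(b1 & 0x1f) << 6) | UInt32(b2 & 0x3f)
    return cp, cp < 0x80
end

nul = decode2(0xc0, 0x80)     # Modified UTF-8 form of U+0000: (0, true)
eacute = decode2(0xc3, 0xa9)  # ordinary U+00E9: (0xe9, false)
```

A strict validator would reject the first sequence, while a lenient input path could accept it and emit the single byte 0x00 instead.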

ScottPJones · 2015-07-09T01:28:32Z

Again, another failure that seems totally unrelated (this time on Appveyor).

tkelman · 2015-07-10T07:27:28Z

marking as WIP until #11607 gets merged

tkelman · 2015-07-19T14:49:32Z

Can this be changed to be independent of #11607? UTF8 is much more commonly used and I suspect people are more convinced of the benefit vs code size tradeoff of just ca87916 by itself (would be fine to squash with the relevant parts of e8b0ba8 if you want).

jakebolewski · 2015-07-19T16:02:22Z

Bump. It would be great to address @tkelman's comments here and rebase so we can get this merged.

ScottPJones · 2015-07-19T17:37:15Z

Yes, I'm in the process of doing so (but am at the beach with my family!)

jakebolewski · 2015-07-19T18:12:01Z

Get off the computer and enjoy the beach! It would be nice to get your work in now that the 0.4 window is closing, so let's try to wrap this up sometime in the next week.

ScottPJones · 2015-07-19T18:40:14Z

Hehe, pushed now. As soon as Travis and Appveyor have had their way with it, hopefully it can be merged! (Check out my tweet of my weekend "office"!) 😀 https://twitter.com/gandalfsoftware/status/622836838164787201

ScottPJones · 2015-07-19T19:52:35Z

Bump: tests passed, hopefully all ready now.

ScottPJones · 2015-07-23T15:33:28Z

Bump: ready to merge? Anything else I need to do? (I also want to start moving this to 100% coverage.)

StefanKarpinski · 2015-07-23T15:42:58Z

Why does this PR have so many unrelated changes in it?

ScottPJones · 2015-07-23T15:54:52Z

What is unrelated? There are 3 things here:

1. new version of convert function that fixes bugs
2. new tests to show correctness of fixes
3. minor change to documentation, " -> """, since Nolta nicely added triple quoting to the parser.

StefanKarpinski · 2015-07-23T19:02:48Z

All the triple quotes for one thing – that would be better in a separate PR, which could be merged right away since it's obviously an ok change. It's unclear if the change to using is_valid_continuation is part of this change or not. Part of the reason it's unclear is because there's no explanation in the commit message and other changes that are obviously not part of this commit, so who knows?

ScottPJones · 2015-07-24T00:31:36Z

@StefanKarpinski Does this look good now? I also separated out the " to """ change; it is in #12287.

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

ScottPJones · 2015-07-28T18:43:27Z

Bump: anything else that needs changing? Thanks!

Fix #10959, fix #11463 bugs with UTF-8 conversions

ScottPJones · 2015-07-28T19:38:06Z

Thanks!

JeffBezanson · 2015-07-28T19:49:28Z

I'm fine with merging this, but I have a couple questions. Apologies if these have already been discussed elsewhere:

1. Do we really need to support CESU-8? The Unicode technical report on it says it should only be used for internal processing, not interchange.
2. The convert from byte vector to string that takes invalids_as now seems to be significantly different from the convert method without that argument. Is that ok? Seems confusing to me.

I don't understand the CESU-8 code. Here's an example:

julia> convert(UTF8String, utf8(Char[0xd800,0xdd01]).data)
"\U1ff401"

I tried to produce the CESU-8 encoding of \U10101. Did I do it right?
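
As a point of reference for the question above: CESU-8 first splits a supplementary code point into its UTF-16 surrogate pair and then writes each surrogate in the ordinary three-byte UTF-8 pattern. The following sketch works this through for U+10101 with plain integer arithmetic. The helper names (`surrogate_pair`, `three_bytes`, `cesu8`, `decode_cesu8_pair`) are hypothetical and are not part of Base or of this PR.

```julia
# Split a supplementary code point (U+10000..U+10FFFF) into its
# UTF-16 high and low surrogates.
function surrogate_pair(cp::Integer)
    0x10000 <= cp <= 0x10ffff || throw(ArgumentError("not a supplementary code point"))
    v = UInt32(cp) - 0x10000
    return 0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff)
end

# Encode one 16-bit value in the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
three_bytes(u::Integer) =
    UInt8[0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f)]

# CESU-8: both surrogates, each written as three bytes (six bytes total).
cesu8(cp::Integer) = vcat(map(three_bytes, surrogate_pair(cp))...)

# Inverse: read two three-byte units and recombine the surrogate pair.
function decode_cesu8_pair(b::AbstractVector{UInt8})
    length(b) == 6 || throw(ArgumentError("expected exactly six bytes"))
    unit(i) = (UInt32(b[i] & 0x0f) << 12) |
              (UInt32(b[i+1] & 0x3f) << 6) |
               UInt32(b[i+2] & 0x3f)
    hi, lo = unit(1), unit(4)
    (0xd800 <= hi <= 0xdbff && 0xdc00 <= lo <= 0xdfff) ||
        throw(ArgumentError("not a surrogate pair"))
    return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00)
end

pair = surrogate_pair(0x10101)        # (0xd800, 0xdd01)
bytes = cesu8(0x10101)                # ed a0 80 ed b4 81
roundtrip = decode_cesu8_pair(bytes)  # 0x10101
```

For U+10101 the surrogates come out as 0xd800 and 0xdd01, matching the Char values in the REPL example. Assuming utf8 wrote each surrogate as three bytes, a correct conversion of those six bytes would be expected to yield \U10101 rather than \U1ff401.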

ScottPJones · 2015-07-28T20:22:24Z

No problem with the questions!

Both Modified UTF-8 (used by Java and many others) and CESU-8 (used by Oracle, MySQL, and others) are important to be able to handle, but for input only. They are not considered "valid" by isvalid, and the new convert functions only return valid ASCII, UTF8, UTF16, or UTF32 strings. In the past, I saw many cases of both of those variations, but I did make it possible (at least with checkstring for now) to accept only 100% valid UTF-encoded strings.
Right, I believe way back at the beginning I pointed out that convert with invalids_as was not consistent (even before any of my changes): it was only implemented for UTF8String, not for UTF16String and UTF32String, and overly long encodings could get converted to the "invalids_as" string instead of the intended value. Making a convert that also lets you specify how to handle long encodings and invalid values is definitely on my list of Julia string improvements.
Any suggestions as to the best Julian way of doing that? I don't think a positional argument is good, but whatever you think is best. (You may see that for checkstring, I segregated different types of long encodings, so that you might allow Modified UTF-8 or CESU-8 but not arbitrary long encodings.)
The last looks like a bug crept in! I'll fix that as soon as I get back from the beach. Thanks for catching that! 😳

JeffBezanson · 2015-07-28T20:28:09Z

Ok, thanks. I agree a positional argument for invalids_as is not ideal.

ScottPJones · 2015-07-29T02:41:28Z

See #12360 and #12358 for the bug fix, and for further discussion of how to best handle invalids

coveralls · 2017-03-28T18:02:35Z

Changes Unknown when pulling 91305f7 on ScottPJones:spj/fixutf8 into JuliaLang:master.

tkelman added the unicode label (Related to unicode characters and encodings) Jun 9, 2015

ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from f65c6df to dde4ee4 Compare June 11, 2015 17:36

ScottPJones mentioned this pull request Jun 11, 2015

Add UTF encoding validity functions #11575

Merged

ScottPJones force-pushed the spj/fixutf8 branch 3 times, most recently from 44a3343 to 6cb1ed2 Compare June 14, 2015 15:02

ScottPJones mentioned this pull request Jun 15, 2015

"invalid UTF-8 character index" error in writetable JuliaData/DataFrames.jl#813

Closed

ScottPJones changed the title ~~Fix #10959 bugs with UTF-8 conversions~~ Fix #10959, fix #11463 bugs with UTF-8 conversions Jun 16, 2015

ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 19f575c to 6e4087e Compare June 22, 2015 22:42

ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from ce5a4e6 to f0e3086 Compare July 1, 2015 13:57

ScottPJones force-pushed the spj/fixutf8 branch from f0e3086 to 4af9f8f Compare July 9, 2015 00:20

ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 46aece2 to 48cf84f Compare July 9, 2015 20:53

tkelman changed the title ~~Fix #10959, fix #11463 bugs with UTF-8 conversions~~ WIP: Fix #10959, fix #11463 bugs with UTF-8 conversions Jul 10, 2015

ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 9dce7f5 to e8b0ba8 Compare July 12, 2015 14:07

ScottPJones force-pushed the spj/fixutf8 branch from e8b0ba8 to e71378c Compare July 19, 2015 18:34

tkelman changed the title ~~WIP: Fix #10959, fix #11463 bugs with UTF-8 conversions~~ Fix #10959, fix #11463 bugs with UTF-8 conversions Jul 21, 2015

ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 9466c6d to 2fca588 Compare July 24, 2015 00:24

ScottPJones force-pushed the spj/fixutf8 branch from 2fca588 to 37650ef Compare July 27, 2015 23:38

Fix JuliaLang#10959, fix JuliaLang#11463 bugs with UTF-8 conversions

91305f7

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

ScottPJones force-pushed the spj/fixutf8 branch from 37650ef to 91305f7 Compare July 28, 2015 16:01

StefanKarpinski added a commit that referenced this pull request Jul 28, 2015

Merge pull request #11624 from ScottPJones/spj/fixutf8

416a23e

Fix #10959, fix #11463 bugs with UTF-8 conversions

StefanKarpinski merged commit 416a23e into JuliaLang:master Jul 28, 2015

ScottPJones deleted the spj/fixutf8 branch July 28, 2015 19:38

ScottPJones mentioned this pull request Jul 29, 2015

Fix a bug handling CESU-8 strings in convert(UTF8String, Vector{UInt8} #12360

Merged
