Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #10959, fix #11463 bugs with UTF-8 conversions #11624

Merged
merged 1 commit into from
Jul 28, 2015

Conversation

ScottPJones
Copy link
Contributor

This change is based off of #11575, #11551, #11607.
It uses the new more generic function is_valid_continuation instead of is_utf8_continuation and is_utf8_start, which only worked on UInt8 values.
It fixes the way a Vector{UInt8} gets converted to a UTF8String, by calling unsafe_checkstring,
and dealing with things like overly long encodings (which happen in Modified UTF-8 used by Java and other systems, and CESU-8 used by Oracle, MySQL and others).

@tkelman tkelman added the unicode Related to unicode characters and encodings label Jun 9, 2015
@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from f65c6df to dde4ee4 Compare June 11, 2015 17:36
@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 3 times, most recently from 44a3343 to 6cb1ed2 Compare June 14, 2015 15:02
@ScottPJones ScottPJones changed the title Fix #10959 bugs with UTF-8 conversions Fix #10959, fix #11463 bugs with UTF-8 conversions Jun 16, 2015
@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 19f575c to 6e4087e Compare June 22, 2015 22:42
@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from ce5a4e6 to f0e3086 Compare July 1, 2015 13:57
@ScottPJones
Copy link
Contributor Author

Again, another failure that seems totally unrelated (this time on Appveyor).

@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 46aece2 to 48cf84f Compare July 9, 2015 20:53
@tkelman tkelman changed the title Fix #10959, fix #11463 bugs with UTF-8 conversions WIP: Fix #10959, fix #11463 bugs with UTF-8 conversions Jul 10, 2015
@tkelman
Copy link
Contributor

tkelman commented Jul 10, 2015

marking as WIP until #11607 gets merged

@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 9dce7f5 to e8b0ba8 Compare July 12, 2015 14:07
@tkelman
Copy link
Contributor

tkelman commented Jul 19, 2015

Can this be changed to be independent of #11607? UTF8 is much more commonly used and I suspect people are more convinced of the benefit vs code size tradeoff of just ca87916 by itself (would be fine to squash with the relevant parts of e8b0ba8 if you want).

@jakebolewski
Copy link
Member

Bump. It would be great to address @tkelman's comments here and rebase so we can get this merged.

@ScottPJones
Copy link
Contributor Author

Yes, I'm in the process of doing so (but am at the beach with my family!)

@jakebolewski
Copy link
Member

Get off the computer and enjoy the beach! It would be nice to get your work in now that the 0.4 window is closing, so it would be nice to wrap this up sometime in the next week.

@ScottPJones
Copy link
Contributor Author

Hehe, pushed now, as soon as Travis and Appveyor have there way with it, hopefully it can be merged! (check out my tweet of my weekend "office"!) 😀 https://twitter.com/gandalfsoftware/status/622836838164787201

@ScottPJones
Copy link
Contributor Author

Bump: tests passed, hopefully all ready now.

@tkelman tkelman changed the title WIP: Fix #10959, fix #11463 bugs with UTF-8 conversions Fix #10959, fix #11463 bugs with UTF-8 conversions Jul 21, 2015
@ScottPJones
Copy link
Contributor Author

Bump: ready to merge? Anything else I need to do? (want to start moving this to 100% coverage also)

@StefanKarpinski
Copy link
Member

Why does this PR have so many unrelated changes in it?

@ScottPJones
Copy link
Contributor Author

What is unrelated? There are 3 things here:

  1. new version of convert function that fixes bugs
  2. new tests to show correctness of fixes
  3. minor change to documentation, " -> """, since Nolta nicely added triplequoting to the parser.

@StefanKarpinski
Copy link
Member

All the triple quotes for one thing – that would be better in a separate PR, which could be merged right away since it's obviously an ok change. It's unclear if the change to using is_valid_continuation is part of this change or not. Part of the reason it's unclear is because there's no explanation in the commit message and other changes that are obviously not part of this commit, so who knows?

@ScottPJones ScottPJones force-pushed the spj/fixutf8 branch 2 times, most recently from 9466c6d to 2fca588 Compare July 24, 2015 00:24
@ScottPJones
Copy link
Contributor Author

@StefanKarpinski Does this look good now? I separated out the " to """ also, it is in #12287.

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
@ScottPJones
Copy link
Contributor Author

Bump: anything else that needs changing? Thanks!

StefanKarpinski added a commit that referenced this pull request Jul 28, 2015
@StefanKarpinski StefanKarpinski merged commit 416a23e into JuliaLang:master Jul 28, 2015
@ScottPJones
Copy link
Contributor Author

Thanks!

@ScottPJones ScottPJones deleted the spj/fixutf8 branch July 28, 2015 19:38
@JeffBezanson
Copy link
Member

I'm fine with merging this, but I have a couple questions. Apologies if these have already been discussed elsewhere:

  • Do we really need to support CESU-8? The unicode tech report on it says it should only be used for internal processing, and not interchange.
  • The convert from byte vector to string that takes invalids_as now seems to be significantly different from the the convert method without that argument. Is that ok? Seems confusing to me.

I don't understand the CESU-8 code. Here's an example:

julia> convert(UTF8String, utf8(Char[0xd800,0xdd01]).data)
"\U1ff401"

I tried to produce the CESU-8 encoding of \U10101. Did I do it right?

@ScottPJones
Copy link
Contributor Author

No problem with the questions!

  • Both Modified UTF-8 (used by Java and many others) and CESU-8 (used by Oracle, MySQL, and others) are important to be able to handle for input only. They are not considered "valid" by isvalid,
    and the new convert functions only return valid ASCII, UTF8, UTF16, or UTF32 strings. In my past, I saw many cases of both of those variations, but I did make it possible (at least with checkstring for now) to only accept 100% valid UTF encoded strings.
  • Right, I believe way back at the beginning, I pointed out the convert with invalids_as as not being consistent (even before any of my changes), as it was only implemented for UTF8String, and not for UTF16String and UTF32String, and also that overly long encodings could get converted to the "invalids_as" string, instead of the intended value. Making a convert that also lets you specify how to handle long encodings and invalid values is definitely on my list of Julia string improvements.
    Any suggestions as to the best Julian way of doing that? I don't think a positional argument is good, but whatever you think is best. (you may see that for checkstring, I segregated different types of long encodings, so that you might allow Modified UTF-8, or CESU-8, but not arbitrary long encodings).
  • The last looks like a bug crept in! I'll fix that as soon as I get back from the beach. Thanks for catching that! 😳

@JeffBezanson
Copy link
Member

Ok, thanks. I agree a positional argument for invalids_as is not ideal.

@ScottPJones
Copy link
Contributor Author

See #12360 and #12358 for the bug fix, and for further discussion of how to best handle invalids

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 91305f7 on ScottPJones:spj/fixutf8 into ** on JuliaLang:master**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants