Bugs in Unicode handling with UTF8String #11463

ScottPJones · 2015-05-28T01:57:33Z

There are a number of bugs related to handling UTF-8 encoding in Julia.
A number of these are shown by the test routine in the following gist:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
The types of errors are dealing with:

overlong encoded characters
missing continuation bytes at the end of a string
encoding of UTF-16 surrogates (where there is a surrogate pair, encoded as 2 3-byte sequences)
handling of invalid UTF-16 surrogates (either trailing first, or missing trailing)
characters outside legal Unicode range (i.e.> 0x10ffff)

The text was updated successfully, but these errors were encountered:

JeffBezanson · 2015-05-28T02:16:23Z

Are the bugs all in converting between different encodings, or are you expecting the UTF8String constructor to throw errors? Currently it's intentional that the UTF8String constructor never throws errors. That comes from trying to be more robust when given invalid data. I'd be ok with changing this however.

ScottPJones · 2015-05-28T02:21:05Z

No, this is strictly with the convert and utf8 methods, just like the ones I'm trying to fix with UTF-16/32
(the never ending PR!).
I did bring up an issue that there's a nasty hole, in that there's no guarantee that strings are immutable, because you can keep an alias to the array handed to UTF*String and fiddle with it's private parts.
Maybe that's a 0.5 cleanup... (making the public UTF*String constructors safe, while leaving a backdoor for things like the utf* and convert methods to call... I have no idea what would be a good approach...)

JeffBezanson · 2015-05-28T02:41:39Z

I'm curious: is there some way you plan to make use of encoding-related errors at the application level? I've often seen data sets with a bit of corruption, or maybe UTF-8 with some Latin-1 mixed in, and it can be nice to just ignore the bad data and keep running. Granted this is pretty fast and loose, but recovering from the exceptions can be quite difficult.

ScottPJones · 2015-05-28T02:49:21Z

I noticed that there was a specific convert method (unfortunately, only for UTF-8, I believe), that has a 3rd argument, for a replacement character/string. There could also be a 4th argument, for options (like I have at the low level), for how invalid data is handled... (allow overlong nul, allow all overlong encodings, allow surrogate pairs in UTF-8 or UTF-32, use replacement character/string or throw error)
My one rule though, which I hope you won't ask me to break, is that all strings produced by the conversion routines should be strictly valid strings (UTF-8, UTF-16, or UTF-32). If somebody needs to produce "Java" style modified UTF-8 (i.e. '\0' as 0xc0 0x80), they can write a simple method that replaces all nul bytes with the overlong sequence.

Does that sound reasonable to you?

ScottPJones · 2015-05-28T02:52:17Z

Except for the fact that there is already a 3 arg convert [only for UTF8String], the options really should be the third (optional) argument, and the 4th, if present, be the replacement character/string [default to 0xfffd, as now, if the replace instead of throw error option is on in the 3rd argument].

garrison · 2015-05-28T06:10:13Z

@JeffBezanson I would be concerned about the security implications of accepting and storing invalid utf-8 (or other encoded) data. The Wikipedia page for UTF-8 specifically mentions that "Invalid UTF-8 has been used to bypass security validations in high profile products including Microsoft's IIS web server and Apache's Tomcat servlet container." (In particular, I think it is a mistake to think that a system that passes around invalid data is "more robust" than one that does not.)

EDIT: to clarify, I have no opposition (at the moment, at least) to accepting invalid utf8 data, as long as the UTF8String constructor converts it to valid data before it is passed to the rest of the application. I am unsure of whether this is how julia currently works or not.

JeffBezanson · 2015-05-28T06:21:27Z

Yeah, fair enough. The approach of passing an argument to request how to handle invalid data sounds pretty good.

I would like to check encoding validity early, so most functions don't have to worry about it. However I'm also concerned about the cost of checking on every string construction. Very often a UTF8String will be constructed in a way that we know the data is valid.

garrison · 2015-05-28T07:18:55Z

As a possible counter to my point, I just learned something from reading Armin Ronacher's post on UCS and UTF-8:

Go even gets away with using completely unchecked UTF-8 strings. In Rust it's impossible to construct a string in safe code with invalid UTF-8 characters. Go on the other hand lets you happily mix random bytes into your string, but all IO operations are required to ensure that the data is valid.

He goes on to say that he prefers Rust's model, and I think I do too. But the alternative (like Go) could be reasonable if the correct design considerations are made elsewhere.

ScottPJones · 2015-05-28T10:34:10Z

@JeffBezanson We are totally in agreement on that, about checking early.
There needs to be a way for constructors to bypass the checking, as the ASCIIString/UTF*String methods do, but otherwise everything should guarantee validity.
I think it actually (as I showed in #11004) knowing that strings are valid can actually speed up a lot of operations, and given that strings are immutable, you should assume that the uses of a string outweigh the making of the string (not the case for mutable strings).
@garrison Well, personally, I believe in Rust's model for immutable strings, but could go for Go's model for mutable strings, with well defined validation functions for mutable strings.
That way, you could merrily construct mutable strings, which might be temporarily invalid, and when finished, either explicitly check for validity (possibly setting a "valid" bit on the mutable string, which would be reset on any set to the string), or implicitly, when the string is either converted to a immutable string,
or an I/O operation is performed on it...

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

Fix #10959, fix #11463 bugs with UTF-8 conversions

JeffBezanson added the unicode Related to unicode characters and encodings label May 28, 2015

This was referenced May 30, 2015

Fix Unicode bugs with UTF-16/UTF-32 conversions (#10959) #11004

Closed

RFC: Roadmap for improving string support in Julia #11558

Closed

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 24, 2015

Fix JuliaLang#10959, fix JuliaLang#11463 bugs with UTF-8 conversions

2fca588

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 27, 2015

Fix JuliaLang#10959, fix JuliaLang#11463 bugs with UTF-8 conversions

37650ef

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

StefanKarpinski closed this as completed in 91305f7 Jul 28, 2015

StefanKarpinski added a commit that referenced this issue Jul 28, 2015

Merge pull request #11624 from ScottPJones/spj/fixutf8

416a23e

Fix #10959, fix #11463 bugs with UTF-8 conversions

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugs in Unicode handling with UTF8String #11463

Bugs in Unicode handling with UTF8String #11463

ScottPJones commented May 28, 2015

JeffBezanson commented May 28, 2015

ScottPJones commented May 28, 2015

JeffBezanson commented May 28, 2015

ScottPJones commented May 28, 2015

ScottPJones commented May 28, 2015

garrison commented May 28, 2015

JeffBezanson commented May 28, 2015

garrison commented May 28, 2015

ScottPJones commented May 28, 2015

Bugs in Unicode handling with UTF8String #11463

Bugs in Unicode handling with UTF8String #11463

Comments

ScottPJones commented May 28, 2015

JeffBezanson commented May 28, 2015

ScottPJones commented May 28, 2015

JeffBezanson commented May 28, 2015

ScottPJones commented May 28, 2015

ScottPJones commented May 28, 2015

garrison commented May 28, 2015

JeffBezanson commented May 28, 2015

garrison commented May 28, 2015

ScottPJones commented May 28, 2015