Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugs in Unicode handling with UTF8String #11463

Closed
ScottPJones opened this issue May 28, 2015 · 9 comments
Closed

Bugs in Unicode handling with UTF8String #11463

ScottPJones opened this issue May 28, 2015 · 9 comments
Labels
unicode Related to unicode characters and encodings

Comments

@ScottPJones
Copy link
Contributor

There are a number of bugs related to handling UTF-8 encoding in Julia.
A number of these are shown by the test routine in the following gist:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
The types of errors are dealing with:

  1. overlong encoded characters
  2. missing continuation bytes at the end of a string
  3. encoding of UTF-16 surrogates (where there is a surrogate pair, encoded as 2 3-byte sequences)
  4. handling of invalid UTF-16 surrogates (either trailing first, or missing trailing)
  5. characters outside legal Unicode range (i.e.> 0x10ffff)
@JeffBezanson
Copy link
Member

Are the bugs all in converting between different encodings, or are you expecting the UTF8String constructor to throw errors? Currently it's intentional that the UTF8String constructor never throws errors. That comes from trying to be more robust when given invalid data. I'd be ok with changing this however.

@JeffBezanson JeffBezanson added the unicode Related to unicode characters and encodings label May 28, 2015
@ScottPJones
Copy link
Contributor Author

No, this is strictly with the convert and utf8 methods, just like the ones I'm trying to fix with UTF-16/32
(the never ending PR!).
I did bring up an issue that there's a nasty hole, in that there's no guarantee that strings are immutable, because you can keep an alias to the array handed to UTF*String and fiddle with it's private parts.
Maybe that's a 0.5 cleanup... (making the public UTF*String constructors safe, while leaving a backdoor for things like the utf* and convert methods to call... I have no idea what would be a good approach...)

@JeffBezanson
Copy link
Member

I'm curious: is there some way you plan to make use of encoding-related errors at the application level? I've often seen data sets with a bit of corruption, or maybe UTF-8 with some Latin-1 mixed in, and it can be nice to just ignore the bad data and keep running. Granted this is pretty fast and loose, but recovering from the exceptions can be quite difficult.

@ScottPJones
Copy link
Contributor Author

I noticed that there was a specific convert method (unfortunately, only for UTF-8, I believe), that has a 3rd argument, for a replacement character/string. There could also be a 4th argument, for options (like I have at the low level), for how invalid data is handled... (allow overlong nul, allow all overlong encodings, allow surrogate pairs in UTF-8 or UTF-32, use replacement character/string or throw error)
My one rule though, which I hope you won't ask me to break, is that all strings produced by the conversion routines should be strictly valid strings (UTF-8, UTF-16, or UTF-32). If somebody needs to produce "Java" style modified UTF-8 (i.e. '\0' as 0xc0 0x80), they can write a simple method that replaces all nul bytes with the overlong sequence.

Does that sound reasonable to you?

@ScottPJones
Copy link
Contributor Author

Except for the fact that there is already a 3 arg convert [only for UTF8String], the options really should be the third (optional) argument, and the 4th, if present, be the replacement character/string [default to 0xfffd, as now, if the replace instead of throw error option is on in the 3rd argument].

@garrison
Copy link
Member

@JeffBezanson I would be concerned about the security implications of accepting and storing invalid utf-8 (or other encoded) data. The Wikipedia page for UTF-8 specifically mentions that "Invalid UTF-8 has been used to bypass security validations in high profile products including Microsoft's IIS web server and Apache's Tomcat servlet container." (In particular, I think it is a mistake to think that a system that passes around invalid data is "more robust" than one that does not.)

EDIT: to clarify, I have no opposition (at the moment, at least) to accepting invalid utf8 data, as long as the UTF8String constructor converts it to valid data before it is passed to the rest of the application. I am unsure of whether this is how julia currently works or not.

@JeffBezanson
Copy link
Member

Yeah, fair enough. The approach of passing an argument to request how to handle invalid data sounds pretty good.

I would like to check encoding validity early, so most functions don't have to worry about it. However I'm also concerned about the cost of checking on every string construction. Very often a UTF8String will be constructed in a way that we know the data is valid.

@garrison
Copy link
Member

As a possible counter to my point, I just learned something from reading Armin Ronacher's post on UCS and UTF-8:

Go even gets away with using completely unchecked UTF-8 strings. In Rust it's impossible to construct a string in safe code with invalid UTF-8 characters. Go on the other hand lets you happily mix random bytes into your string, but all IO operations are required to ensure that the data is valid.

He goes on to say that he prefers Rust's model, and I think I do too. But the alternative (like Go) could be reasonable if the correct design considerations are made elsewhere.

@ScottPJones
Copy link
Contributor Author

@JeffBezanson We are totally in agreement on that, about checking early.
There needs to be a way for constructors to bypass the checking, as the ASCIIString/UTF*String methods do, but otherwise everything should guarantee validity.
I think it actually (as I showed in #11004) knowing that strings are valid can actually speed up a lot of operations, and given that strings are immutable, you should assume that the uses of a string outweigh the making of the string (not the case for mutable strings).
@garrison Well, personally, I believe in Rust's model for immutable strings, but could go for Go's model for mutable strings, with well defined validation functions for mutable strings.
That way, you could merrily construct mutable strings, which might be temporarily invalid, and when finished, either explicitly check for validity (possibly setting a "valid" bit on the mutable string, which would be reset on any set to the string), or implicitly, when the string is either converted to a immutable string,
or an I/O operation is performed on it...

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 24, 2015
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 27, 2015
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
StefanKarpinski added a commit that referenced this issue Jul 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

3 participants