Bugs in Unicode handling with UTF8String #11463
Comments
Are the bugs all in converting between different encodings, or are you expecting the …
No, this is strictly with the convert and utf8 methods, just like the ones I'm trying to fix with …
I'm curious: is there some way you plan to make use of encoding-related errors at the application level? I've often seen data sets with a bit of corruption, or maybe UTF-8 with some Latin-1 mixed in, and it can be nice to just ignore the bad data and keep running. Granted this is pretty fast and loose, but recovering from the exceptions can be quite difficult.
I noticed that there was a specific … Does that sound reasonable to you?
Except for the fact that there is already a 3 arg convert [only for …
@JeffBezanson I would be concerned about the security implications of accepting and storing invalid utf-8 (or other encoded) data. The Wikipedia page for UTF-8 specifically mentions that "Invalid UTF-8 has been used to bypass security validations in high profile products including Microsoft's IIS web server and Apache's Tomcat servlet container." (In particular, I think it is a mistake to think that a system that passes around invalid data is "more robust" than one that does not.) EDIT: to clarify, I have no opposition (at the moment, at least) to accepting invalid utf8 data, as long as the …
Yeah, fair enough. The approach of passing an argument to request how to handle invalid data sounds pretty good. I would like to check encoding validity early, so most functions don't have to worry about it. However I'm also concerned about the cost of checking on every string construction. Very often a …
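As an illustration of the "pass an argument to say how invalid data is handled" idea, here is a minimal sketch in Python rather than Julia. It is not Julia's `convert` API: the function name `decode_with_policy`, the parameter name `on_invalid`, and the sample bytes are all hypothetical, and the policy is simply forwarded to the `errors` argument of Python's built-in `bytes.decode`.

```python
# Hypothetical sample: UTF-8 text with one stray Latin-1 byte (0xE9, "é" in Latin-1).
MIXED = b"caf\xe9 ok"


def decode_with_policy(data: bytes, on_invalid: str = "strict") -> str:
    """Decode UTF-8, letting the caller choose what happens to invalid bytes.

    on_invalid is an assumed parameter name for this sketch:
      "strict"  -> raise UnicodeDecodeError on the first invalid sequence
      "replace" -> substitute U+FFFD for each invalid sequence
      "ignore"  -> silently drop invalid bytes
    """
    if on_invalid not in ("strict", "replace", "ignore"):
        raise ValueError(f"unknown policy: {on_invalid!r}")
    # The policy maps directly onto the decoder's standard `errors` argument.
    return data.decode("utf-8", errors=on_invalid)


if __name__ == "__main__":
    print(repr(decode_with_policy(MIXED, "replace")))
    print(repr(decode_with_policy(MIXED, "ignore")))
    try:
        decode_with_policy(MIXED)  # default is "strict"
    except UnicodeDecodeError as err:
        print("strict policy rejected the input:", err.reason)
```

With a design like this, the validity check happens once at the decoding boundary, and the caller states explicitly whether corrupt input should stop the program or be tolerated.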
As a possible counter to my point, I just learned something from reading Armin Ronacher's post on UCS and UTF-8:
He goes on to say that he prefers Rust's model, and I think I do too. But the alternative (like Go) could be reasonable if the correct design considerations are made elsewhere.
@JeffBezanson We are totally in agreement on that, about checking early.
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
There are a number of bugs related to handling UTF-8 encoding in Julia.
A number of these are shown by the test routine in the following gist:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
The types of errors are dealing with:
- UTF-16 surrogates (where there is a surrogate pair, encoded as 2 3-byte sequences)
- UTF-16 surrogates (either trailing first, or missing trailing)
- … (> 0x10ffff)
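To make the error classes above concrete, here is a small sketch of byte sequences that fall into each one, checked against a strict UTF-8 decoder. It is written in Python purely for illustration and is not the test routine from the gist. The sample bytes, their labels, and the helper `is_valid_utf8` are assumptions introduced here.

```python
# Hypothetical sample inputs, one per error class, plus one valid control case.
SAMPLES = {
    # U+D83D U+DE00 (the surrogate pair for U+1F600), each half encoded as 3 bytes
    "surrogate pair as two 3-byte sequences": bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80]),
    # U+DC00, a trailing surrogate with no leading surrogate before it
    "trailing surrogate first": bytes([0xED, 0xB0, 0x80]),
    # U+D800, a leading surrogate with no trailing surrogate after it
    "missing trailing surrogate": bytes([0xED, 0xA0, 0x80]),
    # 4-byte pattern that would decode to 0x110000, one past the Unicode maximum
    "value above 0x10ffff": bytes([0xF4, 0x90, 0x80, 0x80]),
    # U+1F600 encoded correctly as a single 4-byte sequence
    "valid 4-byte sequence": bytes([0xF0, 0x9F, 0x98, 0x80]),
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


results = {name: is_valid_utf8(raw) for name, raw in SAMPLES.items()}

if __name__ == "__main__":
    for name, ok in results.items():
        print(f"{name}: {'valid' if ok else 'rejected'}")
```

A strict decoder rejects every sample except the last one. The first sample shows why the surrogate case matters: the same character has a valid 4-byte encoding, so accepting the 6-byte form gives two different byte strings for one code point.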