Fix Unicode bugs with UTF-16/UTF-32 conversions (#10959) #11004
Conversation
@mschauer No reason other than force of habit... too much time spent jumping between languages where types were required (C, C++, Java) and a language that had no types at all (which also has for loops that are slower than a while loop, because the for loop pushes a frame while a while loop does not).
Help! How can I find out whether this failed travis-ci build had anything to do with my changes?
It seems to have failed downloading the rmath tarball, so probably not something you did, I think.
I've restarted that build. You can check the build log yourself to see if the failure seems related, but if you're still not sure it's best to ask.
@ScottPJones It was just a style remark, using
@mschauer The other reason I used while instead of for (I did use for later on) was because my other loops were modifying the index within the loop (skipping over UTF-8 sequences or UTF-16 surrogate pairs).
I also have a change that makes length of a UTF-8 string about twice as fast, but I'm waiting to hear back from @JeffBezanson (as it is his u8_charnum function I changed; some stuff he apparently wrote 10 or more years ago) before making a pull request. (Also, I'm still figuring out how best to use git/github to be able to do multiple pull requests at a time.)
You can then push that branch to your remote, and make a pull request from it, unrelated to the other branch and pull request. As to your other question, if you suspect the loop is getting optimized out incorrectly, there are two things to do:
@pao Again, sorry for the newbieness, but how would I make it observable in Julia (without greatly affecting the performance of what I want to measure)? Are there any good write-ups of best practices for benchmarking Julia, particularly for benchmarking small things like this, where the for loop itself may take as much time as the operation being measured?
I don't know that there are any good references or generalizable rules for defeating an optimizer. It appears to be enough (at least with the current definition of length) to use something like:

```julia
function tricksy(s, n)
    local l
    for i in 1:n
        l = length(s)
    end
    l
end
```
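For what it's worth, here is a rough sketch of how a loop like that might be timed; the test string, warm-up call, and iteration count are my own assumptions, not from the thread:

```julia
s = "αβγδ"^1000            # an arbitrary test string (assumption)
tricksy(s, 1)               # warm up first, so compilation time isn't measured
@time tricksy(s, 10^6)      # returning `l` keeps the length(s) call from being optimized away
```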
Regarding the
@kmsquire Do I need to change them before they will get merged, or can I hold off until my next round of speedups? (That will probably come shortly, after some more benchmarking.)
@ScottPJones, if you make the changes on the same branch and push to github, they'll appear here. This is the common method for updating pull requests. In some cases, that may go on for a few rounds, so when things are ready, we typically ask that people squash the commits down to one commit and then force push to the same branch. This keeps the commit history of the main repository somewhat clean, and also makes it easier to track down problems using
@kmsquire, I haven't been able to figure out just what I need to do to squash the commits... Help, please?
@pao, thanks for the suggestion... now I get the following (and this is with my improved length() for UTF8String), so it is almost... UTF-8 is 7.7 times slower with just 16 characters, and 3753 times slower with 64K...
@ScottPJones, http://julia.readthedocs.org/en/latest/manual/packages/#making-changes-to-an-existing-package contains information that may be helpful (some is relevant to base as well as packages), including stuff relating to squash.
@ScottPJones, can you give some profiling numbers from a real application in which the UTF8-UTF16 conversions etc. are a significant portion of the time? As I'm sure you recognize, there is a tradeoff in optimizing code like this: making the code much more complicated in order to make it faster makes it harder to maintain etc. in the future.
@stevengj Not really, for a few reasons... I am no longer consulting for the company where I architected and wrote a large part of the national language and Unicode support, so I no longer have access to internal benchmarks from that company. After being responsible for the core of a product used around the world for almost 29 years, I would say that I know quite well the tradeoffs between code complexity and maintainability... I would say that having something that is simply too slow to be useful, so that people end up implementing their own solutions, doesn't really help. If something is O(n) on some of the most frequent operations instead of O(1), people aren't going to want to use it. Back in 1995, I had to port the product to work in Japan... at the time, I was being pushed in the direction of using a multibyte character set, but two other implementations of the language/database with Japanese versions had done so, using SJIS and EUC, and they just didn't perform that well compared to the 8-bit versions... I had heard about Unicode (it was at version 1.0, just 16-bit, no surrogates), and I pushed for using that instead... keeping a boundary between the byte-code interpreter, the database, and IO modules, so that the core always used 16-bit characters (like Java, which also was designed before Unicode 2.0 and surrogates were added), and conversions are only done when setting/getting to/from the database, or reading/writing to external devices... It has been in production since 1997, and is used all over the world... and has been subjected to constant benchmarking... which has shown that the Unicode version of the platform performs within a few % of the 8-bit version... Also, for that system, since the only time that there were any UTF-16 <-> UTF-8 conversions was when data was coming/going to a file or over the network, it never was a significant portion of the time (also, because I optimized the hell out of the conversion functions). For this current project, though, we may be getting (I believe) a lot of data in UTF-8 format (as JSON or XML)... so I am very concerned about the performance of those conversions.
Hi Scott, I think Steven was just asking for some evidence of your improvements on real code (as opposed to the performance tests you've been posting).
```julia
if len == 0
    buf = Array{UInt16,1}(1)
    buf[1] = 0
    return UTF16String(buf)
```
Actually, consider replacing this whole block with
```julia
return UTF16String(UInt16[0])
```
Ah, good... I didn't know you could just say UInt16[0] to get an array with just the 0. I'll change those.
zeros(UInt16,1) is another alternative.
Saying "3753 times slower" is a bit misleading; it's just that one is O(1) and the other is O(n). I'd compare two O(n) operations, like searching for a character.
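For illustration, a minimal sketch of that kind of apples-to-apples comparison, assuming the 0.4-era utf16 constructor and search function; the test string, character, repetition count, and the helper name findmany are arbitrary assumptions:

```julia
s8  = "abcδεζ"^1000                  # UTF-8 test data (arbitrary)
s16 = utf16(s8)                       # the same data as UTF-16

function findmany(s, c, n)            # hypothetical helper so the work isn't optimized away
    k = 0
    for i in 1:n
        k += search(s, c)             # both encodings pay an O(n) scan here
    end
    k
end

@time findmany(s8,  'ζ', 1000)
@time findmany(s16, 'ζ', 1000)
```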
Re: squashing: while it's good to do (and good to know how to do), generally it's better to wait until any changes have made their way into this pull request.
I don't love the large amount of manual inlining in this code. I'd be really curious to see how close we can get with more generic code. Ideally, a conversion to UTFxx would loop over its input string and consume Chars, letting the compiler specialize it for different input string types. In our current code, there seems to be a full UTF-8 validation routine copy-pasted into each function that consumes UTF-8 (and similarly for other types). I don't think that's acceptable. We need to find a better way to factor this. Aside: UTF-16 sucks.
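To make the shape of that suggestion concrete, here is a hedged sketch (not the actual Base code, and not the code in this PR; the name to_utf16_units is made up) of a conversion that just consumes Chars from any string type and lets the compiler specialize on the input:

```julia
# Rough sketch only: no validation, and no error handling for invalid input.
function to_utf16_units(s::AbstractString)
    buf = UInt16[]
    for ch in s                       # any AbstractString yields Chars when iterated
        c = UInt32(ch)
        if c < 0x10000
            push!(buf, UInt16(c))
        else                          # encode characters above the BMP as a surrogate pair
            c -= 0x10000
            push!(buf, UInt16(0xD800 + (c >> 10)))
            push!(buf, UInt16(0xDC00 + (c & 0x3FF)))
        end
    end
    push!(buf, 0x0000)                # explicit NUL terminator, as UTF16String expects
    buf
end
```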
@JeffBezanson This was my first large bit of Julia code (before that, I'd been writing a lot of wrappers to my C libraries for packing/unpacking data, calling Aerospike, etc., nothing more than 5 lines), and I'd love to know how to do this better, without sacrificing performance. Yes, UTF-16 sucks, but so does UTF-8 in a lot of ways... try convincing Japanese customers that UTF-8 is better than SJIS encoding for their data, when it takes 50-100% more space... (I had to come up with a very fancy way of encoding Unicode data, which frequently took < 1 byte per character on their data and yet works even on very short strings of just a few characters... to keep them happy...)
@JeffBezanson I disagree that it is misleading; I was just comparing length there, and the whole point is that it is O(n) instead of O(1)... I could not change the customers' applications, and the most common operations were: getting the length of a string, searching for a single character, getting the nth substring delimited by a single character, replacing the nth substring delimited by a single character, extracting a single character at position x, extracting a substring from x to y, and replacing a substring from x to y. For one particular set of customers (everybody using the implementation of Pick/Multivalue), where the Pick language uses "system delimiters" in the range 248-255, all of those incredibly frequent delimiters become 2 bytes, which means I'd have to do searches for 2 bytes instead of one all the time, with constant matches on the first byte... (I don't have any numbers anymore, but there was a big difference in speed there.)
@JeffBezanson About making things more generic, I tried to do that at first, and then ran into all sorts of headaches because of the inconsistencies between UTF-8 (UInt8, but the \0 terminator is "hidden"), UTF-16 (UInt16, but you must add an explicit \0), and UTF-32 (Char, not UInt32, need explicit \0 terminator, need to do things like Char(0) instead of 0, and reinterpret(UInt32,s[i])).
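As a purely illustrative sketch of the inconsistencies being described, mirroring the 0.4-era internals mentioned above (the buffer length n is an arbitrary assumption):

```julia
n = 4                                   # number of code units/characters (arbitrary assumption)
buf16 = Array{UInt16,1}(n + 1)          # UTF-16 buffer: an explicit 0 terminator must be added
buf16[end] = 0
buf32 = Array{Char,1}(n + 1)            # UTF-32 buffer: element type is Char, not UInt32...
buf32[end] = Char(0)                    # ...so the terminator has to be written as Char(0)
raw = reinterpret(UInt32, buf32[end])   # and getting the raw code point back needs reinterpret
```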
@kmsquire Sorry, the fact that I don't have any "real code" using the conversions yet, nor access to the code & benchmarks from before, is a sore point with me! Also, any "real code" that I'm working on using the conversions would be in confidential, proprietary code for the startup I'm consulting for.
@JeffBezanson If I add something to julia/src/support, i.e. utf.h and utf.c, for UTF-16 and UTF-32 optimizations, should I put something like
No, that's the default for contributions to this repository. Some of the code from
Known bug. Major point of contention between those who wrote most of that code vs those trying to understand it. Documentation/comments should be kept up to date, which means more work than just writing the code. Some pieces are starting to be documented separately from the code in devdocs, and some prefer external longer-form design documents rather than inline comments. I personally like the locality of comments, but either way, if more people volunteer to write, read, and keep documentation of any form up to date, it would be welcomed.
```julia
elseif ch < 0x800
    num2byte += 1
    flags |= UTF_UNICODE2
elseif !is_surrogate_codepoint(ch)
```
As I said, codepoint here is probably incorrect AFAICT. I don't think you replied to that particular comment of mine, but isn't codeunit more correct?
Yes, you are quite right, just brain rot mixing those up... I'll fix.
I'm closing this pull request as unreviewable. Please factor it into smaller, reviewable pieces.
@StefanKarpinski Do you mean the diff is unreviewable, or this PR page? Are you asking @ScottPJones to split it up into multiple commits, or to rethink/rework it completely?
Really???
@vtjnash Thanks for all the helpful review. I had addressed those minor notes and pushed the changes to my fork, but because this has been closed, they don't seem to show up here.
@timholy I guess you spoke too soon, unfortunately. Thanks for all of your help reviewing this, along with everybody else.
Well, what's closed can be reopened. But honestly, sometimes a clean start is both better and easier. Heck, on #11242 I restarted twice (see #10691, #10911). I found it was easier to copy/paste my changes into a new file, and it gave me a chance to think things through from scratch, having learned a lot about julia's guts from previous failed attempts. Finally, the good parts of my earlier attempts were salvaged with, like, 30 minutes of work, while the carry-it-forward (i.e., new stuff) took a week. So, don't take this as a statement of the value that folks find in your code, but simply a way to start from a cleaner history and let all your work shine!
There's actually a github limitation/bug that means you literally can't reopen a closed PR if you've made any pushes to the branch after it was closed. But figuratively, I do think the core improvements here are close to ready, if we can find smaller incremental ways of getting them in piece by piece.
As I said before, I don't think this particular change can be split up. Others who have reviewed the code seem to have agreed. As @IainNZ said:
We can immediately add any of the tests that would pass with the implementation we have on master. Test additions are almost always easy, noncontroversial merges. Then we look at the tests that are added here where master would fail. I bet, taken a few failing tests at a time, we can get to the same place.
One way it can be reduced is going back to the state before (with utf16.jl and utf32.jl) and applying the changes you've made to each function in the original place. This'll reduce the diff considerably... it may be pretty tricky, but honestly it's not that big a diff to actually do; really it's only a few hundred lines. :s Worth a quick try on another branch?
Sorry, @tkelman, but I've donated quite a bit of my time already to trying to fix the many serious problems I've been finding (including the one you pointed me to yesterday) in the Julia code.
And you've been successful in fixing quite a few of them, which is hugely appreciated. I hope you won't overreact to one (large and overly-drawn-out) PR getting closed. If I had known this was going to be the resolution, I would have closed it myself and said the same thing Stefan just said but 3 or 4 weeks sooner to avoid the mess, but hindsight's 20/20. Needing a better unit test system IS A DIFFERENT ISSUE. The beauty of git is that if you don't have time to see this through, anyone else can make a copy of your branch (ScottPJones@21bbaef looks like the most recently pushed commit) and try cleaning it up themselves. Preserving authorship information wherever appropriate, of course.
@tkelman My comment about the unit test system was because of your recommendation to go painfully bit by bit, adding back in the unit tests that don't fail (not too many of those, actually). It may be a different issue, but it was still on topic replying to you.
I do think that @StefanKarpinski's way of acting is detrimental to the Julia community.
@timholy Yes, in general that is true, but I believe I've already done just that in this case... First I had a pure Julia version, then I tried to improve it using my C library, then people said it had to be pure Julia, so I rewrote it based on my C code, refactored, etc. etc.
Not just mine. A significant amount of the feedback all along has been (paraphrasing) "this is too big to do all at once" and "this has fundamental issues due to too much repeated code", and you haven't addressed that. You don't need to add all the failing tests at once with some sophisticated test tracking system to refactor this code. If you don't have time to come up with a different way of fixing the same set of bugs, someone else will.
Here's a potential path forward: make a new PR with the validation functions and associated tests only. Those are self-contained, valuable, and seem reviewable on their own (a few hundred LoC max). This will take three minutes, wouldn't require writing any new code, and adding this code won't change any existing behavior. Once those are in, subsequent PRs could be made to start using them.
Given that this PR has had a messier history than most, I totally agree that this is not the catastrophe you're making it out to be. If you think your PRs aren't getting merged quickly enough, just check out the history of @vtjnash's PRs. You have a very good track record of getting things in quickly because you push hard, but I think many in the community would be happier if you didn't push quite so hard 😄. My advice: just treat this as a "process" decision and nothing personal, and resubmit. Because that's the reality, and it doesn't help anything or anyone to treat it otherwise.
@timholy I'm not making it out to be a catastrophe, but I do think that things were not handled well at all.
Presuming the resubmission includes refactoring along the lines suggested by others, that's truly excellent. Cheers!
The new PR is #11551, and honestly it looks like a big improvement; smaller diff and more separate files/concerns. :)
Rewrote a number of the conversions between ASCIIString, UTF8String, UTF16String, and UTF32String, and also rewrote length() for UTF16String.
Added validation functions that can be used in many other places, which also fix a number of Unicode handling bugs, such as detecting overlong encodings, out-of-range characters, singleton surrogate characters, and missing continuation characters, among other problems (see the illustrative sketch below).
Added over 150 lines of testing code to detect the above conversion problems.
Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved):
https://gist.github.com/ScottPJones/79ed895f05f85f333d84
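For readers unfamiliar with the terms above, here is a minimal, hedged sketch of the flavor of such checks; this is illustrative only, not the PR's actual validation code, and the function names are made up:

```julia
# An overlong encoding uses more bytes than necessary; lead bytes 0xC0 and 0xC1
# can only start overlong 2-byte sequences (they would encode code points < 0x80).
is_overlong_2byte_lead(b::UInt8) = b == 0xC0 || b == 0xC1

# Every byte after a multi-byte lead must be a continuation byte of the form 10xxxxxx;
# anything else means a missing continuation character.
is_continuation(b::UInt8) = (b & 0xC0) == 0x80

# Code points in 0xD800-0xDFFF are surrogates; a lone ("singleton") surrogate is
# invalid in well-formed UTF-8 and UTF-32.
is_surrogate(c::UInt32) = 0xD800 <= c <= 0xDFFF
```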