-
-
Notifications
You must be signed in to change notification settings - Fork 754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-20715 CollationDataBuilder reset outdated prefix+contraction values #2052
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks for doing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was actually a lot simpler than I'd feared, and a lot better than what I'd come up with. I like this a lot.
FYI @macchiati @pedberg-icu you might find this interesting :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I ported it over to Apple ICU and tested it, and it does solve the problem, and it's a lot more elegant than what I had. Thanks so much for your help with this!
0e3268e
to
042c48d
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
Squashed, please re-approve |
unicode-org/icu#2052 unicode-org/icu#2067 Bug: chromium:1312557, chromium:1319923 Change-Id: I50c4e8ed77f55bb0ecbcd217eeb61c45e65db9a3 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/deps/icu/+/3612567 Reviewed-by: Jungshik Shin <jshin@chromium.org>
Fixes a subtle bug in the CollationDataBuilder.
While the collation builder collects and processes mappings, it uses a list of ConditionalCE32 elements for context-sensitive mappings, and it builds those into runtime data structures so that it can perform the collation algorithm during build time without duplicating the runtime code for the complex matching of context before and after a character.
When it builds runtime tries for a list of ConditionalCE32 where there are multiple contraction suffixes per prefix (context before the character), the builder temporarily stores a defaultCE32 for when none of the suffixes matches. Sometimes this is a copy of a defaultCE32 value earlier in the list. (Note that the combination of both prefixes and contraction suffixes is rare.)
Then, when a new mapping is added to this list, the corresponding runtime tries may need to be rebuilt. The new mapping might have been inserted such that the defaultCE32 value for a prefix will not be overridden, and may inadvertently be copied to the defaultCE32 for a later prefix, which could lead to bad data and subtly wrong matching behavior.
In addition, the builder frequently abandons built, temporary runtime tries and sometimes overflows its storage for them. It then clears its cache and starts over as needed. After that, a leftover defaultCE32 may point to a trie that has been overwritten, and the trie evaluation code sees bogus data and runs outside the storage for the tries.
The fix is to reset the outdated, leftover, earlier defaultCE32 values in the ConditionalCE32 list before they may be copied for later prefixes.
Many thanks to @richgillam for much tedious debugging, documenting his findings in the ticket and in PR #2050, and pointing very close to the root cause!
Note: In the future, we should look into building fewer temporary runtime tries. See https://unicode-org.atlassian.net/browse/ICU-21978
Checklist
DISABLE_JIRA_ISSUE_MATCH=true