Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-20715 CollationDataBuilder reset outdated prefix+contraction values #2052

Merged
merged 1 commit into from
Apr 4, 2022

Conversation

markusicu
Copy link
Member

@markusicu markusicu commented Mar 26, 2022

Fixes a subtle bug in the CollationDataBuilder.

While the collation builder collects and processes mappings, it uses a list of ConditionalCE32 elements for context-sensitive mappings, and it builds those into runtime data structures so that it can perform the collation algorithm during build time without duplicating the runtime code for the complex matching of context before and after a character.

When it builds runtime tries for a list of ConditionalCE32 where there are multiple contraction suffixes per prefix (context before the character), the builder temporarily stores a defaultCE32 for when none of the suffixes matches. Sometimes this is a copy of a defaultCE32 value earlier in the list. (Note that the combination of both prefixes and contraction suffixes is rare.)

Then, when a new mapping is added to this list, the corresponding runtime tries may need to be rebuilt. The new mapping might have been inserted such that the defaultCE32 value for a prefix will not be overridden, and may inadvertently be copied to the defaultCE32 for a later prefix, which could lead to bad data and subtly wrong matching behavior.

In addition, the builder frequently abandons built, temporary runtime tries and sometimes overflows its storage for them. It then clears its cache and starts over as needed. After that, a leftover defaultCE32 may point to a trie that has been overwritten, and the trie evaluation code sees bogus data and runs outside the storage for the tries.

The fix is to reset the outdated, leftover, earlier defaultCE32 values in the ConditionalCE32 list before they may be copied for later prefixes.

Many thanks to @richgillam for much tedious debugging, documenting his findings in the ticket and in PR #2050, and pointing very close to the root cause!

Note: In the future, we should look into building fewer temporary runtime tries. See https://unicode-org.atlassian.net/browse/ICU-21978

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-20715
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

DISABLE_JIRA_ISSUE_MATCH=true

richgillam
richgillam previously approved these changes Mar 26, 2022
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks for doing this!

@markusicu markusicu changed the title ICU-20715 unit test case ICU-20715 CollationDataBuilder reset outdated prefix+contraction values Mar 31, 2022
richgillam
richgillam previously approved these changes Apr 1, 2022
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was actually a lot simpler than I'd feared, and a lot better than what I'd come up with. I like this a lot.

@markusicu markusicu marked this pull request as ready for review April 1, 2022 22:26
@markusicu
Copy link
Member Author

FYI @macchiati @pedberg-icu you might find this interesting :-)

richgillam
richgillam previously approved these changes Apr 1, 2022
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I ported it over to Apple ICU and tested it, and it does solve the problem, and it's a lot more elegant than what I had. Thanks so much for your help with this!

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu markusicu requested a review from richgillam April 1, 2022 23:04
@markusicu
Copy link
Member Author

Squashed, please re-approve

@markusicu markusicu merged commit 4833cc8 into unicode-org:main Apr 4, 2022
@markusicu markusicu deleted the coll-context-overflow branch April 4, 2022 16:10
ry pushed a commit to denoland/icu that referenced this pull request May 17, 2022
unicode-org/icu#2052
unicode-org/icu#2067

Bug: chromium:1312557, chromium:1319923
Change-Id: I50c4e8ed77f55bb0ecbcd217eeb61c45e65db9a3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/deps/icu/+/3612567
Reviewed-by: Jungshik Shin <jshin@chromium.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants