-
-
Notifications
You must be signed in to change notification settings - Fork 762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-13219 add -u-dx support to BreakIterator #2702
base: main
Are you sure you want to change the base?
Conversation
The tricky part will not be the behavior of break within a script, but in the boundary with another characters or between two script inside a -u-dx. For example, let's say we have -u-dx-thai-laoo and we have a run of text in thai and lao script and number without any spaces, would there any break in that run of text? or shoudl it beak in the boundary with number, or break in in the spot between the lao script and the thai script? or none at a all. |
1266364
to
862aefd
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
862aefd
to
7466a67
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
7466a67
to
b032026
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
b032026
to
a7b3d95
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behaviour matches my understanding of the definition.
I try your proposed diff of icu4c/source/test/testdata/rbbitst.txt and below is the error I got in my PR. Is your expectation "correct"?
|
You diff actually only add one test case
for line break
instead, right? |
Looking https://github.com/unicode-org/icu/pull/2676/files#diff-b177067bbc1df57fc40ae7629a81e8df960899b9088555b010680a1c500943e2
should be
there are no reason to have a line break before the space. Line break should only happen after the SPACE not before the SPACE, right? |
a7b3d95
to
3cb8d52
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
@srl295 I copy your test change over but change it. Please read my modified version in this PR and see do you agree with that. The change are
|
3cb8d52
to
f64fe10
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
f64fe10
to
8f8766a
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java
Outdated
Show resolved
Hide resolved
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java
Outdated
Show resolved
Hide resolved
// Ask the language object if there are any breaks. It will add them to the cache and | ||
// leave the text pointer on the other side of its range, ready to search for the next one. | ||
if (lbe != null) { | ||
foundBreakCount += lbe.findBreaks(fText, rangeStart, rangeEnd, fBreaks, fPhraseBreaking); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked in detail at this, but it appears that this wouldn't catch the case where a character before rangeEnd should be excluded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so... how is that behavior specified in UTS 35 + UAX 29 + UAX 14?
Could we have a test case for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I mean. The meaning of dx-xxxx is that none of the xxxx characters will be processed by the break iterators.
So say that 't' stands for Thai, and . stands for other characters.
ttttt......ttttttttttt......ttttttt.......
With dx-thai, break iterators must only act on the dots (non-thai)
With your code, the iterator would skip over the first ttttt and start at the first non-Thai (the first dot)
However, I see nothing in the code that would prevent the iterator from continuing at least part-way into into the second group of ttttttttttt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... notice this part is inside a function
populateDictionary(int startPos, int endPos,...)
If you have the text "ttttt......ttttttttttt......ttttttt......."
so the first "ttttt" is in index 0-5, and the second "ttttttttttt" is in index 11-22, 28-35
the upper caller will call this function three time
- first call this with populateDictionary(0, 5...)
- second time call this with populateDictionary(11, 22...)
- third time call this with populateDictionary(28, 35...)
and the code put in the break between 0 - 5 into fBreak first call, the breaks between 11-22 into fBreak second call and the breaks between 28-35 into fBreak the third call
and the upper caller will figure out how to break the ...
With my change, when we hit any t, if (excludedFromDictionaryBreak(c)) willl return true and therrefore just advance the iterator till 5 and return out of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark, I add the following to unit test to show it does work
bi->setText(UnicodeString(u"aaอออaaaaaอออ aaaa"));
for line break, only the 1) begin, 2) between" " and "a" and 3) the end of text break
for word break, only the 1) begin, 2) between "อ" and " ", 3) between " " and "a" and 4) the end of text break
8f8766a
to
9cf2b52
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Let me try to be clearer. Suppose that
AAABBBCCCDDD What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBNF rules will act on BBBCCCDDD. From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear. What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored. |
I see. ok, you are right, that is not clear. I need to call excludedFromDictionaryBreak to adjust the startRange and endRange before passing to the findBreaks. the startRange and endRange pass to the findBreaks may need to be changed to a different values excluding these characters. |
@FrankYFTang my apologies, I have not found spare time to review this recently. i will let @mhosken in case he's able to review. it seems it's going in a good direction… certainly feel free to amend my test as as needed (it was meant as an example not as proscriptive) and close the other PR… |
throw new IllegalArgumentException("Incorrect value for dx key: " + dxs); | ||
} | ||
String script = dxs.substring(i*5, i*5+4); | ||
// Special handling of zyyy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this. Right after the length check, see if the entire dxValues value equals (case insensitive) "-zyyy". If so, return UnicodeSet.ALL_CODE_POINTS (everything, might want a static constant).
Otherwise, there is no special zyyy handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a test case also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we need to take care the case of "en-u-dx-thai-zyyy" or "en-u-dx-thai-hani-zyyy", etc too right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; the CLDR ticket clarifying dx was accepted for CLDR v44.1 (you are a watcher)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is " if the entire dxValues value equals (case insensitive) "-zyyy"" is not good enough because we may have
"en-u-dx-thai-zyyy" or "en-u-dx-thai-hani-zyyy" which the type is "thai-hani-zyyy" not just "zyyy"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh. Just saw what you landed in https://github.com/unicode-org/cldr/pull/3411/files that make sense. Sorry. Ignore my previous comments.
// For example, if the locale is "en-u-dx-abc-defgh", dxs is "abc-defgh" | ||
// and builder.toString() return "[[:scx=abc-:][:scx=efgh:]]" and causes | ||
// UnicodeSet constructor to throw IllegalArgumentException | ||
return new UnicodeSet(builder.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Freeze the UnicodeSet — add .freeze(). that makes it immutable and faster.
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java
Show resolved
Hide resolved
Np, your feedback is great for catching problems!
…On Wed, Dec 6, 2023 at 1:16 PM Frank Yung-Fong Tang < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java
<#2702 (comment)>:
> + String dxs) {
+ if (dxs == null) {
+ return null;
+ }
+ if (dxs.length() % 5 != 4) {
+ throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
+ }
+ // Change from "thai" to "[[:scx=thai:]]" or "thai-arab" to "[[:scx=thai:][:scx=arab:]]"
+ StringBuilder builder = new StringBuilder("[");
+ int items = 1 + (dxs.length() / 5);
+ for (int i = 0; i < items; i++) {
+ if (i > 0 && dxs.charAt(i*5-1) != '-') {
+ throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
+ }
+ String script = dxs.substring(i*5, i*5+4);
+ // Special handling of zyyy
oh. Just saw what you landed in
https://github.com/unicode-org/cldr/pull/3411/files that make sense.
Sorry. Ignore my previous comments.
—
Reply to this email directly, view it on GitHub
<#2702 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMFIZ2UB57SAEBL2FWDYIDOEBAVCNFSM6AAAAAA7LXMYCKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTONRYGYZTKMRTGM>
.
You are receiving this because you were assigned.Message ID:
***@***.***>
|
Please ignore my update. I am still working on this PR. It is not ready for review. After Mark point out some issue, I found my design was wrong and need a more intensify rework. |
6868938
to
2dfa13c
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes the problems I noted
I'm getting user complaints again on this. Can we action this. Some fix for disabling dictionary breaking has been requested since, well I can't find out since I can't get to the old bug tracker, but it came into the latest tracker in 2019. Perhaps the best is the enemy of the good here? The only people, that I know of, that are affected by this are those using minority languages, are inserting ZWSP for word breaks and are dealing with correctly tagged text. Do we have to refine this fix for the non use cases as well, before we can fix for the actual use case? I'm sorry that my frustration is showing. But we seem to be more concerned about people who do the wrong thing than those who do the right thing (and tag correctly, by some definition). The really correct solution is that if the text is not tagged with the language of the dictionary, then no dictionary breaking should occur. I realise that that is just too much for most people and so we have special tagging. But can we please get something out for these users who are able to tag correctly? Please shipit already. |
@FrankYFTang is this going to be merged for 75? |
no, the issue is more complicated than my PR did. |
Do you have more detail? |
It require more detail analysis and testings for different cases than what I put into this PR. I missed some complicated combination. |
Frank, |
sorry, I need to pick up the work again. |
ICU-13219 Fix ICU-13219 add -u-dx- support to BreakIterator ICU-13219 Fix ICU-13219 update
334bd42
to
ba1260f
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Checklist