Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-21592 Update cj normal/loose linebreak per CSS #1991

Conversation

pedberg-icu
Copy link
Contributor

@pedberg-icu pedberg-icu commented Feb 21, 2022

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21592
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

This updates the CJ normal & loose linebreak tailorings per a recent change agreed on for CSS:

  • For normal, CSS no longer allows breaks before \u2010 HYPHEN or \u2013 EN DASH, i.e. it makes the behavior for these back to what standard linebreak does. These are normally in linebreak class BA but the cj normal tailoring has split them off into a separate class BAX to allow breaks; this PR retracts that.
  • For loose, CSS allows breaks before \u2010 HYPHEN or \u2013 EN DASH only after characters of class ID; for other characters it no longer does. These characters are still split into a separate class BAX, but the first part of rule LB 21 that formerly allowed a break before BAX now needs to be split into two rules, one for non-ID and one for ID. (BAX also needed to ba added to $CanFollowIS in rule LB 14a).

The separate rules for RBBIMonkeyTest in icu4c/source/test/testdata/break_rules/ and icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/ needed to be updated accordingly. However they also needed one additional update for line_loose_cj.txt: They needed the following addition corresponding to something that has been in the standard rules for a while, but whose absence in the RBBIMonkeyTest did not previously cause a problem:

LB21.3:      CM+ [BA HY NS];

While the CI tests are running I will in parallel run an extended-duration version of RBBIMonkeyTest locally to check for any hard-to-find problems, and note the result below.

@pedberg-icu pedberg-icu marked this pull request as draft February 21, 2022 06:44
@pedberg-icu pedberg-icu force-pushed the ICU-21592-update-cj-linebreak-per-css branch from 197b984 to f4cf01d Compare February 21, 2022 18:14
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/test/testdata/break_rules/line_loose_cj.txt is different
  • icu4c/source/test/testdata/rbbitst.txt is different
  • icu4j/main/shared/data/icudata.jar is different
  • icu4j/main/shared/data/icutzdata.jar is different
  • icu4j/main/shared/data/testdata.jar is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/line_loose_cj.txt is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu pedberg-icu force-pushed the ICU-21592-update-cj-linebreak-per-css branch from f4cf01d to 40cdb9e Compare February 21, 2022 18:19
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu pedberg-icu force-pushed the ICU-21592-update-cj-linebreak-per-css branch from 4aa996a to 9203ec6 Compare February 22, 2022 00:55
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@pedberg-icu
Copy link
Contributor Author

/azp run CI-Exhaustive

@pedberg-icu pedberg-icu marked this pull request as ready for review February 22, 2022 00:57
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@pedberg-icu
Copy link
Contributor Author

I ran the ICU4C testMonkey for 400000 iterations (compared to 100 for a normal quick test), which took about 45 min; no problems found.

@FrankYFTang
Copy link
Contributor

@allensu05 please also take a look.

Copy link
Contributor

@aheninger aheninger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to the main rules look good.

I think the LB21 changes in the line_loose_cj.txt monkey test rules can be simplified, but they're probably good enough for the moment, since you're under time pressure.

I'll play with them, and make a followup PR if I come up with something better.

@@ -200,8 +201,10 @@ LB20.09: ^(HY | HH) CM* AL;

LB21a: HL CM* (HY | BA | BAX) CM* [^CM CB]?;

LB21.1: . CM* [BA HY NS];
LB21.2: BB CM* [^CM CB];
LB21.1: [^BK CR LF NL CM ZW SP CB ID] CM* [BA BAX HY NS];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the long negated set is necessary; the hard breaks should never reach this point, having been handled by earlier rules. Monkey test rules are handled sequentially, unlike the main production rules, which are run in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Andy for looking at this and for approving!

@pedberg-icu pedberg-icu merged commit 4cfe96c into unicode-org:main Feb 22, 2022
aheninger added a commit to aheninger/icu that referenced this pull request Feb 25, 2022
This is a followup to PR unicode-org#1991, Update cj normal/loose linebreak per CSS

The original change to the line_loose_cj rules involved splitting hyphens out
of the BA (Break After) class, allowing a break when they follow an ID. This
change simplifies the the rules for doing that.

It also fixes a problem with the original change that had altered the behavior
of BAX hyphens that followed Regional Indicators or Unattached Combining Marks.
pedberg-icu pushed a commit that referenced this pull request Feb 25, 2022
This is a followup to PR #1991, Update cj normal/loose linebreak per CSS

The original change to the line_loose_cj rules involved splitting hyphens out
of the BA (Break After) class, allowing a break when they follow an ID. This
change simplifies the the rules for doing that.

It also fixes a problem with the original change that had altered the behavior
of BAX hyphens that followed Regional Indicators or Unattached Combining Marks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants