Unicode 15.1 linebreaking #48

Merged on Jul 10, 2023 (35 commits)

@eggrobin commented Jun 28, 2023

[175-C23] Consensus: Replace rule LB 15 by LB 15a and LB 15b in UAX #14, as described in L2/23-063 Line breaking around quotation marks, changing the references to the sets [:Pi:] and [:Pf:] to [[:Pi:]&QU] and [[:Pf:]&QU], respectively, for Unicode Version 15.1.

Note: Added ZW to both rules as PAG will recommend.

[175-C27] Consensus: Add line breaking classes AF, AK, AP, AS, VI, and VF, as well as a new line breaking rule LB 28b, and change Line_Break property values, as described in L2/23-072.

Note: the new rule LB 28b has been editorially renumbered to LB 28a.

aheninger and others added 30 commits May 31, 2023 17:39
This is an experimental implementation of the line breaking rules proposed in the
Unicode document L2/22-080R. It is not suitable for merging into ICU main.

Limitations:
   - ICU4C only.
   - Root locale only (not implemented for the various LB tailorings).
   - New Line Break properties implemented with hard-coded UnicodeSets. (unmaintainable)
   - RBBIMonkeyTest not updated. (There are two ICU monkey tests; the other is updated.)

@markusicu left a comment

  1. thanks!!
  2. looks plausible to me
  3. I don't pretend to have reviewed the rules or logic.
  4. some code style comments

icu4c/source/test/intltest/rbbitst.cpp
UnicodeString(rules, -1, US_INV), 0, status);
UnicodeString CMx {uR"([[\p{Line_Break=CM}]\u200d])"};
UnicodeString rules;
rules = rules + u"((\\p{Line_Break=PR}|\\p{Line_Break=PO})(" + CMx + u")*)?"

optional / simpler:

UnicodeString rules =
    u"..."
    u"..."
    u"...";

using C++ string literal concatenation in the compiler.

Author

That does not work: CMx is not a literal (it would work if we made it a macro; do we want that?).
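
To make the point of this exchange concrete, here is a minimal sketch of the three options under discussion. It is not code from the PR. `std::u16string` stands in for `icu::UnicodeString` so that the example compiles without ICU, and the names `kLiteralOnly`, `buildWithVariable`, `buildWithMacro`, and `CMX_LITERAL` are hypothetical, introduced only for illustration.

```cpp
#include <cassert>  // only needed for the checks mentioned in the usage note
#include <string>

// Assumption: std::u16string is used as a stand-in for icu::UnicodeString.

// (a) Adjacent string literals are merged by the compiler (translation
//     phase 6), so this is one literal with no runtime concatenation.
const std::u16string kLiteralOnly =
    u"(\\p{Line_Break=PR}|\\p{Line_Break=PO})"
    u"?";

// (b) A named object is not a literal, so placing it next to a literal is a
//     syntax error:
//         u"(" CMx u")*"   // does not compile
//     Runtime concatenation with operator+ is required instead.
const std::u16string CMx = uR"([[\p{Line_Break=CM}]\u200d])";

std::u16string buildWithVariable() {
    std::u16string rules;
    rules = rules + u"((\\p{Line_Break=PR}|\\p{Line_Break=PO})(" + CMx + u")*)?";
    return rules;
}

// (c) If the fragment were a macro that expands to a literal, adjacent-literal
//     concatenation would work again. CMX_LITERAL is a hypothetical name.
#define CMX_LITERAL u"[[\\p{Line_Break=CM}]\\u200d]"

std::u16string buildWithMacro() {
    return u"((\\p{Line_Break=PR}|\\p{Line_Break=PO})(" CMX_LITERAL u")*)?";
}
```

Both builders produce the same rule text, which can be checked with `assert(buildWithVariable() == buildWithMacro());`. The macro variant trades a named, typed object for compile-time concatenation, which is the style question raised in the reply above.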

eggrobin added a commit to eggrobin/icu that referenced this pull request Jul 7, 2023
@echeran echeran merged commit f1a9e57 into echeran:ICU-22404-pt1 Jul 10, 2023
70 checks passed
4 participants