ICU-22404 Improve documentation of segmentation rules #2532

Merged: 1 commit, Aug 10, 2023
2 changes: 1 addition & 1 deletion docs/userguide/boundaryanalysis/break-rules.md
@@ -113,7 +113,7 @@ These rules will match "`abc`", "`hello_world`", `"hi-there"`,
They will not match "`-abc`", "`multiple__joiners`", "`tail-`"

A full match is composed of pieces or submatches, possibly from different rules,
-with adjacent submatches linked by at least one overlapping character.
+with adjacent submatches linked by one overlapping character.

In the example below, matching "`hello_world`",

5 changes: 4 additions & 1 deletion docs/userguide/dev/rules_update.md
@@ -125,20 +125,23 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`.
(If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.)

-Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../userguide/boundaryanalysis/break-rules.md) for an explanation of rule syntax and behavior.
+Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../boundaryanalysis/break-rules) for an explanation of rule syntax and behavior.

The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include:

- All ICU rules run in parallel, while UAX/CLDR rules are applied sequentially, stopping after the first match. The ICU rules sometimes require extra logic to prevent a later rule from preempting an earlier rule. This can be quite tricky to express.

- ICU rules match a run of text that does not have boundaries in its interior (unless the rule contains a "hard break", represented by a '/'). UAX and CLDR rules, on the other hand, tell whether a single text position is or is not a break, with the rule expressing pre and post context around that position. This transformation is generally not hard, and the ICU form of the rules is often simpler.

In the line break rules, for the most part, rules begin and end with required (non-conditional) characters of some class other than `$CM`. By convention, line break rules never chain on `$CM`. Rules beginning with a combining mark all have the form `^$CM+ $Something`, meaning that they only match at the start of text and that they can't be chained into. This helps keep the overall chaining behavior of the line break rules somewhat easier to understand.

4. **Rebuild the ICU data with the updated rules.**

cd icu4c/source/data
make

This runs the `genbrk` tool. The tool looks for a Unicode signature byte sequence and otherwise assumes the rule files are encoded as UTF-8; in the ICU source tree, rule files are encoded as UTF-8.

5. **Rerun the data-driven test**, `rbbi/TestExtended`. With luck, it may pass. Failures fall into two classes:

- The newly added test failed. Either something is wrong with the test cases, or something is wrong with the rule updates.
9 changes: 5 additions & 4 deletions icu4c/source/tools/genbrk/genbrk.cpp
@@ -22,9 +22,8 @@
//
// The input rule file is a plain text file containing break rules
// in the input format accepted by RuleBasedBreakIterators. The
-// file can be encoded as utf-8, or utf-16 (either endian), or
-// in the default code page (platform dependent.). utf encoded
-// files must include a BOM.
+// file can be encoded as UTF-8 or UTF-16 (either endian). Files
+// encoded as UTF-16 must include a BOM.
//
//--------------------------------------------------------------------

@@ -63,7 +62,9 @@ static UOption options[]={

void usageAndDie(int retCode) {
printf("Usage: %s [-v] [-options] -r rule-file -o output-file\n", progName);
-printf("\tRead in break iteration rules text and write out the binary data\n"
+printf("\tRead in break iteration rules text and write out the binary data.\n"
+"\tIf the rule file does not have a Unicode signature byte sequence, it is assumed\n"
+"\tto be UTF-8.\n"
"options:\n"
"\t-h or -? or --help this usage text\n"
"\t-V or --version show a version message\n"
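The encoding behavior described in the updated comment and usage text (look for a Unicode signature byte sequence, otherwise assume UTF-8) can be illustrated with a minimal, self-contained sketch. This is not the code in `genbrk.cpp`, which is not shown in this diff. The names `RuleFileEncoding`, `DetectedEncoding`, and `detectRuleFileEncoding` are hypothetical, and the sketch only distinguishes the UTF-8 and UTF-16 signatures mentioned above. ICU's converter API also provides `ucnv_detectUnicodeSignature` for this kind of check.

```cpp
#include <cstddef>
#include <string>

// Encodings this sketch distinguishes (hypothetical names, not genbrk's).
enum class RuleFileEncoding { Utf8, Utf16BE, Utf16LE };

struct DetectedEncoding {
    RuleFileEncoding encoding;
    std::size_t signatureLength;  // number of leading bytes to skip
};

// Inspect the leading bytes of a rule file for a Unicode signature (BOM).
// Assumption carried over from the usage text: with no signature, the
// data is treated as UTF-8.
inline DetectedEncoding detectRuleFileEncoding(const std::string& bytes) {
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) {
        return {RuleFileEncoding::Utf8, 3};     // UTF-8 signature
    }
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) {
        return {RuleFileEncoding::Utf16BE, 2};  // UTF-16 big-endian BOM
    }
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) {
        return {RuleFileEncoding::Utf16LE, 2};  // UTF-16 little-endian BOM
    }
    return {RuleFileEncoding::Utf8, 0};         // no signature: assume UTF-8
}
```

A caller would read the rule file into a byte string, call `detectRuleFileEncoding`, skip `signatureLength` bytes, and decode the remainder with the reported encoding.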