From 890fc6296f618d1b6ed7bf5c58e2163111d7555c Mon Sep 17 00:00:00 2001
From: Robin Leroy
Date: Mon, 24 Jul 2023 14:23:57 +0200
Subject: [PATCH] ICU-22404 Improve documentation of segmentation rules

---
 docs/userguide/boundaryanalysis/break-rules.md | 2 +-
 docs/userguide/dev/rules_update.md             | 5 ++++-
 icu4c/source/tools/genbrk/genbrk.cpp           | 9 +++++----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/docs/userguide/boundaryanalysis/break-rules.md b/docs/userguide/boundaryanalysis/break-rules.md
index bfd756cbb6ba..afc28829133a 100644
--- a/docs/userguide/boundaryanalysis/break-rules.md
+++ b/docs/userguide/boundaryanalysis/break-rules.md
@@ -113,7 +113,7 @@ These rules will match "`abc`", "`hello_world`", `"hi-there"`,
 They will not match "`-abc`", "`multiple__joiners`", "`tail-`"

 A full match is composed of pieces or submatches, possibly from different rules,
-with adjacent submatches linked by at least one overlapping character.
+with adjacent submatches linked by one overlapping character.

 In the example below, matching "`hello_world`",

diff --git a/docs/userguide/dev/rules_update.md b/docs/userguide/dev/rules_update.md
index 664f644b4212..3a27420a719f 100644
--- a/docs/userguide/dev/rules_update.md
+++ b/docs/userguide/dev/rules_update.md
@@ -125,7 +125,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov

 For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`. (If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.)

- Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../userguide/boundaryanalysis/break-rules.md) for an explanation of rule syntax and behavior.
+ Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../boundaryanalysis/break-rules) for an explanation of rule syntax and behavior.

 The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include:

@@ -133,12 +133,15 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
 - ICU rules match a run of text that does not have boundaries in its interior (unless the rule contains a "hard break", represented by a '/'.
 UAX and CLDR rules, on the other hand, tell whether a single text position is or is not a break, with the rule expressing pre and post context around that position.
 This transformation is generally not hard, and the ICU form of the rules is often simpler.
+ In the line break rules, for the most part, rules begin and end with required (non-conditional) characters of some class other than `$CM`. By convention, line break rules never chain on `$CM`. Rules beginning with a combining mark all have the form `^$CM+ $Something`, meaning that they only match at the start of text, that they can't be chained into. This helps keep the overall chaining behavior of the line break rules somewhat easier to understand.

 4. **Rebuild the ICU data with the updated rules.**

 cd icu4c/source/data
 make

+ This runs the `genbrk` tool. The tool looks for a Unicode signature byte sequence and otherwise assumes the rule files are encoded as UTF-8; in the ICU source tree, rule files are encoded as UTF-8.
+
 5. **Rerun the data-driven test**, `rbbi/TestExtended`. With luck, it may pass. Failures fall into two classes:

 - The newly added test failed. Either something is wrong with the test cases, or something is wrong with the rule updates.
diff --git a/icu4c/source/tools/genbrk/genbrk.cpp b/icu4c/source/tools/genbrk/genbrk.cpp
index 9e1719390909..4cee4e12c6e1 100644
--- a/icu4c/source/tools/genbrk/genbrk.cpp
+++ b/icu4c/source/tools/genbrk/genbrk.cpp
@@ -22,9 +22,8 @@
 //
 // The input rule file is a plain text file containing break rules
 // in the input format accepted by RuleBasedBreakIterators. The
-// file can be encoded as utf-8, or utf-16 (either endian), or
-// in the default code page (platform dependent.). utf encoded
-// files must include a BOM.
+// file can be encoded as UTF-8 or UTF-16 (either endian). Files
+// encoded as UTF-16 must include a BOM.
 //
 //--------------------------------------------------------------------

@@ -63,7 +62,9 @@ static UOption options[]={

 void usageAndDie(int retCode) {
     printf("Usage: %s [-v] [-options] -r rule-file -o output-file\n", progName);
-    printf("\tRead in break iteration rules text and write out the binary data\n"
+    printf("\tRead in break iteration rules text and write out the binary data.\n"
+        "\tIf the rule file does not have a Unicode signature byte sequence, it is assumed\n"
+        "\tto be UTF-8.\n"
         "options:\n"
         "\t-h or -? or --help this usage text\n"
         "\t-V or --version show a version message\n"
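
The encoding rule described in the updated comment and usage text (a Unicode signature selects the encoding; without one, the file is treated as UTF-8) can be illustrated with a small standalone sketch. This is not the code in `genbrk.cpp`, which is not shown in this patch; the names `RuleEncoding`, `DetectedEncoding`, and `detectRuleEncoding` are hypothetical, and the sketch only considers the UTF-8 and UTF-16 signatures that the comment mentions. ICU's converter API offers `ucnv_detectUnicodeSignature` for the same kind of check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only; not taken from genbrk.cpp.
enum class RuleEncoding { Utf8, Utf16BE, Utf16LE };

struct DetectedEncoding {
    RuleEncoding encoding;
    std::size_t signatureLength;  // number of signature (BOM) bytes to skip
};

// Inspect the leading bytes of a rule file. A recognized signature selects
// the encoding; anything else falls back to UTF-8 with nothing to skip.
// Assumption: UTF-32 signatures are out of scope for this sketch.
DetectedEncoding detectRuleEncoding(const std::string& bytes) {
    const auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) {
        return {RuleEncoding::Utf8, 3};     // UTF-8 signature (optional)
    }
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) {
        return {RuleEncoding::Utf16BE, 2};  // UTF-16 big-endian BOM
    }
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) {
        return {RuleEncoding::Utf16LE, 2};  // UTF-16 little-endian BOM
    }
    return {RuleEncoding::Utf8, 0};         // no signature: assume UTF-8
}
```

A caller would skip `signatureLength` bytes and decode the rest with the reported encoding. The fallback branch is what makes a BOM optional for UTF-8 rule files while UTF-16 files still need one.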