unicode-org · eggrobin · Aug 10, 2023 · Jul 24, 2023
diff --git a/docs/userguide/boundaryanalysis/break-rules.md b/docs/userguide/boundaryanalysis/break-rules.md
@@ -113,7 +113,7 @@ These rules will match "`abc`", "`hello_world`", `"hi-there"`,
 They will not match "`-abc`", "`multiple__joiners`", "`tail-`"
 
 A full match is composed of pieces or submatches, possibly from different rules,
-with adjacent submatches linked by at least one overlapping character.
+with adjacent submatches linked by one overlapping character.
 
 In the example below, matching "`hello_world`",
 

diff --git a/docs/userguide/dev/rules_update.md b/docs/userguide/dev/rules_update.md
@@ -125,20 +125,23 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
     For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`.
     (If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.)
 
-    Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../userguide/boundaryanalysis/break-rules.md) for an explanation of rule syntax and behavior.
+    Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../boundaryanalysis/break-rules) for an explanation of rule syntax and behavior.
 
     The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include:
 
     -  All ICU rules run in parallel, while UAX/CLDR rules are applied sequentially, stopping after the first match. The ICU rules sometimes require extra logic to prevent a later rule from preempting an earlier rule. This can be quite tricky to express.
 
     -  ICU rules match a run of text that does not have boundaries in its interior (unless the rule contains a "hard break", represented by a '/'. UAX and CLDR rules, on the other hand, tell whether a single text position is or is not a break, with the rule expressing pre and post context around that position. This transformation is generally not hard, and the ICU form  of the rules is often simpler.
 
+    In the line break rules, for the most part, rules begin and end with required (non-conditional) characters of some class other than `$CM`. By convention, line break rules never chain on `$CM`. Rules beginning with a combining mark all have the form `^$CM+ $Something`, meaning that they match only at the start of text and cannot be chained into. This helps keep the overall chaining behavior of the line break rules somewhat easier to understand.
 
 4.  **Rebuild the ICU data with the updated rules.**
 
         cd icu4c/source/data
         make
 
+    This runs the `genbrk` tool.  The tool looks for a Unicode signature byte sequence (BOM) and otherwise assumes that the rule file is encoded as UTF-8; in the ICU source tree, rule files are encoded as UTF-8.
+
 5.  **Rerun the data-driven test**, `rbbi/TestExtended`. With luck, it may pass. Failures fall into two classes:
 
     - The newly added test failed. Either something is wrong with the test cases, or something is wrong with the rule updates.
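
The encoding behavior described in this change (look for a Unicode signature byte sequence, otherwise assume UTF-8) can be illustrated with a minimal standalone sketch. This is not the `genbrk` implementation: the names `RuleFileEncoding`, `DetectedEncoding`, and `detectRuleFileEncoding` are hypothetical, and the sketch only checks the UTF-8 and UTF-16 signatures by hand. ICU itself offers `ucnv_detectUnicodeSignature` for this kind of check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; this is a hand-written check,
// not the code used by genbrk.
enum class RuleFileEncoding { Utf8, Utf16BE, Utf16LE };

struct DetectedEncoding {
    RuleFileEncoding encoding;
    std::size_t signatureLength;  // number of leading bytes to skip
};

// Inspect the leading bytes of a rule file for a Unicode signature (BOM).
// When no signature is present, fall back to UTF-8.
// Simplification: UTF-32 signatures are not recognized by this sketch.
inline DetectedEncoding detectRuleFileEncoding(const std::string &bytes) {
    auto byteAt = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 &&
        byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF) {
        return {RuleFileEncoding::Utf8, 3};     // UTF-8 signature
    }
    if (bytes.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF) {
        return {RuleFileEncoding::Utf16BE, 2};  // UTF-16 big-endian BOM
    }
    if (bytes.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE) {
        return {RuleFileEncoding::Utf16LE, 2};  // UTF-16 little-endian BOM
    }
    return {RuleFileEncoding::Utf8, 0};         // no signature: assume UTF-8
}
```

A caller would skip `signatureLength` bytes and then decode the rest of the file with the reported encoding. Under this scheme a UTF-16 file without a BOM would be read as UTF-8, which is why the updated source comment requires a BOM for UTF-16 input.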

diff --git a/icu4c/source/tools/genbrk/genbrk.cpp b/icu4c/source/tools/genbrk/genbrk.cpp
@@ -22,9 +22,8 @@
 //
 //   The input rule file is a plain text file containing break rules
 //    in the input format accepted by RuleBasedBreakIterators.  The
-//    file can be encoded as utf-8, or utf-16 (either endian), or
-//    in the default code page (platform dependent.).  utf encoded
-//    files must include a BOM.
+//    file can be encoded as UTF-8 or UTF-16 (either endian).  Files
+//    encoded as UTF-16 must include a BOM.
 //
 //--------------------------------------------------------------------
 
@@ -63,7 +62,9 @@ static UOption options[]={
 
 void usageAndDie(int retCode) {
         printf("Usage: %s [-v] [-options] -r rule-file -o output-file\n", progName);
-        printf("\tRead in break iteration rules text and write out the binary data\n"
+        printf("\tRead in break iteration rules text and write out the binary data.\n"
+            "\tIf the rule file does not have a Unicode signature byte sequence, it is assumed\n"
+            "\tto be UTF-8.\n"
             "options:\n"
             "\t-h or -? or --help  this usage text\n"
             "\t-V or --version     show a version message\n"