Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode 15.1 linebreaking #48

Merged
merged 35 commits into from
Jul 10, 2023
Merged
Show file tree
Hide file tree
Changes from 34 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
7a47dfb
ICU-22039 Line Break on Orthographic Syllable Boundaries
aheninger Jul 29, 2022
32dc983
eggrobin May 31, 2023
69766ff
these properties exist now
eggrobin May 31, 2023
6cfcc5d
a confused attempt at LB 15.
eggrobin May 31, 2023
d4f2014
Merge remote-tracking branch 'elango/ICU-22404-pt1' into 15.1-linebre…
eggrobin Jun 12, 2023
9f10815
Show the breakages
eggrobin Jun 13, 2023
2d9ddc3
Something that passes LineBreakTest.txt
eggrobin Jun 13, 2023
e35c0a8
update the C++ monkey for the PAG changes to LB28b.
eggrobin Jun 15, 2023
ffff12a
the monkey is angry
eggrobin Jun 15, 2023
429d014
}
eggrobin Jun 15, 2023
31ac7c3
logic is hard
eggrobin Jun 16, 2023
2fda5a6
We should probably have ZW in there.
eggrobin Jun 16, 2023
04b5cea
more edge cases
eggrobin Jun 16, 2023
091eb59
typo
eggrobin Jun 16, 2023
ff3ce87
Manual chaining
eggrobin Jun 19, 2023
979cbc8
no $CM* on $SP+
eggrobin Jun 19, 2023
a231298
Don’t chain on CM. This does not fix the bugs.
eggrobin Jun 20, 2023
5eaa1a6
increment in the iteration-expression, not in the condition
eggrobin Jun 20, 2023
c6020f2
indexing is hard
eggrobin Jun 20, 2023
aa75efa
Reorder ICU tailorings so « .123 » does not break.
eggrobin Jun 20, 2023
b0111fa
Fix nested openings, add manual tests
eggrobin Jun 20, 2023
a987ef9
No need for lb class adjustments now that the new ones exist
eggrobin Jun 20, 2023
5ae8c50
symmetry
eggrobin Jun 20, 2023
69553e7
28b to 28a, see unicode-org/edcom#53.
eggrobin Jun 21, 2023
53d8ebf
another test for good measure
eggrobin Jun 27, 2023
fac1d9d
Seemingly appease the new monkey
eggrobin Jun 27, 2023
6d7fa58
propagate renumbering
eggrobin Jun 27, 2023
e26724a
redundant []
eggrobin Jun 28, 2023
d284114
amend all the tailorings
eggrobin Jun 28, 2023
0a723e4
amend the new monkeys for all the tailorings
eggrobin Jun 28, 2023
8b3c14c
rules_update ICU4J step 1, update rbbitst.txt
eggrobin Jun 28, 2023
b819d89
rules_update ICU4J step 2, refresh data
eggrobin Jun 28, 2023
f5ec852
rules_update ICU4J 3 port java code monkeys
eggrobin Jun 28, 2023
6696161
Update new Java monkeys
eggrobin Jun 28, 2023
d44c91d
int32_t and weird indent
eggrobin Jun 30, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 30 additions & 5 deletions icu4c/source/data/brkitr/rules/line.txt
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,10 @@
!!quoted_literals_only;

$AI = [:LineBreak = Ambiguous:];
$AK = [:LineBreak = Aksara:];
$AL = [:LineBreak = Alphabetic:];
$AP = [:LineBreak = Aksara_Prebase:];
$AS = [:LineBreak = Aksara_Start:];
$BA = [:LineBreak = Break_After:];
$HH = [\u2010]; # \u2010 is HYPHEN, default line break is BA.
$BB = [:LineBreak = Break_Before:];
Expand Down Expand Up @@ -64,6 +67,8 @@ $SA = [:LineBreak = Complex_Context:];
$SG = [:LineBreak = Surrogate:];
$SP = [:LineBreak = Space:];
$SY = [:LineBreak = Break_Symbols:];
$VF = [:LineBreak = Virama_Final:];
$VI = [:LineBreak = Virama:];
$WJ = [:LineBreak = Word_Joiner:];
$XX = [:LineBreak = Unknown:];
$ZW = [:LineBreak = ZWSpace:];
Expand Down Expand Up @@ -215,7 +220,27 @@ $OP $CM* $SP+ $CM+ $AL_FOLLOW?; # by rule 10, stand-alone CM behaves as AL
# by rule 8, CM following a SP is stand-alone.


# LB 14a Force a break before start of a number with a leading decimal pt, e.g. " .23"
# LB 15a
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ .;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ .;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;

# LB 15b
$LB8NonBreaks [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
$CAN_CM $CM* [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
^$CM+ [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];

# Messy interaction: manually chain between LB 15b and LB 15a on Pf Pi.
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;


# LB 15c Force a break before start of a number with a leading decimal pt, e.g. " .23"
# Note: would be simpler to express as "$SP / $IS $CM* $NU;", but ICU rules have limitations.
# See issue ICU-20303

Expand All @@ -225,7 +250,7 @@ $SP $IS / [^ $CanFollowIS $NU $CM];
$SP $IS $CM* $CMX / [^ $CanFollowIS $NU $CM];

#
# LB 14b Do not break before numeric separators (IS), even after spaces.
# LB 15d Do not break before numeric separators (IS), even after spaces.

[$LB8NonBreaks - $SP] $IS;
$SP $IS $CM* [$CanFollowIS {eof}];
Expand All @@ -235,9 +260,6 @@ $CAN_CM $CM* $IS;
^$CM+ $IS; # by rule 10, stand-alone CM behaves as AL


# LB 15
$QU $CM* $SP* $OP;

# LB 16
($CL | $CP) $CM* $SP* $NS;

Expand Down Expand Up @@ -338,6 +360,9 @@ $PR $CM* ($JL | $JV | $JT | $H2 | $H3);
($ALPlus | $HL) $CM* ($ALPlus | $HL);
^$CM+ ($ALPlus | $HL); # The $CM+ is from rule 10, an unattached CM is treated as AL

#LB 28a Do not break Orthographic syllables
($AP $CM*)? ($AS | $AK | [◌] ) ($CM* $VI $CM* ($AK | [◌] ))* ($CM* $VI | (($CM* ($AS | $AK | [◌] ) )? $CM* $VF))?;

# LB 29
$IS $CM* ($ALPlus | $HL);

Expand Down
35 changes: 30 additions & 5 deletions icu4c/source/data/brkitr/rules/line_cj.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,10 @@
!!quoted_literals_only;

$AI = [:LineBreak = Ambiguous:];
$AK = [:LineBreak = Aksara:];
$AL = [:LineBreak = Alphabetic:];
$AP = [:LineBreak = Aksara_Prebase:];
$AS = [:LineBreak = Aksara_Start:];
$BA = [:LineBreak = Break_After:];
$HH = [\u2010]; # \u2010 is HYPHEN, default line break is BA.
$BB = [:LineBreak = Break_Before:];
Expand Down Expand Up @@ -65,6 +68,8 @@ $SA = [:LineBreak = Complex_Context:];
$SG = [:LineBreak = Surrogate:];
$SP = [:LineBreak = Space:];
$SY = [:LineBreak = Break_Symbols:];
$VF = [:LineBreak = Virama_Final:];
$VI = [:LineBreak = Virama:];
$WJ = [:LineBreak = Word_Joiner:];
$XX = [:LineBreak = Unknown:];
$ZW = [:LineBreak = ZWSpace:];
Expand Down Expand Up @@ -216,7 +221,27 @@ $OP $CM* $SP+ $CM+ $AL_FOLLOW?; # by rule 10, stand-alone CM behaves as AL
# by rule 8, CM following a SP is stand-alone.


# LB 14a Force a break before start of a number with a leading decimal pt, e.g. " .23"
# LB 15a
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ .;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ .;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;

# LB 15b
$LB8NonBreaks [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
$CAN_CM $CM* [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
^$CM+ [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];

# Messy interaction: manually chain between LB 15b and LB 15a on Pf Pi.
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;


# LB 15c Force a break before start of a number with a leading decimal pt, e.g. " .23"
# Note: would be simpler to express as "$SP / $IS $CM* $NU;", but ICU rules have limitations.
# See issue ICU-20303

Expand All @@ -226,7 +251,7 @@ $SP $IS / [^ $CanFollowIS $NU $CM];
$SP $IS $CM* $CMX / [^ $CanFollowIS $NU $CM];

#
# LB 14b Do not break before numeric separators (IS), even after spaces.
# LB 15d Do not break before numeric separators (IS), even after spaces.

[$LB8NonBreaks - $SP] $IS;
$SP $IS $CM* [$CanFollowIS {eof}];
Expand All @@ -236,9 +261,6 @@ $CAN_CM $CM* $IS;
^$CM+ $IS; # by rule 10, stand-alone CM behaves as AL


# LB 15
$QU $CM* $SP* $OP;

# LB 16
($CL | $CP) $CM* $SP* $NS;

Expand Down Expand Up @@ -339,6 +361,9 @@ $PR $CM* ($JL | $JV | $JT | $H2 | $H3);
($ALPlus | $HL) $CM* ($ALPlus | $HL);
^$CM+ ($ALPlus | $HL); # The $CM+ is from rule 10, an unattached CM is treated as AL

#LB 28a Do not break Orthographic syllables
($AP $CM*)? ($AS | $AK | [◌] ) ($CM* $VI $CM* ($AK | [◌] ))* ($CM* $VI | (($CM* ($AS | $AK | [◌] ) )? $CM* $VF))?;

# LB 29
$IS $CM* ($ALPlus | $HL);

Expand Down
35 changes: 30 additions & 5 deletions icu4c/source/data/brkitr/rules/line_loose.txt
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,10 @@
!!quoted_literals_only;

$AI = [:LineBreak = Ambiguous:];
$AK = [:LineBreak = Aksara:];
$AL = [:LineBreak = Alphabetic:];
$AP = [:LineBreak = Aksara_Prebase:];
$AS = [:LineBreak = Aksara_Start:];
$BA = [:LineBreak = Break_After:];
$HH = [\u2010]; # \u2010 is HYPHEN, default line break is BA.
$BB = [:LineBreak = Break_Before:];
Expand Down Expand Up @@ -71,6 +74,8 @@ $SA = [:LineBreak = Complex_Context:];
$SG = [:LineBreak = Surrogate:];
$SP = [:LineBreak = Space:];
$SY = [:LineBreak = Break_Symbols:];
$VF = [:LineBreak = Virama_Final:];
$VI = [:LineBreak = Virama:];
$WJ = [:LineBreak = Word_Joiner:];
$XX = [:LineBreak = Unknown:];
$ZW = [:LineBreak = ZWSpace:];
Expand Down Expand Up @@ -222,7 +227,27 @@ $OP $CM* $SP+ $CM+ $AL_FOLLOW?; # by rule 10, stand-alone CM behaves as AL
# by rule 8, CM following a SP is stand-alone.


# LB 14a Force a break before start of a number with a leading decimal pt, e.g. " .23"
# LB 15a
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ .;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ .;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;

# LB 15b
$LB8NonBreaks [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
$CAN_CM $CM* [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
^$CM+ [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];

# Messy interaction: manually chain between LB 15b and LB 15a on Pf Pi.
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;


# LB 15c Force a break before start of a number with a leading decimal pt, e.g. " .23"
# Note: would be simpler to express as "$SP / $IS $CM* $NU;", but ICU rules have limitations.
# See issue ICU-20303

Expand All @@ -232,7 +257,7 @@ $SP $IS / [^ $CanFollowIS $NU $CM];
$SP $IS $CM* $CMX / [^ $CanFollowIS $NU $CM];

#
# LB 14b Do not break before numeric separators (IS), even after spaces.
# LB 15d Do not break before numeric separators (IS), even after spaces.

[$LB8NonBreaks - $SP] $IS;
$SP $IS $CM* [$CanFollowIS {eof}];
Expand All @@ -242,9 +267,6 @@ $CAN_CM $CM* $IS;
^$CM+ $IS; # by rule 10, stand-alone CM behaves as AL


# LB 15
$QU $CM* $SP* $OP;

# LB 16
# Do not break between closing punctuation and $NS, even with intervening spaces
# But DO allow a break between closing punctuation and $NSX, don't include it here
Expand Down Expand Up @@ -349,6 +371,9 @@ $PR $CM* ($JL | $JV | $JT | $H2 | $H3);
($ALPlus | $HL) $CM* ($ALPlus | $HL);
^$CM+ ($ALPlus | $HL); # The $CM+ is from rule 10, an unattached CM is treated as AL

#LB 28a Do not break Orthographic syllables
($AP $CM*)? ($AS | $AK | [◌] ) ($CM* $VI $CM* ($AK | [◌] ))* ($CM* $VI | (($CM* ($AS | $AK | [◌] ) )? $CM* $VF))?;

# LB 29
$IS $CM* ($ALPlus | $HL);

Expand Down
35 changes: 30 additions & 5 deletions icu4c/source/data/brkitr/rules/line_loose_cj.txt
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,10 @@
!!quoted_literals_only;

$AI = [:LineBreak = Ambiguous:];
$AK = [:LineBreak = Aksara:];
$AL = [:LineBreak = Alphabetic:];
$AP = [:LineBreak = Aksara_Prebase:];
$AS = [:LineBreak = Aksara_Start:];
$BAX = [\u2010 \u2013];
$BA = [[:LineBreak = Break_After:] - $BAX];
$HH = [\u2010]; # \u2010 is HYPHEN, default line break is BA.
Expand Down Expand Up @@ -83,6 +86,8 @@ $SA = [:LineBreak = Complex_Context:];
$SG = [:LineBreak = Surrogate:];
$SP = [:LineBreak = Space:];
$SY = [:LineBreak = Break_Symbols:];
$VF = [:LineBreak = Virama_Final:];
$VI = [:LineBreak = Virama:];
$WJ = [:LineBreak = Word_Joiner:];
$XX = [:LineBreak = Unknown:];
$ZW = [:LineBreak = ZWSpace:];
Expand Down Expand Up @@ -234,7 +239,27 @@ $OP $CM* $SP+ $CM+ $AL_FOLLOW?; # by rule 10, stand-alone CM behaves as AL
# by rule 8, CM following a SP is stand-alone.


# LB 14a Force a break before start of a number with a leading decimal pt, e.g. " .23"
# LB 15a
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ .;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ .;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;

# LB 15b
$LB8NonBreaks [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
$CAN_CM $CM* [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
^$CM+ [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];

# Messy interaction: manually chain between LB 15b and LB 15a on Pf Pi.
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;


# LB 15c Force a break before start of a number with a leading decimal pt, e.g. " .23"
# Note: would be simpler to express as "$SP / $IS $CM* $NU;", but ICU rules have limitations.
# See issue ICU-20303

Expand All @@ -244,7 +269,7 @@ $SP $IS / [^ $CanFollowIS $NU $CM];
$SP $IS $CM* $CMX / [^ $CanFollowIS $NU $CM];

#
# LB 14b Do not break before numeric separators (IS), even after spaces.
# LB 15d Do not break before numeric separators (IS), even after spaces.

[$LB8NonBreaks - $SP] $IS;
$SP $IS $CM* [$CanFollowIS {eof}];
Expand All @@ -254,9 +279,6 @@ $CAN_CM $CM* $IS;
^$CM+ $IS; # by rule 10, stand-alone CM behaves as AL


# LB 15
$QU $CM* $SP* $OP;

# LB 16
# Do not break between closing punctuation and $NS, even with intervening spaces
# But DO allow a break between closing punctuation and $NSX, don't include it here
Expand Down Expand Up @@ -367,6 +389,9 @@ $PR $CM* ($JL | $JV | $JT | $H2 | $H3);
($ALPlus | $HL) $CM* ($ALPlus | $HL);
^$CM+ ($ALPlus | $HL); # The $CM+ is from rule 10, an unattached CM is treated as AL

#LB 28a Do not break Orthographic syllables
($AP $CM*)? ($AS | $AK | [◌] ) ($CM* $VI $CM* ($AK | [◌] ))* ($CM* $VI | (($CM* ($AS | $AK | [◌] ) )? $CM* $VF))?;

# LB 29
$IS $CM* ($ALPlus | $HL);

Expand Down
35 changes: 30 additions & 5 deletions icu4c/source/data/brkitr/rules/line_loose_phrase_cj.txt
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,10 @@
!!quoted_literals_only;

$AI = [:LineBreak = Ambiguous:];
$AK = [:LineBreak = Aksara:];
$AL = [:LineBreak = Alphabetic:];
$AP = [:LineBreak = Aksara_Prebase:];
$AS = [:LineBreak = Aksara_Start:];
$BAX = [\u2010 \u2013];
$BA = [[:LineBreak = Break_After:] - $BAX];
$HH = [\u2010]; # \u2010 is HYPHEN, default line break is BA.
Expand Down Expand Up @@ -85,6 +88,8 @@ $SA = [:LineBreak = Complex_Context:];
$SG = [:LineBreak = Surrogate:];
$SP = [:LineBreak = Space:];
$SY = [:LineBreak = Break_Symbols:];
$VF = [:LineBreak = Virama_Final:];
$VI = [:LineBreak = Virama:];
$WJ = [:LineBreak = Word_Joiner:];
$XX = [:LineBreak = Unknown:];
$ZW = [:LineBreak = ZWSpace:];
Expand Down Expand Up @@ -247,7 +252,27 @@ $OP $CM* $SP+ $CM+ $AL_FOLLOW?; # by rule 10, stand-alone CM behaves as AL
# by rule 8, CM following a SP is stand-alone.


# LB 14a Force a break before start of a number with a leading decimal pt, e.g. " .23"
# LB 15a
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ .;
($OP $CM* $SP+ | [$OP $QU $GL] $CM*) ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^([\p{Pi} & $QU] $CM* $SP*)+ .;
^([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;

# LB 15b
$LB8NonBreaks [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
$CAN_CM $CM* [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];
^$CM+ [\p{Pf} & $QU] $CM* [$SP $GL $WJ $CL $QU $CP $EX $IS $SY $BK $CR $LF $NL $ZW {eof}];

# Messy interaction: manually chain between LB 15b and LB 15a on Pf Pi.
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$LB8NonBreaks [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
$CAN_CM $CM* [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ .;
^$CM+ [\p{Pf} & $QU] $CM* ([\p{Pi} & $QU] $CM* $SP*)+ $SP $CM+ $AL_FOLLOW?;


# LB 15c Force a break before start of a number with a leading decimal pt, e.g. " .23"
# Note: would be simpler to express as "$SP / $IS $CM* $NU;", but ICU rules have limitations.
# See issue ICU-20303

Expand All @@ -257,7 +282,7 @@ $SP $IS / [^ $CanFollowIS $NU $CM];
$SP $IS $CM* $CMX / [^ $CanFollowIS $NU $CM];

#
# LB 14b Do not break before numeric separators (IS), even after spaces.
# LB 15d Do not break before numeric separators (IS), even after spaces.

[$LB8NonBreaks - $SP] $IS;
$SP $IS $CM* [$CanFollowIS {eof}];
Expand All @@ -267,9 +292,6 @@ $CAN_CM $CM* $IS;
^$CM+ $IS; # by rule 10, stand-alone CM behaves as AL


# LB 15
$QU $CM* $SP* $OP;

# LB 16
# Do not break between closing punctuation and $NS, even with intervening spaces
# But DO allow a break between closing punctuation and $NSX, don't include it here
Expand Down Expand Up @@ -380,6 +402,9 @@ $PR $CM* ($JL | $JV | $JT | $H2 | $H3);
($ALPlus | $HL) $CM* ($ALPlus | $HL);
^$CM+ ($ALPlus | $HL); # The $CM+ is from rule 10, an unattached CM is treated as AL

#LB 28a Do not break Orthographic syllables
($AP $CM*)? ($AS | $AK | [◌] ) ($CM* $VI $CM* ($AK | [◌] ))* ($CM* $VI | (($CM* ($AS | $AK | [◌] ) )? $CM* $VF))?;

# LB 29
$IS $CM* ($ALPlus | $HL);

Expand Down
Loading