UCA 16 move numerics after digits; CLDR stop reordering by gc #762

Merged: markusicu merged 2 commits into unicode-org:main from uca16d17-numerics on May 1, 2024

markusicu
Member

@markusicu markusicu commented Apr 6, 2024

  • for PAG issue 99: "re-align the DUCET & CLDR: 20A8 RUPEE SIGN & FDFC RIAL SIGN"
    • makes these sort in CLDR like letter sequences, as it has done in the DUCET
  • for PAG issue 101: "re-align the DUCET & CLDR: order of groups below letters"
    • changes both DUCET & CLDR to sort non-digit numerics after digits
    • as a result, both sort orders are nearly the same
    • exceptions: ten Tibetan contractions, and CLDR tailorings of U+FFFE & U+FFFF

UTC-179: https://www.unicode.org/L2/L2024/24061.htm
PAG report -->
Section 7.4 re-align the DUCET & CLDR: order of groups below letters

  • [179-C38] Consensus: In the UCA DUCET, move the non-decimal-digit numerics to sort right after decimal digits. For Unicode Version 16.0. See document L2/24-064 item 7.4.
  • [179-A123] Action Item for Ken Whistler, PAG: In the UCA DUCET, move the non-decimal-digit numerics to sort right after decimal digits. For Unicode Version 16.0. See document L2/24-064 item 7.4.

Remaining differences between the sort orders:

~/unitools/mine/Generated/UCA/16.0.0$ diff -u Ducet/allkeys_DUCET.txt CollationAuxiliary/allkeys_CLDR.txt
--- Ducet/allkeys_DUCET.txt	2024-04-05 17:13:58.981995514 -0700
+++ CollationAuxiliary/allkeys_CLDR.txt	2024-04-05 17:13:50.253316966 -0700
@@ -1,11 +1,12 @@
-# allkeys_DUCET.txt
-# Date: 2024-04-06, 00:13:58 GMT
+# allkeys_CLDR.txt
+# Date: 2024-04-06, 00:13:49 GMT
 # © 2024 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use, see https://www.unicode.org/terms_of_use.html
 # UCA Version: 16.0.0
 # UCD Version: 16.0.0
-# For a description of the format and usage, see CollationTest.html
+# For a description of the format and usage, see
+# http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files
 
 @version 16.0.0
 
@@ -1620,6 +1621,7 @@
 20E7  ; [.0000.011B.0002] # COMBINING ANNUITY SYMBOL
 20E8  ; [.0000.011C.0002] # COMBINING TRIPLE UNDERDOT
 20E9  ; [.0000.011D.0002] # COMBINING WIDE BRIDGE ABOVE
+FFFE  ; [.0001.0020.0002] # <noncharacter-FFFE>
 0009  ; [*0201.0020.0002] # <CHARACTER TABULATION>
 000A  ; [*0202.0020.0002] # <LINE FEED (LF)>
 000B  ; [*0203.0020.0002] # <LINE TABULATION>
@@ -21467,9 +21469,19 @@
 0F6A  ; [.3793.0020.0004][.0000.011F.0004] # TIBETAN LETTER FIXED-FORM RA
 0FB2  ; [.3794.0020.0002] # TIBETAN SUBJOINED LETTER RA
 0FBC  ; [.3794.0020.0004][.0000.011F.0004] # TIBETAN SUBJOINED LETTER FIXED-FORM RA
+0FB2 0F71 ; [.3794.0020.0002][.37AA.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA
+0FB2 0F71 0F72 ; [.3794.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN I
+0FB2 0F73 ; [.3794.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN II
+0FB2 0F71 0F74 ; [.3794.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN U
+0FB2 0F75 ; [.3794.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN UU
 0F6C  ; [.3795.0020.0002] # TIBETAN LETTER RRA
 0F63  ; [.3796.0020.0002] # TIBETAN LETTER LA
 0FB3  ; [.3797.0020.0002] # TIBETAN SUBJOINED LETTER LA
+0FB3 0F71 ; [.3797.0020.0002][.37AA.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA
+0FB3 0F71 0F72 ; [.3797.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN I
+0FB3 0F73 ; [.3797.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN II
+0FB3 0F71 0F74 ; [.3797.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN U
+0FB3 0F75 ; [.3797.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN UU
 0F64  ; [.3798.0020.0002] # TIBETAN LETTER SHA
 0FB4  ; [.3799.0020.0002] # TIBETAN SUBJOINED LETTER SHA
 0F65  ; [.379A.0020.0002] # TIBETAN LETTER SSA
@@ -39417,6 +39429,5 @@
 2FA14 ; [.FB85.0020.0002][.A291.0000.0000] # CJK COMPATIBILITY IDEOGRAPH-2FA14
 2F88F ; [.FB85.0020.0002][.A392.0000.0000] # CJK COMPATIBILITY IDEOGRAPH-2F88F
 2FA1D ; [.FB85.0020.0002][.A600.0000.0000] # CJK COMPATIBILITY IDEOGRAPH-2FA1D
-FFFE  ; [.FBC1.0020.0002][.FFFE.0000.0000] # <noncharacter-FFFE>
-FFFF  ; [.FBC1.0020.0002][.FFFF.0000.0000] # <noncharacter-FFFF>
 FFFD  ; [.FFFD.0020.0002] # REPLACEMENT CHARACTER
+FFFF  ; [.FFFE.0020.0002] # <noncharacter-FFFF>
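
To make the entries in the diff above easier to read, here is a minimal Java sketch of parsing one allkeys-style data line into code points and collation elements. The class, field, and method names are hypothetical and are not taken from unicodetools. It assumes the `code points ; [.pppp.ssss.tttt]... # comment` layout visible in the diff, where a leading `*` marks a variable element and several code points on the left indicate a contraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: parses one allkeys-style data line. All names here are hypothetical. */
final class AllkeysLineSketch {

    /** One collation element such as [.3794.0020.0002] or [*0201.0020.0002]. */
    static final class CE {
        final boolean variable; // true when the element starts with '*'
        final int primary;
        final int secondary;
        final int tertiary;

        CE(boolean variable, int primary, int secondary, int tertiary) {
            this.variable = variable;
            this.primary = primary;
            this.secondary = secondary;
            this.tertiary = tertiary;
        }
    }

    /** Code points (several for a contraction) and their collation elements. */
    static final class Entry {
        final List<Integer> codePoints = new ArrayList<>();
        final List<CE> ces = new ArrayList<>();
    }

    private static final Pattern CE_PATTERN = Pattern.compile(
            "\\[([.*])(\\p{XDigit}+)\\.(\\p{XDigit}+)\\.(\\p{XDigit}+)\\]");

    static Entry parse(String line) {
        // Strip the trailing "# name" comment, then split "code points ; elements".
        int hash = line.indexOf('#');
        String data = hash >= 0 ? line.substring(0, hash) : line;
        String[] parts = data.split(";", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("not a data line: " + line);
        }
        Entry entry = new Entry();
        for (String hex : parts[0].trim().split("\\s+")) {
            entry.codePoints.add(Integer.parseInt(hex, 16));
        }
        Matcher m = CE_PATTERN.matcher(parts[1]);
        while (m.find()) {
            entry.ces.add(new CE(
                    m.group(1).equals("*"),
                    Integer.parseInt(m.group(2), 16),
                    Integer.parseInt(m.group(3), 16),
                    Integer.parseInt(m.group(4), 16)));
        }
        return entry;
    }

    public static void main(String[] args) {
        Entry e = parse("0FB2 0F71 ; [.3794.0020.0002][.37AA.0020.0002] # contraction");
        System.out.printf("%d code points, %d elements%n", e.codePoints.size(), e.ces.size());
    }
}
```

Header lines such as `@version 16.0.0` and comment-only lines are not data lines; this sketch rejects them with an exception rather than skipping them.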

@Ken-Whistler
Contributor

Markus, you can pick up a small revision of unisift.c from kenfiles/uca160/ to fix the botched edit in the comment.

For PAG issue 101 "re-align the DUCET & CLDR: order of groups below letters"

From Ken:

UCA 16.0 delta 17

This implements the move of the range of non-decimal numerics, so they
get primary weights *after* 0..9 and are no longer marked as variables.

The change to unidata.txt is diffable, although it involves a large
change: 570 lines of input for these numeric entries were moved down
from before the extenders to between the entries for DIGIT NINE (and
others numerically equivalent to 9) and LATIN SMALL LETTER A. And then
there are a few comment lines of explanation added.

The change to allkeys.txt is simply describable, but not really diffable
unless you ignore the primary weight assignments. The numerics are no
longer variables, but now have primary weights in the range 2187..237F,
so sort after DIGIT NINE (with primary weight of 2186), but ahead of
LATIN SMALL LETTER A. Primary weights from LATIN SMALL LETTER A onward
were unaffected, but the move of the numerics shifted all the primary
weights for extenders, currency signs, and digits. The size of the
generated file is identical to the previous one, which is a good sign.
The number of primary weights is also identical, as expected. The first
non-variable is still U+02D0 MODIFIER LETTER TRIANGULAR COLON, as
expected, but its primary weight is 212A, instead of 2323. I assume your
code automatically adjusts to identify the weight of the first non-variable.

I also regenerated decomps.txt. It isn't impacted by the numerics
rearrangement, but it does pick up the additional synthetic
decomposition added for the Tulu-Tigalari looped virama.

You will also need to pick up a small change to the sifter source code
in order to be able to replicate this output: sifter/unisift.c

The change is very small -- I simply had to comment out two lines in the
branch in the main sift dealing with numerics which set the identified
characters to variables. The rest just all falls out automatically given
the change in the input file.
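
The weights quoted above can be cross-checked with a little arithmetic. The following Java sketch uses only the hexadecimal values from Ken's description; the class and method names are hypothetical. It relies on the assumption that the primaries in the moved range are assigned consecutively, in which case the size of the numerics range should equal the amount by which the first non-variable weight dropped.

```java
/** Sketch only: consistency check on the primary weights quoted in the comment. */
final class NumericsShiftSketch {
    // Values quoted in the comment for UCA 16.0 delta 17.
    static final int DIGIT_NINE_PRIMARY = 0x2186;
    static final int NUMERICS_FIRST = 0x2187;
    static final int NUMERICS_LAST = 0x237F;
    static final int FIRST_NON_VARIABLE_OLD = 0x2323; // U+02D0 before the move
    static final int FIRST_NON_VARIABLE_NEW = 0x212A; // U+02D0 after the move

    /** Number of primaries in the numerics range, assuming consecutive assignment. */
    static int numericsRangeSize() {
        return NUMERICS_LAST - NUMERICS_FIRST + 1;
    }

    /** How far the first non-variable primary moved down. */
    static int firstNonVariableShift() {
        return FIRST_NON_VARIABLE_OLD - FIRST_NON_VARIABLE_NEW;
    }

    /** True if the primary falls in the range now used by non-decimal numerics. */
    static boolean isNonDecimalNumericPrimary(int primary) {
        return primary >= NUMERICS_FIRST && primary <= NUMERICS_LAST;
    }

    public static void main(String[] args) {
        System.out.println("range size = " + numericsRangeSize());
        System.out.println("shift      = " + firstNonVariableShift());
    }
}
```

Both quantities come out to 505 (0x1F9), which fits the description that the numerics left the variable range ahead of U+02D0 and now sit directly after DIGIT NINE.
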
@markusicu markusicu force-pushed the uca16d17-numerics branch from 4cac19b to f13b7cf Compare April 6, 2024 02:41
@markusicu
Member Author

Markus, you can pick up a small revision of unisift.c from kenfiles/uca160/ to fix the botched edit in the comment.

Thanks -- I changed the first commit to replace the file there with your fixed version.

@eggrobin eggrobin mentioned this pull request Apr 24, 2024
- for PAG issue 99: "re-align the DUCET & CLDR: 20A8 RUPEE SIGN & FDFC RIAL SIGN"
- for PAG issue 101: "re-align the DUCET & CLDR: order of groups below letters"
@@ -83,7 +83,7 @@ public int compareTo(SecondaryInfo arg0) {
}

static class SecondaryCounts {
-private final UCA uca = UCA.buildCollator(null);
+private final UCA uca = UCA.buildDucetCollator();
Member Author

FYI: The null was for the Remap object which I removed. null meant DUCET, not CLDR.

The internal function now takes two primary weights, which could be -1 for the DUCET. Rather than make several call sites even less readable, I created a function that says "DUCET" and does not take parameters.
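
The pattern described here, a parameterless factory with a descriptive name in front of a general function that accepts sentinel values, might look roughly like the following Java sketch. Everything in it is hypothetical: the class is a stand-in, not the unicodetools `UCA` class, and the assumption that the two primaries are the variable-high and first-non-variable weights is inferred from the other comments in this review.

```java
/** Sketch only: hypothetical stand-in for a collator builder, not the real UCA class. */
final class CollatorFactorySketch {
    /** Sentinel meaning "derive this primary from the data" (the DUCET case). */
    static final int FROM_DATA = -1;

    final int variableHigh;
    final int firstNonVariable;

    private CollatorFactorySketch(int variableHigh, int firstNonVariable) {
        this.variableHigh = variableHigh;
        this.firstNonVariable = firstNonVariable;
    }

    /** General entry point: callers pass both primaries explicitly. */
    static CollatorFactorySketch build(int variableHigh, int firstNonVariable) {
        return new CollatorFactorySketch(variableHigh, firstNonVariable);
    }

    /** Named convenience factory so call sites say "DUCET" instead of build(-1, -1). */
    static CollatorFactorySketch buildDucet() {
        return build(FROM_DATA, FROM_DATA);
    }

    /** True when both boundaries are left for the data parser to work out. */
    boolean usesDataDerivedBoundaries() {
        return variableHigh == FROM_DATA && firstNonVariable == FROM_DATA;
    }
}
```

The benefit is purely at the call site: `buildDucet()` states the intent, whereas `build(-1, -1)` forces the reader to look up what the sentinels mean.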

* Initializes the collation from a stream of rules in the allkeys.txt format. If the source is
* null, uses the normal Unicode data files, which need to be in BASE_DIR.
*/
public UCA(String sourceFile, String unicodeVersion) throws java.io.IOException {
Member Author

FYI: Same for the UCA constructor. The implementation function now takes two primaries instead of the obsolete class Remap, but I added a convenience constructor for the DUCET, without additional parameters that would have to be "nulled".

@@ -1799,61 +1809,4 @@ public static UCA buildCollator(Remap primaryRemap) {
UCA_Statistics getStatistics() {
return ucaData.statistics;
}

public static final class Remap {
Member Author

FYI: We no longer perform a permutation! 🎉

Comment on lines -1807 to -1808
private int variableHigh;
private int firstDucetNonVariable;
Member Author

FYI: We still need to carry these. The code now just moves these two primaries around rather than the otherwise obsolete Remap object.


public int variableLow = '\uFFFF';
public int nonVariableLow = '\uFFFF'; // HACK '\u089A';
public int variableHigh = '\u0000';
boolean hasExplicitVariableHigh = false;
Member Author

FYI: This being true replaces a test for primaryRemap!=null. It's true for CLDR where the caller provides the variableHigh on the last punctuation character, as opposed to false for the DUCET, where the allkeys.txt parser figures it out from the data.
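
A rough Java sketch of that distinction follows. The class and method names are hypothetical and the logic is a simplification, not the actual unicodetools parser. When the caller supplies the variable-high primary (the CLDR case) it is kept as given; otherwise (the DUCET case) it is derived as the highest primary seen on an element marked variable in the data.

```java
/** Sketch only: tracks the variable-high primary while reading collation elements. */
final class VariableHighSketch {
    private final boolean hasExplicitVariableHigh;
    private int variableHigh;

    /** DUCET-like case: nothing supplied, so the boundary is derived from the data. */
    VariableHighSketch() {
        this.hasExplicitVariableHigh = false;
        this.variableHigh = 0;
    }

    /** CLDR-like case: the caller supplies the boundary up front. */
    VariableHighSketch(int explicitVariableHigh) {
        this.hasExplicitVariableHigh = true;
        this.variableHigh = explicitVariableHigh;
    }

    /** Called once per parsed element; markedVariable corresponds to a leading '*'. */
    void onElement(int primary, boolean markedVariable) {
        if (!hasExplicitVariableHigh && markedVariable && primary > variableHigh) {
            variableHigh = primary;
        }
    }

    int getVariableHigh() {
        return variableHigh;
    }
}
```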

Comment on lines -1867 to -1870
cldrCollator = buildCldrCollator(false);

cldrCollator.overrideCE("\uFFFE", 0x1, 0x20, 2);
cldrCollator.overrideCE("\uFFFF", 0xFFFE, 0x20, 2);
Member Author

FYI: These overrides are both here and inside buildCldrCollator(boolean). The ones inside are used by passing in true. Seems cleaner inside. (See also issue #794)

Member

Minor: better to use a meaningful enum rather than a boolean, so that people can tell immediately what cldrCollator = buildCldrCollator(true); means rather than guess ("does false mean 'don't build'??").
Better to have cldrCollator = buildCldrCollator(UCA.Style.addFFFx);

Not a blocker though!
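
For illustration, the enum-based variant suggested here could be sketched in Java as below. Only the constant name `addFFFx` and the override values for U+FFFE and U+FFFF come from this conversation; the surrounding class, the `PLAIN` constant, and the `overridesFor` method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: an enum parameter in place of a bare boolean. */
final class CldrStyleSketch {
    /** addFFFx is the name proposed in the review; PLAIN is a hypothetical counterpart. */
    enum Style { PLAIN, addFFFx }

    /** Returns the {primary, secondary, tertiary} overrides that a style implies. */
    static Map<String, int[]> overridesFor(Style style) {
        Map<String, int[]> overrides = new LinkedHashMap<>();
        if (style == Style.addFFFx) {
            // Same values as the overrideCE calls shown in the diff.
            overrides.put("\uFFFE", new int[] {0x1, 0x20, 2});
            overrides.put("\uFFFF", new int[] {0xFFFE, 0x20, 2});
        }
        return overrides;
    }
}
```

A call such as `overridesFor(Style.addFFFx)` documents itself, which is the readability gain the comment is asking for.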


-final int oldVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());

+final int ducetVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());
Member Author

FYI: Renamed from oldVariableHigh for clarity.

-final int oldVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());

+final int ducetVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());
+int cldrVariableHigh = 0;
Member Author

FYI: Used to be in class Remap.

case UCD_Types.FORMAT:
if (ducetPrimary >= firstScriptPrimary) {
break;
if (ducetPrimary <= ducetVariableHigh && ducetPrimary > cldrVariableHigh) {
Member Author

FYI: We no longer reorder, but we need to find the last punctuation character for CLDR's tailored (lower) variable high primary.
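
The search described here can be sketched as a single pass that keeps the highest qualifying primary. The Java below is a simplified illustration under stated assumptions: the `Kind` enum and `Item` class are hypothetical stand-ins for the general-category switch in the real code, and only the comparison against `ducetVariableHigh` and `cldrVariableHigh` mirrors the condition shown in the diff.

```java
import java.util.List;

/** Sketch only: finds the highest space/punctuation primary within the DUCET variable range. */
final class CldrVariableHighSketch {
    /** Hypothetical coarse grouping; the real code switches on general categories. */
    enum Kind { SPACE, PUNCTUATION, SYMBOL, CURRENCY, DIGIT, LETTER }

    static final class Item {
        final Kind kind;
        final int ducetPrimary;

        Item(Kind kind, int ducetPrimary) {
            this.kind = kind;
            this.ducetPrimary = ducetPrimary;
        }
    }

    static int findCldrVariableHigh(List<Item> items, int ducetVariableHigh) {
        int cldrVariableHigh = 0;
        for (Item item : items) {
            switch (item.kind) {
                case SPACE:
                case PUNCTUATION:
                    // Same shape as the condition in the diff: stay within the DUCET
                    // variable range and remember the largest primary seen so far.
                    if (item.ducetPrimary <= ducetVariableHigh
                            && item.ducetPrimary > cldrVariableHigh) {
                        cldrVariableHigh = item.ducetPrimary;
                    }
                    break;
                default:
                    break; // other groups do not move the CLDR boundary
            }
        }
        return cldrVariableHigh;
    }
}
```

Because nothing is reordered any more, the result is just a lower cut-off inside the same sequence of primaries rather than a remapped weight.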

primaryRemap
.addItems(spaces)
.addItems(punctuation)
.setVariableHigh()
Member Author

FYI: This is where the old code found the CLDR variableHigh.

@markusicu markusicu marked this pull request as ready for review May 1, 2024 20:19
Contributor

@echeran echeran left a comment

LGTM

Member

@macchiati macchiati left a comment

Looks great. Just one minor note.

@markusicu
Member Author

@macchiati re

better to use a meaningful enum rather than a boolean

I agree, but the boolean was your idea :-)

I will merge this as is, and I already created an issue for whether we need this option at all -- hopefully not, I would like to remove it. --> issue #794

@markusicu markusicu merged commit 5755926 into unicode-org:main May 1, 2024
27 checks passed
@markusicu markusicu deleted the uca16d17-numerics branch May 1, 2024 22:45
Contributor

@Ken-Whistler Ken-Whistler left a comment

Changes for sifter look correct. No comment on the complicated unicodetools UCA changes.
