diff --git a/docs/site/development/development-process/design-proposals/specifying-text-break-variants-in-locale-ids.md b/docs/site/development/development-process/design-proposals/specifying-text-break-variants-in-locale-ids.md new file mode 100644 index 00000000000..193e3414cd5 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/specifying-text-break-variants-in-locale-ids.md @@ -0,0 +1,536 @@ +--- +title: Specifying text break variants in locale IDs +--- + +# Specifying text break variants in locale IDs + +| | | +|---|---| +| Author | Peter Edberg | +| Date | 2014-11-11, last update 2016-10-20 | +| Status | Proposal | +| Feedback to | pedberg (at) apple (dot) com | +| Bugs | See list below | + +This proposal discusses options for extending Unicode locale identifiers to specify text break variants with a locale. It was prompted by CLDR and ICU bugs including the following, as well as by other requests: + +- CLDR #[2142](http://unicode.org/cldr/trac/ticket/2142), Alternate Grapheme Clusters +- CLDR #[2161](http://unicode.org/cldr/trac/ticket/2161), Grapheme break iterator with legacy behavior +- CLDR #[2825](http://unicode.org/cldr/trac/ticket/2825), Add aksha grapheme break +- CLDR #[2975](http://unicode.org/cldr/trac/ticket/2975), Support legacy grapheme break +- CLDR #[4931](http://unicode.org/cldr/trac/ticket/4931), Provide mechanism for parameterizing linebreak, etc. +- CLDR #[7032](http://unicode.org/cldr/trac/ticket/7032), BCP47 for break exceptions +- CLDR #[8204](http://unicode.org/cldr/trac/ticket/8204), Other line break parameterization to support CSS word-break, etc. 
+- ICU #[9379](http://bugs.icu-project.org/trac/ticket/9379), Request to add Japanese linebreak tailoring selectable as variations +- ICU #[11248](http://bugs.icu-project.org/trac/ticket/11248), Improve C/J FilteredBreakIterator, move to draft +- ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530), More efficient representation for multiple line break rule sets +- ICU #[11531](http://bugs.icu-project.org/trac/ticket/11531), Update RBBI TestMonkey to test line break variants +- ICU #[11770](http://bugs.icu-project.org/trac/ticket/11770), BreakIterator should support new locale key "ss" +- ICU #[11771](http://bugs.icu-project.org/trac/ticket/11771), FilteredBreakIterator should move from i18n to common + +## I. Options needed (as known so far) + +### A. Grapheme cluster break + +Need to choose one of the following (current CLDR/ICU implementation uses extended grapheme clusters): + +- Legacy UAX #29 grapheme clusters (also called spacing units). +- Extended UAX #29 grapheme clusters: legacy clusters plus also include spacing combining marks in Indic scripts, and Thai SARA AM and Lao AM (but not other spacing vowels in SE Asian scripts). +- Aksaras (Indic & SE Asian consonant/vowel clusters or syllables): extended clusters plus also include consonant-virama sequences, and spacing vowels in SE Asian scripts. + +### B. Word break + +Currently uses dictionary-based break for sequences in CJK scripts (Han/Kana/Hangul) or SE Asian scripts (LineBreak property value SA/Complex\_Context: Thai, Lao, Khmer, Myanmar, etc.); we need a locale keyword that can turn this on or off (i.e. off to use basic UAX #29 word break), at least for CJK. + +### C. Sentence break + +We need a locale keyword to control use of ULI suppressions data (i.e. to determine whether we should wrap the UAX29-based break iterator in a FilteredBreakIterator instance for the locale, and to determine which suppressions set to use). + +### D. 
Line break (highest priority) + +Currently ICU uses dictionary-based break for text in SE Asian scripts only. The two most important needs for line break control are: + +- For Japanese text, control whether line breaks are allowed before small kana and before the prolonged sound mark 30FC; this corresponds to (most of) the distinction between CSS level 3 strict and normal line break (see below), and is implemented by treating LineBreak property value CJ as either NS (strict) or ID (normal). +- For Korean text, control whether the line break style is E. Asian style (breaks can occur in the middle of words) or “Western” style (breaks are space based), as described in UAX 14. + +Other desirable capabilities include: + +- In a CJK-language context, control over whether breaks are allowed in the middle of words in alphabetic scripts that normally use a space-based approach (e.g. Latin, Greek, Cyrillic). Currently fullwidth Latin letters have LineBreak property value ID and do allow such breaks, but normal Latin letters are AL and do not. +- In a CJK-language context, explicit control over whether characters with LineBreak property value AI resolve to ID or AL (UAX 14 recommends using resolved East Asian Width to do this, but in the absence of that or any other higher-level mechanism they default to AL). This is somewhat related to the previous bullet. Note that characters with value AI include some symbols, punctuation, superscript digits, modifier letters, etc. +- Full control over CSS line break styles, see below (these can be used to control most of the above line break features) + +## II. Notes on CSS level 3 line break + +(from draft of Jun 2015, [http://dev.w3.org/csswg/css-text/#line-breaking](http://dev.w3.org/csswg/css-text/#line-breaking)) + +CSS has two independent properties for controlling line break behavior: + +### A. The line-break property + +This is mainly about break behavior for punctuation and symbols, though it does affect small kana. 
The rules are intended to specify behavior that may be language-specific, but explicit rules are provided for CJK. Besides the “auto” value, there are three specific values for this property. + +- **strict:** The most restrictive rules, for longer lines and/or ragged margins. Prevents break before small kana and before prolonged sound mark 30FC (this is the set of characters with LineBreak property value CJ, which have general category Lo or Lm). +- **normal:** Allows break before small kana and before prolonged sound mark 30FC. If the content language is Chinese or Japanese, also allows breaks before hyphen like characters: ‐ U+2010, – U+2013, ~ U+301C, ゠ U+30A0 (LineBreak property value BA for the first two, NS for the second two; general category Pd for all four). +- **loose:** The least restrictive, used for short lines as in newspapers. In addition to breaks allowed for normal, allows breaks before iteration marks (々 U+3005, 〻 U+303B, ゝ U+309D, ゞ U+309E, ヽ U+30FD, ヾ U+30FE, all with LineBreak property value NS and general category Lm) and breaks between characters with LineBreak property value IN (inseparable). If the content language is Chinese or Japanese, also allows breaks before certain centered punctuation marks, before suffixes and after prefixes. + +### B. The word-break property + +This only controls break opportunities between letter-like characters (including ideographs), and has 3 possible values. Symbols that break in the same way as letters are affected in the same way by these options. + +- **normal:** Words break according to their customary rules. For Korean this specifies E. Asian style break behavior. +- **break-all:** Allow breaks within words (between any two “typographic letter units” of general category L or N) unless forbidden by a line-break setting. This is mainly intended for a primarily-CJK context to allow breaks in the middle of normal Latin, Cyrillic, and Greek words. 
- **keep-all:** Prohibit breaks between letters regardless of line-break options, except where opportunities exist due to dictionary-based break. For Korean this option specifies “western”-style line break. This is also useful when short CJK snippets are included in text that is primarily in a language using space-based breaking.

## III. Proposed -u extension keys

### A. For control of grapheme cluster break

For gb, the current default is extended.

```

```

### B. For control of word break

Will also need a word break parameter key to control whether dictionary-based word break is used; probably need separate control for at least CJ, Korean, and SE Asian scripts; no key proposed yet.

### C. For control of sentence break

(Type key not needed yet and values undetermined, just reserve it)

For ss, the current default is none.

```

```

### D. For control of line break

The current proposal is to use the *type* to specify the CSS line-break property; this can be used in older implementations as e.g. “@lb=strict”. One or more additional parameter keywords are provided to permit control of the CSS word-break property and of whether AI is treated as AL or ID.

D1. Supporting CSS line-break

For lb, the current default is normal for the "ja" locale, but strict for others (it should probably be normal for all, since the distinction is mainly relevant for Japanese), and the discussion below assumes that change.

```

```

D2. Supporting other controls including CSS word-break (for line break), first idea (2014-11)

For the other controls, including support of the CSS word-break property, I think it is best to have separate control over how certain sets of characters are treated:

- Treat Hangul (characters with LineBreak property value H2, H3, JL, JV, JT) per UAX #14 (default, for E. Asian break style), or as AL (for space-based break style, part of CSS word-break=keep-all).
- Treat characters with LineBreak property value ID per UAX #14 (default, for E. Asian break style) or as AL (for space-based break style, part of CSS word-break=keep-all). Is this correct, or is the real goal just to eliminate breaks between ID?
- Treat alphabetic and numeric characters (General Category L and N) per UAX #14 (default), or as ID (to get behavior like CSS word-break=break-all).
- Treat characters with LineBreak property value AI as AL (default per UAX #14) or as ID.

```

```

where LB\_CLASS\_MAP\_CODE is a sequence of one or more of the following codes (separated by - or \_):

- hang2al (treat Hangul as AL)
- id2al (treat ID as AL)
- alnum2id (treat normal alphabetic/numeric as ID)
- ai2id (treat AI as ID)

Then for example CSS level 3 word-break=keep-all could be indicated as “-u-lc-hang2al-id2al”.

D3. Supporting CSS word-break (for line break), second idea (2015-07)

I now think explicit remapping of certain classes is the wrong approach for supporting the CSS word-break options for line break control:

- These options are not defined in terms of UAX #14 LineBreak property values, but rather in terms of general categories L and N.
- The specific definition of the CSS word-break options (and line-break options) may change somewhat over time; we need locale tags that map to the current CSS definition.
- We may *also* want other kinds of line-break controls whose behavior does *not* change, and whose behavior *may* be defined in terms of LineBreak property values (as with the proposal in section D2 above), but that is a separate consideration.

Thus I propose the following.

```

```

### E. Other ideas

For linebreak control:

- CLDR #4931 proposed using “-u-lb-strictja” to specify CSS line-break=strict.
  - Mark Davis suggested that the -lb- keyword could take multiple values, including all of those proposed for the separate -lc- keyword, thus eliminating the need for the -lc- keyword; for example, “-u-lb-strict-hang2al-id2al”.

Overall: Another suggestion goes further than the second bullet above: have just a single keyword to specify all break variants; it would be followed by a list of attributes that would all share a single namespace, and whose names would need to identify which type of break they affected. Examples might include gblegacy, gbextend, gbaksara (use one to specify grapheme break); ssnone, ssstd (use one to specify sentence break suppressions); etc. While this consumes less of the -u keyword namespace, it is less flexible at mapping to values specified in resource attributes, such as different types of sentence break suppression data, unless significant restrictions are placed on those attribute values.

### F. Current status

F1. keyword -lb-

In the CLDR meeting of 2014-Nov-19, it was agreed to add the -lb- keyword with at least the values "strict", "normal" and "loose" for support of the corresponding CSS level 3 behavior; for legacy-style Unicode locale IDs using '@', "lb=" should be used. The implementation details are not yet determined or specified, nor are the details of any locale-specific override behavior.

Current (2015-02-18) work under CLDR #[4931](http://unicode.org/cldr/trac/ticket/4931) and ICU #[9379](http://bugs.icu-project.org/trac/ticket/9379) includes the following, approved in CLDR and ICU meetings:

1. Add new CLDR file common/bcp47/segmentation.xml (name OK?) with the following:

```

```

2. In CLDR file common/dtd/ldmlICU.dtd, add "alt" as an attribute for the \ element (and allow multiple \ elements).
3. In ICU icu/trunk/source/data/xml/brkitr/ files such as root.xml, fi.xml, and ja.xml, add lines mapping the line break types to corresponding rule files, e.g.
in root:

```

```

(Note that we need to add brkitr locales for zh and zh\_Hant since they have non-standard CSS line break types, like ja.)

4. In CLDR, update tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to handle the alts (3 added lines).
5. In ICU, add 6 new line break rule files in source/data/brkitr/ (and delete line\_ja.txt):

```
line_loose.txt
line_loose_cj.txt
line_loose_fi.txt
line_normal.txt
line_normal_cj.txt
line_normal_fi.txt
```

These result in an increase of about 630K bytes (2.5%) in the data file. They can be tailored out in cases for which this is a problem, either by deleting lines from the ICU data/xml/brkitr/ files if building from CLDR data, or by deleting corresponding lines in the data/brkitr/\.txt files and deleting the unused files from BRK\_SOURCE in data/brkitr/brkfiles.mk. #[11530](http://bugs.icu-project.org/trac/ticket/11530) is to investigate a more efficient way of representing the line break rule variants.

Note that the CLDR representation of the line break rules has not yet been updated to match (the CLDR rules are currently ignored when generating ICU data).

6. In ICU4C, update BreakIterator::makeInstance to map the locale to the correct rule set (about 10 lines, not yet committed); a similar change is needed in ICU4J.
7. Update testdata/rbbitst.txt to test the variants. More extensive monkey tests for the variants are covered by #[11531](http://bugs.icu-project.org/trac/ticket/11531).

F2. keywords -ss-, -lw-

Proposal for CLDR & ICU meetings 2015-Jul-08:

1. In CLDR file common/bcp47/segmentation.xml add the following (approved in CLDR meeting 2015-Jul-08):

```

```

The default value is "normal". English names for the values are:

- normal: "Normal line breaks for words"
- breakall: "Allow line breaks in all words"
- keepall: "Prevent line breaks in all words"

```

```

The current default value is "none".
In the future we hope to make the default "standard". English names for the values are:

- none: "Normal sentence breaks per Unicode specification"
- standard: "Prevent sentence breaks after standard abbreviations"

2. In ICU BreakIterator, initial support will be incomplete (details for ICU4C below, similar approach in ICU4J):

a) In ICU4C BreakIterator::makeInstance, for kind = UBRK\_SENTENCE, if the locale has key "ss" with value "standard", then call FilteredBreakIteratorBuilder on the result of BreakIterator::buildInstance to produce a new BreakIterator\* which supports the sentence break exceptions. Notes:

- Currently FilteredBreakIteratorBuilder does not have a way to support different segmentation suppression sets; it only supports the "standard" set.
- A BreakIterator produced in this way currently supports the next() method but not the other BreakIterator methods for moving through text (see [class details](http://icu-project.org/apiref/icu4c/classicu_1_1FilteredBreakIteratorBuilder.html#details)). This should be fixed fairly soon.

b) In ICU4C RuleBasedBreakIterator::handleNext and handlePrevious, for now we can implement an approximation of support for the key "lw" values by altering the character classes as follows (similar to the behavior in section D2 above):

- For "keepall", if the class is Hangul (H2, H3, JL, JV, JT) or ID, remap to AL.
- For "breakall", if the class is AL, HL, AI, or NU, remap to ID.

More complete support is dependent on a mechanism for turning certain rules on and off; see ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530).

## IV. Implementation notes

What I had in mind was that the break type selection (gb, lb) would be implemented by selection of different break table resources, while the parameter keywords (ss, lc) would be implemented in code (changing line break classes, perhaps with an annotation in the tables along the lines suggested in http://unicode.org/cldr/trac/ticket/4931).
However, it is not clear how to implement selection of different tables given the current resource structure in ICU (which does not exactly mirror the CLDR structure). + +### A. CLDR XML structure + +Currently in CLDR we can have a structure locale-specific break iterator data icu/trunk/source/data/xml/brkitr/xx.xml as follows; except for the suppressions data, this is otherwise ignored for building ICU data (segmentation type is GraphemeClusterBreak, WordBreak, LineBreak, SentenceBreak): + +``` + + + + + + + + + …. + + + + + + …. + + + + + + + + … + + to specify the specific variant (corresponds to the value for the -gb or -lb keyword, for example), though this would currently be ignored for LDML to ICU conversion: + +`````` + +Handling of default values and elements without "alt" is discussed in section E below. + +### B. ICU XML source structure + +In ICU we have XML source data and generated txt data. The XML source structure is specified by + +http://www.unicode.org/repos/cldr/trunk/common/dtd/ldmlICU.dtd + +and currently looks like this for root (any locale-specific data uses a subset of this): + +``` + + + + + + + + + + + or e.g. "line_xx.brk" in locale-specific data + + … + + + + + + + + + + … + + + + + + + + +``` + +Note that the following attributes for the boundaries subelements (icu:word etc.) are defined in CLDR’s ICU DTD but currently unused: + +```icu:class NMTOKEN #IMPLIED``` + +```icu:append NMTOKEN #IMPLIED``` + +```icu:import NMTOKEN #IMPLIED``` + +We could define an additional attribute "alt" and then use that to match the CLDR \ alt attribute: + +``` + + + + + + + + + … + + + + + + + + … + + +``` + +### C. 
ICU txt resource structure + +The ICU xml files (and the CLDR xml files, for suppressions data) are processed by CLDR tools such as cldr/trunk/tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to generate the text resources, for example: + +``` +root{ + + boundaries{ + + grapheme:process(dependency){"char.brk"} + + line:process(dependency){"line.brk"} + + … + + word:process(dependency){"word.brk"} + + } + + dictionaries{ + + Hani:process(dependency){"cjdict.dict"} + + Hira:process(dependency){"cjdict.dict"} + + ... + + Thai:process(dependency){"thaidict.dict"} + + } + +} + +xx{ + + boundaries{ + + line:process(dependency){"line_xx.brk"} + + } + + exceptions{ + + SentenceBreak:array{ + + "Mr.", + + "Etc.", + + … + + } + + } + +} +``` + +These files are read by BreakIterator::buildInstance(...) in ICU4C, with a type parameter that maps directly to the key in the boundaries resource: "grapheme", "line", etc. Currently there is not a way to add attributes for the boundaries subelements such as line or word. However, we could map the icu:alt values proposed in section C to resource keys with extensions where appropriate: + +``` +boundaries{ + + grapheme:process(dependency){"char.brk"} + + grapheme_extended:process(dependency){"char.brk"} + + grapheme_legacy:process(dependency){"char_legacy.brk"} + + … + + line:process(dependency){"line.brk"} + + line_normal:process(dependency){"line.brk"} + + line_strict:process(dependency){"line_strict.brk"} + + … + +} +``` + +BreakIterator::buildInstance is called by BreakIterator::makeInstance, which provides the type keys "grapheme", "line", etc. It could use the locale to construct the resource keys with extensions. + +### D. 
Current dictionary break implementation

(See also the [relevant section of the ICU User Guide](http://userguide.icu-project.org/boundaryanalysis#TOC-Details-about-Dictionary-Based-Break-Iteration).)

The use of dictionary break depends on the existence in the rules of a variable "$dictionary" which defines the UnicodeSet of characters for which dictionary break should be used.

For line break, this is defined as `$dictionary = [:LineBreak = Complex_Context:];`, where the Line\_Break property value Complex\_Context is equivalent to SA and applies to most letters, marks, and some other signs in Southeast Asian scripts: Thai, Lao, Myanmar, Khmer, Tai Le, New Tai Lue, Tai Tham, Tai Viet, etc. For word break, in addition to characters with Line\_Break property value SA, the $dictionary set includes characters with script Han, Hiragana, or Katakana, as well as composed Hangul syllables in the range \uAC00-\uD7A3 (not sure why the latter are included, since we do not have dictionary support for them).

In both cases, the rules are defined to disallow breaks between characters in the $dictionary set. When determining the next or previous break, the iterator first determines the break using the normal rules (which will not break between characters in the $dictionary set); in the process it marks which characters are handled by a dictionary break engine (for each script that has a break dictionary, the associated break engine defines a more specific set of characters to which it applies). If characters handled by a dictionary break engine were encountered, the break iterator then invokes the dictionary break engines to determine breaks within the $dictionary-set span.

### E. Multiple rule sets that depend on break type

It would be nice for a given locale to be able to specify, for each break type, which variant is the default for that locale. In root this can just be done by using the resource key without any extension.
In other locales, we could do something like this in the CLDR XML:

```

```

## V. Acknowledgments

Thanks to Koji Ishii and the CLDR team for feedback on this document.

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file

diff --git a/docs/site/development/development-process/design-proposals/suggested-exemplar-revisions.md b/docs/site/development/development-process/design-proposals/suggested-exemplar-revisions.md
new file mode 100644
index 00000000000..78e06c08844
--- /dev/null
+++ b/docs/site/development/development-process/design-proposals/suggested-exemplar-revisions.md
@@ -0,0 +1,107 @@
---
title: Suggested Exemplar Revisions
---

# Suggested Exemplar Revisions

I've been doing some analysis of character frequency, and on that basis have the following recommendations for changes to the exemplar characters.

As a reminder, the **main** exemplar characters are those that are required for native words in modern customary use. For example, "a-z" suffice for English. The **aux** exemplar characters include other characters that would not be unexpected in common non-technical publications: those that are native but not required (eg, *ö* as in *coöperate*, or "pronounced *dāvĭs*"), and foreign loan words (eg, *résumé*). [A useful source of aux letters is newspaper style guidelines.]

There is a breakdown on http://www.unicode.org/repos/cldr-tmp/trunk/dropbox/mark/exemplars/summary.html. Note the following:

- Characters (more precisely, combining character sequences) are given in rough frequency order, in boxes colored according to the relationship to the CLDR exemplar sets.
- Some of the changes below have already been incorporated (marked in italic). Note that the characters inside the boxes are not ordered by frequency, but the boxes are (rightmost containing more frequent characters).
+- There is some noise in the system, so don't give too much weight to characters towards the left (they are all trimmed to no more than 1000 characters), or for languages without much presence on the web. +- The characters are partially normalized (width, Arabic shapes, NFC). +- The characters are from a sample of the web, about 800M docs, and 5T characters. + +Here are my suggestions. Please send feedback to [mark@macchiato.com](mailto:mark@macchiato.com) with any other suggestions, or add comments to http://unicode.org/cldr/trac/ticket/2789. + +### Suggestions + +1. For all Latin-script languages, add [a-z] into the aux set. *For others, check that any Latin script characters are deliberate, and either include all of a-z or none (I found in Tamil: \[a g i m t]\)* +2. zh shows the following as high-frequency characters but not in exemplars 网 机 产 册 没 只 帖 万 ... . Consider adding to main. There are some other high frequency characters in aux, that probably should be in main: 线 录 户 +3. I collected some draft info on languages\* we don't have, expressed in code at the end of this document. Consider adding locales to at least encompass this information. +4. ja doesn't include the following, should probably be in aux: 岡 阪 奈 藤 俺 伊 誰... +5. es should have ª in main, check that aux covers (French/German/Danish/Port.) +6. de *should have ß in main*, check aux covers French/Spanish/Danish/Turkish +7. pt should have ª º in main, French/Spanish/German/Danish in main +8. ko should have jamo in aux, and: 中 人 北 大 女 完 文 日 本 的 美 語... +9. it should check French/German/Danish in aux +10. In general, we should see which languages follow the convention of using trema to separate digraph vowels (eg naïve), and add the 6 vowels with trema to aux, at least. +11. in aux should cover French/Dutch +12. tr aux should cover French/German +13. zh-Hant: should look at 只 帖 搜 壇 .. +14. nl aux should cover French/Spanish/German/Danish +15. pl aux should cover French/German +16. 
fil aux should cover Spanish/French +17. qu\* aux should cover Spanish +18. hu aux should cover German +19. el aux should cover polytonic greek +20. fi aux should cover French, German +21. We should include non-Western decimal digits into the corresponding exemplars +22. fa aux should include Arabic; ar aux should include Persian +23. da aux should cover French/German/Spanish +24. ca aux should cover Spanish +25. All Cyrillic aux should cover Russian +26. eu (Basque) aux should cover Spanish/French +27. ku-Arab aux should cover Arabic/Persian +28. br (Breton) aux should cover French + +### Additional exemplar sets + +- qu - Quechua [pt{ch}kq{pʼ}{tʼ}{chʼ}{kʼ}{qʼ}{ph}{th}{chh}{kh}{qh}s{sh}hmnjl{ll}rwyñaiu] +- co - Corsican [abc{chj}defg{ghj}hijlmnopqrstuvz] +- fy - West Frisian [a b c d e f g h i y j k l m n o p q r s t u v w x zâ ê é ô û ú] +- bho - Bhojpuri [:sc=deva:] +- gd - Scottish Gaelic [abcdefghilmnoprstuàèìòù], aux: [á é ó] +- ht - Haitian Creole [a{an}b{ch}de{en}èfgijklmnoò{on}{ou}prst{tch}vwyz] +- jv - Javanese [a b c d e é è f g h i j k l m n o p q r s t u v w x y z] +- la - Latin [abcdefghiklmnopqrstuxyz] +- lb - Luxembourgish "[a-z é ä ë] +- sd - Sindhi [ا ب ٻ پ ڀ ت ث ٺ ٽ ٿ ج ڃ ڄ چ ڇ ح-ذ ڊ ڌ ڍ ڏ ر ز ڙ س-غ ف ڦ ق ک ڪ گ ڱ ڳ ل-ن ڻ ه ھ و ي] +- su - Sundanese [aeiouépbtdkgcjh{ng}{ny}mnswlry] +- gn - Guaraní = gug [a-vx-zá é í ó ú ý ñ ã ẽ ĩ õ ũ ỹ {g\u0303}] + +From Bug 1947, for reference. + +The exemplar character set for ja appears to be too small. + +1. It contains about 2,000 characters (Kanji, Hiragana and Katakana). +2. If Exemplar Character set is limited to the most widely used one (Level 1 +Kanji? in JIS X 208), I expected Auxiliary Exemplar Character set to contain the + +rest of + +JIS X 0208 (plus JIS X 212 / 213). However, it contains only 5 characters. + +3. It does not contain \ ('composed Katakana letters'), U+30FB and U+30FC (conjunction and length marks). 
For instance, characters like U+4EDD, U+66D9, U+7DBE are not included although they're used in Japanese IDN names (which is an indicator that they're pretty widely used. See )

While I was at it, I also looked at zh\* and ko. All of them have about 2000 characters (in the case of ko, 2350, which is the number of Hangul syllables in KS X 1001). The auxiliary sets for zh\* have only tens of characters (26 for zh\_Hans and 33 for zh\_Hant).

It's rather inconvenient to type hundreds (if not thousands) of characters in the CLDR survey tool. Perhaps we have to fill in those values ('candidate sets' for vetting) using CVS before the next round of the CLDR survey.

...

Jungshik and I discussed this, and there are three possible sources (for each of Chinese (S+T), Japanese, and Korean) that we could tie the exemplars to:

1. Charsets (in the case of Japanese, this would probably be JIS 208 + 212 + 213). This would be a large set, and contain many rarely-used characters.

1a. Only use JIS 208. (The current approach appears to be JIS 208, but only level 1.)

2. Use the educational standards in each country/territory for primary+secondary requirements. We'd have to look up sources for these.

3. Use the NIC restrictions for each country.

These would all overlap to a large degree, but wouldn't be the same. One possibility is to issue a PRI for public review.

There is a fourth possibility: use the characters that are supported by the commonly-used fonts on various platforms for these languages (e.g. the characters that are in the cmaps for TrueType fonts).

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file

diff --git a/docs/site/development/development-process/design-proposals/supported-numberingsystems.md b/docs/site/development/development-process/design-proposals/supported-numberingsystems.md
new file mode 100644
index 00000000000..6dbd6cc9d06
--- /dev/null
+++ b/docs/site/development/development-process/design-proposals/supported-numberingsystems.md
@@ -0,0 +1,760 @@
---
title: Supported NumberingSystems
---

# Supported NumberingSystems

Per tickets #3516 and #4097, we need a way to specify which numbering systems are supported in a particular locale.

We currently have only a single field that defines the default numbering system for a locale, as follows:

`<defaultNumberingSystem>latn</defaultNumberingSystem>`

There are other categories of numbering systems that should be defined on a per-locale basis, so that programmers can access a certain type of numbering system without necessarily knowing the specific numbering system in place.

This proposal replaces the current "defaultNumberingSystem" field with a series of fields that denote the different categories of numbering systems that might be desired.
Although numbering systems could be categorized in a number of ways, the most common groupings would be as follows:

`<default>` - The default numbering system to be used for formatting numbers in the locale.

`<native>` - Numbering system using native digits. The "native" numbering system can only be a numeric numbering system, containing the native digits used in the locale.

`<traditional>` - The traditional or historic numbering system. Algorithmic systems are allowed in the "traditional" system.

- May be the same as "native" for some locales, but it may be different for others, such as Tamil or Chinese.
- If "traditional" is not explicitly specified, fall back to "native".

`<financial>` - Special numbering system used for financial quantities. If "financial" is not explicitly specified, fall back to "default".

**BCP 47 - Locale keywords**

default - No keyword is required.

native - native ( Example: ar-MA-u-nu-native is an Arabic locale for Morocco, but using native digits ).

traditional - traditio ( Example: ta-IN-u-nu-traditio is a Tamil locale for India, using traditional numerals ).

financial - finance ( Example: zh-Hant-TW-u-nu-finance would be a Chinese locale in Traditional Han script for Taiwan, using financial numbers ).
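The category fallback described above (traditional falls back to native, financial falls back to default) can be sketched as follows. This is an illustrative sketch only, not actual CLDR resolution code; the table entries are examples patterned on the per-locale data, and locale inheritance is not modeled.

```python
# Hypothetical per-locale numbering-system categories (illustrative sample,
# not actual CLDR data). Missing categories trigger the proposed fallbacks.
SEED = {
    "root": {"default": "latn", "native": "latn"},
    "ta":   {"default": "latn", "native": "tamldec", "traditional": "taml"},
    "zh":   {"default": "latn", "native": "hanidec",
             "traditional": "hans", "financial": "hansfin"},
}

def numbering_system(locale, category="default"):
    """Resolve a numbering-system category with the proposed fallbacks:
    traditional -> native, financial -> default."""
    systems = SEED.get(locale, SEED["root"])
    if category in systems:
        return systems[category]
    if category == "traditional":   # not specified: fall back to native
        return numbering_system(locale, "native")
    if category == "financial":     # not specified: fall back to default
        return numbering_system(locale, "default")
    return systems["default"]

print(numbering_system("ta", "traditional"))  # taml (explicitly specified)
print(numbering_system("ta", "financial"))    # latn (falls back to default)
print(numbering_system("zh", "financial"))    # hansfin
```

A real implementation would also apply locale inheritance (e.g. ar\_DZ inheriting native="arab" from ar) before falling back across categories.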
## Proposed seed data for numbering systems

Each locale's file would specify the default numbering system plus the other categories where applicable. For example, **am.xml** would carry (element names following the "defaultNumberingSystem"/category structure described above):

```
<defaultNumberingSystem>latn</defaultNumberingSystem>
<otherNumberingSystems>
    <native>latn</native>
    <traditional>ethi</traditional>
</otherNumberingSystems>
```

The proposed per-locale values:

| File | default | native | traditional | finance |
|---|---|---|---|---|
| root.xml | latn | latn | | |
| am.xml | latn | latn | ethi | |
| ar.xml | arab | arab | | |
| ar\_DZ.xml | latn | (arab, inherited from "ar") | | |
| ar\_MA.xml | latn | (arab, inherited from "ar") | | |
| ar\_TN.xml | latn | (arab, inherited from "ar") | | |
| as.xml | latn | beng | | |
| bn.xml | latn | beng | | |
| bo.xml | latn | tibt | | |
| brx.xml | latn | deva | | |
| byn.xml | latn | latn | ethi | |
| el.xml | latn | latn | grek | |
| fa.xml | arabext | arabext | | |
| gu.xml | latn | gujr | | |
| he.xml | latn | latn | hebr | |
| hi.xml | latn | deva | | |
| hy.xml | latn | latn | armn | |
| ja.xml | latn | hanidec | jpan | jpanfin |
| ka.xml | latn | latn | geor | |
| km.xml | latn | khmr | | |
| kn.xml | latn | knda | | |
| kok.xml | latn | deva | | |
| ku.xml | arab | arab | | |
| lo.xml | latn | laoo | | |
| ml.xml | latn | mlym | | |
| mr.xml | latn | deva | | |
| mn\_Mong.xml | latn | mong | | |
| my.xml | mymr | mymr | | |
| ne.xml | latn | deva | | |
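The inheritance noted for ar\_DZ, ar\_MA, and ar\_TN (native="arab" coming from the "ar" locale) follows the normal locale truncation chain. A rough sketch of that resolution, with a made-up `resolve` helper and sample data only:

```python
# Locale-chain inheritance sketch: a category missing from a locale's own
# data is looked up in its parent (ar_DZ -> ar -> root). Sample data only.
SEED = {
    "root":  {"default": "latn", "native": "latn"},
    "ar":    {"default": "arab", "native": "arab"},
    "ar_DZ": {"default": "latn"},   # native is inherited from "ar"
}

def resolve(locale, category):
    # Build the truncation chain, e.g. ar_DZ -> ar -> root.
    chain = [locale]
    while "_" in locale:
        locale = locale.rsplit("_", 1)[0]
        chain.append(locale)
    chain.append("root")
    # Return the first value found along the chain.
    for loc in chain:
        if category in SEED.get(loc, {}):
            return SEED[loc][category]
    return None

print(resolve("ar_DZ", "default"))  # latn (its own value)
print(resolve("ar_DZ", "native"))   # arab (inherited from "ar")
```

This is why the Maghreb locales only need to override "default": the Arabic-script native digits come along for free from the parent.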
| File | default | native | traditional | finance |
|---|---|---|---|---|
| om.xml | latn | latn | ethi | |
| or.xml | latn | orya | | |
| pa.xml | latn | guru | | |
| pa\_Arab.xml | arabext | arabext | | |
| ta.xml | latn | tamldec | taml | |
| te.xml | latn | telu | | |
| th.xml | latn | thai | | |
| ti.xml | latn | latn | ethi | |
| tig.xml | latn | latn | ethi | |
| ur.xml | latn | arabext | | |
| ur\_IN.xml | arabext | arabext | | |
| wal.xml | latn | latn | ethi | |
| zh.xml | latn | hanidec | hans | hansfin |
| zh\_Hant.xml | latn | hanidec | hant | hantfin |

---

*(A separate proposal on time zone data organization begins here; its title, opening section, and the XML samples for items b) through g) did not survive conversion.)*

**b) metazone -> golden zone**

supplementalData.xml

**c) historic metazone mapping**

metazoneInfo.xml

**d) zone -> territory**

supplementalData.xml

**e) territories where multiple time zones are available**

supplementalData.xml

**f) Mapping between Olson ID and Unicode time zone short ID (bcp47 ids)**

(1) bcp47/timezone.xml: Short ID -> Olson ID

(2) supplementalData.xml: Olson ID -> Short ID

**g) Windows tzid mapping**

supplementalData.xml

### High Level Proposal

- bcp47/timezone.xml must be there, because all keyword keys/types must reside in bcp47/\*.xml. Therefore, the f) (2) mapping in supplementalData.xml should be deprecated (it is an inverse mapping table).
- The set of canonical Unicode time zone IDs is defined by bcp47/timezone.xml.
Because we do not want to maintain the set in multiple places, long ID aliases could be embedded in bcp47/timezone.xml.
- Metazone tables (b and c) should be in a single file.
- Territory mapping is almost equivalent to zone.tab in tzdata (minor exception: zone.tab does not include deprecated zones). It is not worth maintaining this data in CLDR; therefore, d) and e) should be deprecated, with no corresponding data in 1.8.
- Windows tz mapping is independently maintained; it should be moved into a new file. Side note: a single Windows tz could be mapped to multiple zones in the future.

### New Data Organization

1. bcp47/timezone.xml

Add long aliases - for example, the entry for the short ID `usnyc` could carry an alias giving the corresponding long Olson ID, `America/New_York`.

2. metazoneInfo.xml

Add b) into this file.

3. wintz.xml (new)

Store only g).

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/transform-fallback.md b/docs/site/development/development-process/design-proposals/transform-fallback.md new file mode 100644 index 00000000000..4d3e1d06f26 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/transform-fallback.md

---
title: Transform Fallback
---

# Transform Fallback

We need to describe more clearly the presumed lookup fallback for transforms:

## Code equivalence

- A lone script code or long script name is equivalent to the BCP 47 syntax: Latn = Latin = und-Latn.
- "und" from BCP 47 is treated the same as the special code "any" in transform IDs.
- In the unlikely event that we have a collision between a special transform code (any, hex, fullwidth, etc.) and a BCP 47 language code, we have to figure out what to do. Initial suggestion: add "\_ZZ" to the language code.
- For the special codes, we should probably switch to aliases that have a low probability of collision, e.g. always more than 3 letters.
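These equivalence rules amount to a normalization step before lookup. A sketch of one way to do it (the `normalize` helper and the small script-name table are illustrative, not an existing API):

```python
# Normalize a transform source/target per the equivalences above:
# a long script name or lone script code becomes und-<Script>,
# and "und" is replaced by the special transform code "any".
SCRIPT_NAMES = {"latin": "Latn", "cyrillic": "Cyrl", "greek": "Grek"}  # sample

def normalize(code):
    code = code.replace("_", "-")           # accept either separator
    if code.lower() in SCRIPT_NAMES:        # long script name, e.g. "Latin"
        code = SCRIPT_NAMES[code.lower()]
    parts = code.split("-")
    if len(parts[0]) == 4 and parts[0].isalpha():   # lone script code
        parts.insert(0, "und")
    if parts[0].lower() == "und":
        parts[0] = "any"                    # "und" == "any" in transform IDs
    return "-".join(parts)

print(normalize("Latin"))     # any-Latn
print(normalize("Latn"))      # any-Latn
print(normalize("und-Latn"))  # any-Latn
```

After normalization, Latn, Latin, and und-Latn all hit the same entry in the transform registry.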
## Language tag fallback

If the source or target is a Unicode language ID, then a fallback chain is followed, with some additions. For example, for az\_Arab\_IR:

1. az\_Arab\_IR
2. az\_Arab
3. az\_IR
4. az
5. Arab
6. Cyrl

The fallback additions are:

- We also fall back through the country (item 3). This is along the lines we've otherwise discussed for BCP 47 support, and we should clarify it in the spec.
- Once the bare language is reached, we fall back to script: first the specified script, if there is one (item 5), then the likely script for the language (item 6, if different from item 5).

## Laddered fallback

The source, target, and variant use "laddered" fallback. That is, in pseudocode:

```
for variant in variant-chain
  for target in target-chain
    for source in source-chain
      transform = lookup source-target/variant
      if transform != null return transform
```

For example, here is the chain for ru\_RU-el\_GR/BGN. I'm spacing out the source, target, and variant for clarity.

1. ru\_RU - el\_GR /BGN
2. ru - el\_GR /BGN
3. Cyrl - el\_GR /BGN
4. ru\_RU - el /BGN
5. ru - el /BGN
6. Cyrl - el /BGN
7. ru\_RU - Grek /BGN
8. ru - Grek /BGN
9. Cyrl - Grek /BGN
10. ru\_RU - el\_GR
11. ru - el\_GR
12. Cyrl - el\_GR
13. ru\_RU - el
14. ru - el
15. Cyrl - el
16. ru\_RU - Grek
17. ru - Grek
18. Cyrl - Grek

**Comments:**

1. The above is not how the ICU code works. That code actually discards the variant if an exact match is not found, so items 2-9 are not queried at all. I think that is definitely a mistake.
2. Personally, I think the above chain might not be optimal; it might be better for BGN to be stronger than a country difference, but not as strong as script. However, in conversations with Markus, I was convinced that a simple story for how it works is probably best, and the above is simpler to explain and easier to implement.

## Model Requirements

We have the implicit requirement that no variant is populated unless there is a no-variant version.
We need to make sure that this is maintained by the build tools and/or tests. That is, if we have fa-Latn/BGN, we should have fa-Latn as well. The other piece of this is that we should name all the no-variant versions, so that people can be explicit about the variant even if we change the default later on. The upshot is that each no-variant version should always just be an alias to one of the variant versions. Operationally, that means the following actions:

Case 1: only fa-Latn/BGN exists. Add an alias from fa-Latn to fa-Latn/BGN.

Case 2: only foo-Latn exists. Rename it to foo-Latn/SOMETHING, and then do Case 1.

![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file